Introduction
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs.
To fill this gap, we introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. As illustrated in the figure below, SuperMemory-VQA emphasizes four critical properties missing from prior benchmarks: Comprehensive Memory Tasks, Long-Horizon Context, Multi-Evidence Reasoning, and Realistic Question Answers.
Comprehensive Memory Tasks: Covers practical personal, social, spatial, conversational, and timeline-oriented memory needs that arise in everyday egocentric recordings.
Long-Horizon Context: Questions are grounded in recordings lasting hours and sometimes spanning days, exceeding the practical context limits of current VLMs.
Multi-Evidence Reasoning: Many questions link sparse evidence across disjoint moments (e.g., a spoken plan and a later action), demanding retrieval and temporal abstraction.
Realistic Question Answers: Ordered answer choices (correct > vague > wrong > unanswerable) test whether models avoid confident hallucination when evidence is missing.
Dataset
SuperMemory-VQA contains 52.9 hours of egocentric recordings of everyday activities, collected from ten participants wearing Gen 1 Meta Aria glasses. The recordings are multimodal: egocentric RGB video (1408×1408, 30 fps), dual SLAM streams (640×480, 30 fps), eye-tracking (320×240, 60 fps), and 7-channel audio (48 kHz). Participants recorded indoors and outdoors following generic scripts such as cooking a recipe, playing board games, doing chores, and having conversations. Each participant contributed 3–12 hours across multiple sessions, with three participants recording across multiple days (up to two weeks).
Based on the recordings, we construct 4,853 grounded QA pairs through a human-verified annotation pipeline. The pairs cover six commonly encountered memory tasks: Object and Location Memory, Intent Recall, Visual Scene Recall, Timeline Reconstruction, Conversational Memory, and In-Context Retrieval. Each question is posed as a multiple-choice item with ordered answer choices (accurate, vague, incorrect, and an explicit unanswerable option) so that safe abstention can be distinguished from confident hallucination.
Object and Location Memory: Recover the last known location and trajectory of objects across time and rooms.
Intent Recall: Recover stated or implied goals, reminders, and intended future actions.
Visual Scene Recall: Retrieve visual details such as text, screens, ingredients, manuals, and scene contents.
Timeline Reconstruction: Sequence events chronologically and reason over how actions unfolded.
Conversational Memory: Recall spoken facts, commitments, deferred answers, and mid-conversation corrections.
In-Context Retrieval: Combine current context with earlier facts and associations in the recording history.
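To make the item format concrete, here is a minimal sketch of how one such multiple-choice item could be represented. The names `ChoiceKind`, `QAItem`, `gold`, and `classify_response`, as well as the sample question, are hypothetical and not taken from the released dataset. The sketch also assumes that every item carries one choice of each kind and that the gold option of an unanswerable question is the unanswerable choice.

```python
from dataclasses import dataclass
from enum import Enum

# The six memory tasks named in the dataset description.
TASKS = (
    "Object and Location Memory", "Intent Recall", "Visual Scene Recall",
    "Timeline Reconstruction", "Conversational Memory", "In-Context Retrieval",
)


class ChoiceKind(Enum):
    ACCURATE = "accurate"
    VAGUE = "vague"
    INCORRECT = "incorrect"
    UNANSWERABLE = "unanswerable"


@dataclass(frozen=True)
class QAItem:
    task: str
    question: str
    choices: dict      # maps ChoiceKind -> answer text (assumed layout)
    gold: ChoiceKind   # ACCURATE, or UNANSWERABLE when evidence is missing

    def __post_init__(self):
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if set(self.choices) != set(ChoiceKind):
            raise ValueError("an item needs exactly one choice of each kind")
        if self.gold not in (ChoiceKind.ACCURATE, ChoiceKind.UNANSWERABLE):
            raise ValueError("gold must be ACCURATE or UNANSWERABLE")


def classify_response(item: QAItem, picked: ChoiceKind) -> str:
    """Separate safe abstention from hallucination for one model response."""
    if picked is item.gold:
        return "safe abstention" if picked is ChoiceKind.UNANSWERABLE else "correct"
    if item.gold is ChoiceKind.UNANSWERABLE:
        return "hallucination"  # answered although no evidence exists
    return "wrong"              # vague, incorrect, or unwarranted abstention


# Invented example item, for illustration only.
item = QAItem(
    task="Object and Location Memory",
    question="Where did I leave my keys?",
    choices={
        ChoiceKind.ACCURATE: "On the kitchen counter next to the kettle.",
        ChoiceKind.VAGUE: "Somewhere in the kitchen.",
        ChoiceKind.INCORRECT: "In the car.",
        ChoiceKind.UNANSWERABLE: "The recording does not show this.",
    },
    gold=ChoiceKind.UNANSWERABLE,
)
```

Under these assumptions, picking the unanswerable option for `item` counts as safe abstention, while picking any other option counts as hallucination.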
Examples
Leaderboard
We evaluate two state-of-the-art frameworks for long-form video understanding, each illustrating a distinct strategy for managing long-horizon context, paired with a diverse set of open- and closed-source VLMs.
Video-RAG (VR): A training-free, single-turn retrieval-augmented framework that augments a VLM with auxiliary text from the source video. Three databases are precomputed per session: ASR transcripts (WhisperX), OCR text (EasyOCR), and object detections on CLIP-selected keyframes (APE). At inference, the VLM decomposes the query, each database is queried via FAISS over Contriever embeddings, and the retrieved text is concatenated with 32 sampled frames.
EgoButler (EB): Proposed alongside EgoLife and the closest in spirit to SuperMemory-VQA, EgoButler pairs EgoGPT (an omni-modal captioner) with EgoRAG, which recursively summarizes dense visual–audio captions into hour- and day-level digests to form a hierarchical memory bank. At query time it performs coarse-to-fine temporal localization, retrieving summaries first and then narrowing to clips.
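The two context-management strategies described above can be contrasted with a toy sketch. This is not the implementation of either framework: a simple token-overlap score stands in for the learned embeddings and FAISS index, and the names `overlap`, `retrieve_flat`, `Node`, and `retrieve_coarse_to_fine`, along with all sample text, are hypothetical.

```python
from dataclasses import dataclass, field


def overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


# Strategy 1: single-turn retrieval over flat per-session text databases.
def retrieve_flat(query: str, databases: dict, k: int = 1) -> list:
    """Query each database independently and concatenate the top-k hits."""
    hits = []
    for name, entries in databases.items():
        ranked = sorted(entries, key=lambda e: overlap(query, e), reverse=True)
        hits += [f"[{name}] {e}" for e in ranked[:k]]
    return hits


# Strategy 2: coarse-to-fine retrieval over a hierarchical memory bank.
@dataclass
class Node:
    summary: str
    children: list = field(default_factory=list)  # empty for clip-level leaves


def retrieve_coarse_to_fine(query: str, roots: list, k: int = 1) -> list:
    """Pick the best summaries at each level, then descend into their children."""
    level = roots
    while True:
        best = sorted(level, key=lambda n: overlap(query, n.summary), reverse=True)[:k]
        children = [c for n in best for c in n.children]
        if not children:
            return [n.summary for n in best]
        level = children


# Invented sample data, for illustration only.
databases = {
    "asr": ["remind me to buy milk tomorrow", "the game starts at noon"],
    "ocr": ["preheat oven to 200", "milk 2 percent"],
}
memory_bank = [
    Node("cooking pasta and playing board games", [
        Node("cooking pasta in the kitchen",
             [Node("boiling water for pasta"), Node("adding salt to the pot")]),
        Node("playing board games in the living room",
             [Node("setting up the board"), Node("rolling dice")]),
    ]),
    Node("doing laundry and chores", [Node("folding clothes in the bedroom")]),
]
```

In the flat strategy every database contributes text regardless of where in the session it came from, whereas the hierarchical strategy commits to a day, then an hour, then a clip. An error at a coarse level therefore cannot be recovered at a finer one unless `k` is increased.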
Answerability F1 (Ans-F1): F1 of the binary decision of whether a question is answerable from the available evidence or should receive the unanswerable option.
Accuracy (Acc.): Four-way multiple-choice accuracy. Only the ground-truth correct option earns credit; vague answers, wrong answers, and incorrect abstentions all count as wrong.
Mean Reciprocal Rank (MRR): Mean reciprocal rank of the ground-truth option under the model's ordered scores, rewarding models that rank the correct answer above vague, wrong, and unsupported alternatives.
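The three metrics can be sketched as follows. This is a minimal illustration rather than the official evaluation script. It assumes that each model output is a dictionary of per-option scores, that the predicted option is the highest-scoring one, and that "answerable" is the positive class for F1. The option labels (`"A"` to `"C"`, `"UNANS"`) and the function names are hypothetical.

```python
UNANS = "UNANS"  # hypothetical label for the explicit unanswerable option


def predictions(scores: list) -> list:
    """Predicted option per question: the one with the highest score."""
    return [max(s, key=s.get) for s in scores]


def answerability_f1(gold: list, pred: list) -> float:
    """F1 of the binary answerable/unanswerable decision (answerable = positive)."""
    g = [x != UNANS for x in gold]
    p = [x != UNANS for x in pred]
    tp = sum(a and b for a, b in zip(g, p))
    fp = sum((not a) and b for a, b in zip(g, p))
    fn = sum(a and (not b) for a, b in zip(g, p))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(gold: list, pred: list) -> float:
    """Share of questions where the predicted option is the gold option."""
    return sum(a == b for a, b in zip(gold, pred)) / len(gold)


def mean_reciprocal_rank(gold: list, scores: list) -> float:
    """Average of 1 / rank of the gold option in the score-sorted option list."""
    total = 0.0
    for g, s in zip(gold, scores):
        ranking = sorted(s, key=s.get, reverse=True)
        total += 1.0 / (ranking.index(g) + 1)
    return total / len(gold)


# Invented scores for three questions; the second one is unanswerable.
gold = ["A", UNANS, "C"]
scores = [
    {"A": 0.7, "B": 0.1, "C": 0.1, UNANS: 0.1},
    {"A": 0.5, "B": 0.1, "C": 0.1, UNANS: 0.3},
    {"A": 0.1, "B": 0.2, "C": 0.6, UNANS: 0.1},
]
pred = predictions(scores)  # ["A", "A", "C"]
```

On this toy input the model answers the unanswerable question, so accuracy is 2/3 and Ans-F1 is 0.8. MRR is higher (about 0.83) because the unanswerable option was still ranked second.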
Answerability, accuracy, and ranking across two retrieval pipelines
| Model | VR Ans-F1 | VR Acc. | VR MRR | EB Ans-F1 | EB Acc. | EB MRR |
|---|---|---|---|---|---|---|
Findings
Our benchmarking results on state-of-the-art frameworks and VLM backbones reveal that existing systems remain far from reliable on real-world long-horizon memory tasks.
Participant Survey
Eight participants reviewed 18 questions generated from their own recordings and rated seven statements about question quality and utility. Responses were strongly positive, with no disagreement across statements.
Citation
@misc{alam2026supermemoryvqa,
title = {SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory},
author = {Alam, Samiul and Siam, Shakhrul Iman and Proulx, Michael J. and Fort, James and Newcombe, Richard and Kim, Hyo Jin and Zhang, Mi},
year = {2026},
eprint = {2606.00825},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2606.00825}
}