SuperMemory-VQA

Introduction

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs.

To fill this gap, we introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. As illustrated in the figure below, SuperMemory-VQA emphasizes four critical properties missing from prior benchmarks: Comprehensive Memory Tasks, Long-Horizon Context, Multi-Evidence Reasoning, and Realistic Question Answers.

Comprehensive Memory Tasks

Covers practical personal, social, spatial, conversational, and timeline-oriented memory needs that arise in everyday egocentric recordings.

Long-Horizon Context

Questions grounded in recordings lasting hours and sometimes spanning days, exceeding the practical context limits of current VLMs.

Multi-Evidence Reasoning

Many questions link sparse evidence across disjoint moments — e.g. a spoken plan and a later action — demanding retrieval and temporal abstraction.

Realistic Question Answers

Ordered answer choices (correct > vague > wrong > unanswerable) test whether models avoid confident hallucination when evidence is missing.

Dataset

52.9 hours of egocentric data

10 participants

4,853 question-answer pairs

6 memory task categories

SuperMemory-VQA contains 52.9 hours of egocentric recordings of everyday activities, collected from ten participants wearing Gen 1 Meta Aria glasses. The recordings are multimodal: egocentric RGB video (1408×1408, 30fps), dual SLAM streams (640×480, 30fps), eye-tracking (320×240, 60fps), and 7-channel audio (48kHz). Participants recorded indoors and outdoors following generic scripts — cooking a recipe, playing board games, doing chores, and having conversations. Each participant contributed 3–12 hours across multiple sessions, with three participants recording across multiple days (up to two weeks).

Based on the recordings, we construct 4,853 grounded QA pairs through a human-verified annotation pipeline. The pairs cover six commonly encountered memory tasks: Object and Location Memory, Intent Recall, Visual Scene Recall, Timeline Reconstruction, Conversational Memory, and In-Context Retrieval. Each question is posed as a multiple-choice item with ordered answer choices (accurate, vague, incorrect, and an explicit unanswerable option), so that safe abstention can be distinguished from confident hallucination.

Object & Location Memory

Recover the last known location and trajectory of objects across time and rooms.

Intent Recall

Recover stated or implied goals, reminders, and intended future actions.

Visual Scene Recall

Retrieve visual details such as text, screens, ingredients, manuals, and scene contents.

Timeline Reconstruction

Sequence events chronologically and reason over how actions unfolded.

Conversational Memory

Recall spoken facts, commitments, deferred answers, and mid-conversation corrections.

In-Context Retrieval

Combine current context with earlier facts and associations in the recording history.
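As a rough illustration of how the ordered answer choices separate abstention from hallucination, the sketch below models one QA item and labels a model's pick. The class and field names (`MemoryQuestion`, `ChoiceType`, `answerable`) are hypothetical and are not the released data schema. It also assumes that the unanswerable option is the gold answer whenever the recording holds no supporting evidence.

```python
from dataclasses import dataclass
from enum import Enum


class ChoiceType(Enum):
    """The four answer-choice roles described in the text."""
    ACCURATE = "accurate"
    VAGUE = "vague"
    INCORRECT = "incorrect"
    UNANSWERABLE = "unanswerable"


@dataclass
class MemoryQuestion:
    """One multiple-choice item (illustrative fields, not the released schema)."""
    question: str
    task: str          # e.g. "Intent Recall"
    answerable: bool   # False when the recording holds no supporting evidence

    def gold(self) -> ChoiceType:
        """Assumption: unanswerable items are credited only for abstaining."""
        return ChoiceType.ACCURATE if self.answerable else ChoiceType.UNANSWERABLE


def classify_response(item: MemoryQuestion, picked: ChoiceType) -> str:
    """Label a response so abstention and hallucination can be told apart."""
    if picked == item.gold():
        return "correct"
    if picked == ChoiceType.UNANSWERABLE:
        return "wrong abstention"   # evidence existed, but the model declined
    if not item.answerable:
        return "hallucination"      # no evidence, but the model answered anyway
    return "vague" if picked == ChoiceType.VAGUE else "wrong answer"
```

For example, picking the unanswerable option on an answerable item is labeled "wrong abstention", the failure mode that the findings report as dominant.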

Leaderboard

We evaluate two state-of-the-art frameworks for long-form video understanding, each illustrating a distinct strategy for managing long-horizon context, paired with a diverse set of open- and closed-source VLMs.

Video-RAG

A training-free, single-turn retrieval-augmented framework that augments a VLM with auxiliary text from the source video. Three databases are precomputed per session — ASR transcripts (WhisperX), OCR text (EasyOCR), and object detections on CLIP-selected keyframes (APE). At inference the VLM decomposes the query, each database is queried via FAISS over Contriever embeddings, and retrieved text is concatenated with 32 sampled frames.
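The retrieval step described above can be sketched in a few lines. This is a simplified stand-in, not the Video-RAG implementation: a toy bag-of-words vector replaces Contriever embeddings, a brute-force cosine search replaces the FAISS index, and the query decomposition by the VLM is assumed to have already produced one sub-query per database. The function names and the prompt layout are hypothetical, and the 32 sampled frames are omitted.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a dense Contriever embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, database: list, k: int = 2) -> list:
    """Brute-force top-k search; FAISS would do this over a precomputed index."""
    q = embed(query)
    ranked = sorted(database, key=lambda entry: cosine(q, embed(entry)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, sub_queries: dict, databases: dict, k: int = 2) -> str:
    """Query each auxiliary database with its sub-query and concatenate the hits."""
    sections = []
    for name, entries in databases.items():
        hits = retrieve(sub_queries.get(name, question), entries, k)
        sections.append(f"[{name}]\n" + "\n".join(hits))
    return "\n\n".join(sections) + f"\n\nQuestion: {question}"
```

In the real pipeline the concatenated text would be passed to the VLM together with the sampled frames; here the function only returns the text portion of the prompt.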

EgoButler

Proposed alongside EgoLife and the closest in spirit to SuperMemory-VQA, EgoButler pairs EgoGPT (an omni-modal captioner) with EgoRAG, which recursively summarizes dense visual–audio captions into hour- and day-level digests to form a hierarchical memory bank. At query time it performs coarse-to-fine temporal localization, retrieving summaries first, then narrowing to clips.
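A minimal sketch of coarse-to-fine localization over a hierarchical memory bank follows. It assumes the day-, hour-, and clip-level summaries already exist (the recursive summarization itself is not shown) and uses word overlap as a toy relevance score in place of a learned retriever. `MemoryNode`, `overlap`, and `localize` are hypothetical names, not part of EgoButler or EgoRAG.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """A summary at one time scale (day, hour, or clip) with finer-grained children."""
    label: str
    summary: str
    children: list = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def localize(query: str, node: MemoryNode) -> list:
    """Descend from coarse to fine, following the best-matching child at each level."""
    path = [node.label]
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.summary))
        path.append(node.label)
    return path
```

The returned path names the day, then the hour, then the clip that best matches the query, so only that clip needs to be handed to the answering model. A greedy descent like this one cannot recover if the coarse summary drops the relevant detail, which is one reason a real system may keep several candidates per level.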

Vision-Language Models

Qwen-3-VL 8B, Qwen-3-VL 30B, InternVL-3.5 8B, InternVL-3.5 30B, Gemma-4-E4B IT, Gemma-4 31B, Gemini-3-Flash, Gemini-3.1-Pro, GPT-5.4-mini, GPT-5.4

Evaluation Metrics

Ans-F1

F1 of the binary decision of whether a question is answerable from the available evidence, or should receive the unanswerable option.
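The binary decision can be scored as in the sketch below. It assumes that "answerable" is the positive class and that a prediction counts as answerable whenever the model picks any option other than the unanswerable one; the benchmark's exact convention may differ.

```python
def answerability_f1(gold_answerable: list, pred_answerable: list) -> float:
    """F1 on a 0-100 scale for the answerable/unanswerable decision.

    Assumption: "answerable" is treated as the positive class.
    """
    pairs = list(zip(gold_answerable, pred_answerable))
    tp = sum(g and p for g, p in pairs)          # answerable, and the model answered
    fp = sum((not g) and p for g, p in pairs)    # unanswerable, but the model answered
    fn = sum(g and (not p) for g, p in pairs)    # answerable, but the model abstained
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)
```

Under this convention, a model that abstains on every question scores 0, and excessive abstention shows up as low recall.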

QA-Acc

Four-way multiple-choice accuracy: only the ground-truth option earns credit. Selecting the vague option, selecting the wrong option, and abstaining on an answerable question all count as errors.

QA-MRR

Mean reciprocal rank from the model's ordered scores, rewarding ranking the correct answer above vague, wrong, and unsupported alternatives.
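QA-Acc and QA-MRR can both be derived from per-option scores, as the sketch below shows. It assumes each item provides a score for every option plus the key of the gold option; the option keys and the tie-breaking rule (insertion order of the dictionary) are illustrative choices, not the benchmark's specification.

```python
def reciprocal_rank(scores: dict, gold: str) -> float:
    """1 / rank of the gold option when options are sorted by descending score."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    return 1.0 / (ranking.index(gold) + 1)


def qa_metrics(items: list) -> tuple:
    """Return (QA-Acc, QA-MRR) on a 0-100 scale from (scores, gold) pairs."""
    rr = [reciprocal_rank(scores, gold) for scores, gold in items]
    accuracy = sum(r == 1.0 for r in rr) / len(rr)   # gold option ranked first
    mrr = sum(rr) / len(rr)
    return 100 * accuracy, 100 * mrr
```

Accuracy only credits items whose gold option is ranked first, while MRR still gives partial credit (1/2, 1/3, 1/4) when the gold option is ranked lower, so QA-MRR is never below QA-Acc.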

🏆 Model Rankings

Answerability, accuracy, and ranking across two retrieval pipelines

How to read the scores: all metrics are reported on a 0–100 scale (higher is better). Rankings are by Video-RAG accuracy by default.

Metric Key

Ans-F1: Answerability F1, detecting whether a question is answerable.

Acc.: Accuracy, correct answer selection.

MRR: Mean Reciprocal Rank of the correct answer.

Model	VR Ans-F1	VR Acc.	VR MRR	EB Ans-F1	EB Acc.	EB MRR

VR = Video-RAG pipeline, EB = EgoButler pipeline. Best value in each column is highlighted.

Findings

Our benchmarking results on state-of-the-art frameworks and VLM backbones reveal that existing systems remain far from reliable on real-world long-horizon memory tasks.

Key Findings

1. Structured retrieval helps most. Video-RAG beats EgoButler on most metrics, with its largest gain in Ans-F1 (51.5% → 70.5%). Retrieval-augmented evidence is especially useful for deciding whether a question is grounded in the recorded memory.
2. Closed-source is stronger, but size is not destiny. Closed models average 76.8% Ans-F1 / 53.6% QA-Acc under Video-RAG vs. 66.4% / 41.9% for open models, yet performance is not monotonic with scale: Gemini-3-Flash beats Gemini-3.1-Pro on every Video-RAG metric.
3. Larger models can collapse on the wrong format. InternVL-3.5 30B drops to 28.5% Ans-F1 under EgoButler (vs. 61.4% for the 8B), failing to exploit EgoButler's caption-based memory format.
4. Excessive abstention dominates failures. The main error on answerable questions is abstaining when evidence is present; several open models wrongly abstain on more than 70% of answerable cases.
5. The Ans-F1 / accuracy gap is the open challenge. Detecting answerability is only the first hurdle; coupling long-horizon retrieval with grounded reasoning to actually select the correct answer remains unsolved.

Task-Category Breakdowns

Figure: Video-RAG task-category breakdown across model families.

Figure: EgoButler task-category breakdown across model families.

Figure: EgoButler, Video-RAG, and VideoAgent (Gemini-3-Flash) compared across the six task categories. VideoAgent underperforms both baselines despite far higher token and compute cost.

Figure: Response reliability on answerable questions. Failures are dominated by excessive abstention rather than wrong answers.

Long-Horizon Difficulty

Figure: Temporal robustness. Performance vs. the time gap between the query and its supporting evidence.

Figure: Complexity scaling. Performance as questions require more pieces of supporting evidence.

Participant Survey

Eight participants reviewed 18 questions generated from their own recordings and rated seven statements about question quality and utility. Responses were strongly positive, with no disagreement across statements.

86% agreed the questions captured genuine memory lapses, 82% found the answers useful during daily routines, and 78% agreed the underlying knowledge would also help answer future questions.

Figure: Participant survey responses for QA quality (Likert).

Citation

@misc{alam2026supermemoryvqa,
  title         = {SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory},
  author        = {Alam, Samiul and Siam, Shakhrul Iman and Proulx, Michael J. and Fort, James and Newcombe, Richard and Kim, Hyo Jin and Zhang, Mi},
  year          = {2026},
  eprint        = {2606.00825},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.00825}
}