Personal AI Agent for Camera Roll VQA

Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li

A benchmark and agent architecture for conversational reasoning over long-horizon personal camera rolls.

How can a conversational AI efficiently search and reason over a user's multi-year personal camera roll to answer specific, personalized questions?

Personal camera rolls contain thousands of images, yet current AI assistants treat them as flat, unorganized storage, making it impossible to answer complex, context-dependent questions like "What did I eat after the Space Shuttle launch?" The authors introduce a hierarchical memory structure that abstracts raw pixels into event-based summaries, paired with a domain-specific agent that uses a dedicated toolset to iteratively search, filter, and inspect visual evidence. On the new Camroll benchmark, this agent outperforms general-purpose baselines by achieving higher reasoning accuracy while consuming significantly fewer tokens through selective, multi-step retrieval.

Paper Primer

The Camroll dataset provides a standardized framework for long-horizon personal visual memory, containing 50 users, over 31,000 images, and 2,500 human-annotated questions. It captures the open-ended, personalized reasoning required to navigate fragmented visual streams, distinguishing between semantic memory (general facts) and episodic memory (event-specific recall).

The Camroll-agent uses a three-level pyramid of memory: raw pixels, personalized captions, and event summaries. The agent operates via a ReAct loop, using five tools—search, grep, list, get, and view—to navigate this hierarchy, prioritizing coarse semantic discovery before zooming into raw pixels only when necessary.

Camroll-agent significantly improves reasoning efficiency and accuracy over general-purpose agents.

The agent achieves a judge score of 4.11 using only ~3.2k tokens, whereas general-purpose agents require ~59k tokens for similar performance: roughly an 18x reduction in token consumption at comparable reasoning quality.
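As a quick check on the reduction factor, dividing the two approximate token counts reported above gives a ratio of about 18:

$$\frac{59\text{k tokens}}{3.2\text{k tokens}} \approx 18.4$$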

Hierarchical memory structure is essential for performance.

Ablation studies show that removing event-level summaries or personalized captions leads to consistent performance drops, with caption removal causing the largest failure in episodic reasoning. Performance drops from 4.22 to 2.29 when captions are removed.

Why is a domain-specific agent necessary instead of just using a general-purpose coding agent?

General-purpose agents lack semantic indices and rely on exhaustive visual inspection, leading to inefficient token usage. Camroll-agent’s domain-specific tools allow it to allocate the majority of its budget to semantic retrieval, reducing the need for expensive raw-pixel views.

What is the scope of the Camroll dataset?

It covers 50 personal camera rolls spanning 2–6 years, including both in-house mobile captures and curated YFCC-100M data, designed to reflect the incidental, redundant, and highly personalized nature of real-world photo collections.

The agent's design is model-agnostic; the hierarchical memory and tool interface remain constant even when swapping the underlying LLM or captioner, allowing for modular upgrades as better models emerge.

For researchers building personalized AI, this paper establishes that treating images as first-class, structured entities—rather than reducing them to generic text—is the key to enabling reliable reasoning over long-term personal visual history.

The Personal Camera Roll VQA Setting

Introducing personal camera roll VQA and the challenges of long‑term visual memory.

We study the personal camera roll visual question answering setting, where an AI assistant can retrieve a user’s photos to answer queries ranging from factual to open‑ended. Because a typical roll spans years and contains thousands of images, the assistant must navigate a long‑horizon, highly personalized visual stream.

Personal Camera Roll VQA asks an AI to answer questions by directly accessing a user’s personal photo collection rather than a generic image dataset.

How does Personal Camera Roll VQA differ from conventional image VQA?

Standard VQA treats each question as referring to one given image, while Personal Camera Roll VQA must search across a personal collection, reason about events, and incorporate the user’s history to produce a grounded answer.

Our Camroll dataset comprises 50 users, 31,476 images, and 2,500 QA pairs collected to reflect real‑world usage. Surveys show that smartphone users accumulate about 3,139 photos each; 65% take pictures to reminisce, yet 55% feel overwhelmed when querying their rolls.

The sheer volume, redundancy, and chronological organization of personal rolls make naïve retrieval inefficient. Existing tools provide only basic similarity search and lack event‑based indexing, preventing compositional queries like “What did I eat after the Space Shuttle 135 launch?”

**Figure 1.** We study the VQA setting over the personal camera roll, where an AI assistant can search and retrieve relevant photos from thousands of user images, enabling more personalized responses.

The shift from static image VQA to long‑term personal visual history demands new memory and indexing strategies.

The Camroll Dataset

Camroll offers the first large‑scale, multi‑year benchmark for personal visual reasoning.

Camroll exhibits substantially higher answer diversity than existing VQA datasets.

Table 2 shows top‑10 % answer coverage of $32.0\%$ for Camroll versus $89.9\%$ for VQA and $65.9\%$ for LLaVA.

Across all top‑k thresholds, Camroll’s coverage remains far below that of VQA and LLaVA, confirming a heavy‑tailed answer distribution. This reflects the user‑specific nature of the questions, where each user’s roll contributes a distinct set of answers.
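To make the coverage metric concrete, here is a minimal Python sketch. The exact definition used for Table 2 is not given in this summary, so the sketch assumes that "top-10% coverage" means the share of all answer instances accounted for by the most frequent 10% of unique answers. The function name `top_fraction_coverage` and the toy answer lists are hypothetical.

```python
from collections import Counter
import math


def top_fraction_coverage(answers, fraction=0.10):
    """Share of answer instances covered by the most frequent `fraction`
    of unique answers (assumed definition, not confirmed by the paper)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    # Normalize lightly so "Yes" and "yes" count as the same answer.
    counts = Counter(a.strip().lower() for a in answers)
    # Always keep at least one unique answer in the "top" set.
    k = max(1, math.ceil(fraction * len(counts)))
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(answers)


# Toy data: a concentrated distribution versus a heavy-tailed one.
concentrated = ["yes"] * 8 + ["no", "red"]
heavy_tailed = [f"answer_{i}" for i in range(10)]

concentrated_coverage = top_fraction_coverage(concentrated)
heavy_tailed_coverage = top_fraction_coverage(heavy_tailed)
```

Under this assumed definition, a dataset dominated by a few generic answers scores high, while one in which most answers are unique to a user scores low, which is the pattern the Camroll numbers describe.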

Camroll is a VQA dataset of personal camera rolls spanning multiple years per user, with human‑authored questions that probe both semantic and episodic memory.

How does Camroll differ from standard VQA datasets?

Camroll uses personal multi‑year photo streams, yields far lower top‑10 % answer coverage (32 % vs >65 % in others), and shows strong user‑specific patterns in both question embeddings and answer vocabularies.

**Figure 2.** Overview of camroll. Left: photos are captured across 25+ countries. Right: smartphone users (in-house subset) take substantially more images than digital camera users (YFCC-100M).

Camroll provides the first large‑scale, multi‑year benchmark for personal visual reasoning.

The Camroll Agent Architecture

Camroll-agent combines a hierarchical memory with dedicated tools to answer personal photo queries efficiently.

Processing a personal camera roll containing thousands of images is computationally prohibitive. Camroll-agent tackles this by structuring the data into a compact hierarchy and exposing it through a small suite of tools.

The memory is a three‑level pyramid that lifts raw pixels to captions and then to event summaries, preserving links between adjacent levels so the agent can jump up or down with a single lookup.

Image $I_1$ receives caption $c_1$ and starts event $e_1$ with event ID $ev_1$.

Image $I_2$ receives caption $c_2$; the builder issues an `UPDATE`, extending $e_1$ to include $I_2$ and rewriting its summary.

Image $I_3$ receives caption $c_3$; the builder issues an `ADD`, creating $e_2$ with event ID $ev_2$.

Image $I_4$ receives caption $c_4$; the builder issues a `NO_OP`, appending $I_4$ to $e_2$ without changing the summary.

Lookup: given $I_3$, a single hash read yields $ev_2$, and a reverse index fetches $\{I_3,I_4\}$.

The hierarchy lets the agent retrieve a high‑level event from any image in constant time, while still preserving the ability to drill down to the raw pixels when needed.
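The worked example above can be sketched as a small Python data structure. This is an illustration under stated assumptions, not the authors' implementation: the class and method names (`PyramidMemory`, `ingest`, `event_of`) are hypothetical, the action for each image is passed in rather than decided by a model, and the first image is treated as an `ADD`.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    summary: str
    image_ids: list = field(default_factory=list)  # reverse index: event -> images


class PyramidMemory:
    """Toy three-level memory: images -> captions -> events (hypothetical API)."""

    def __init__(self):
        self.captions = {}        # image_id -> caption
        self.image_to_event = {}  # image_id -> event_id (upward link)
        self.events = {}          # event_id -> Event
        self._current = None      # ID of the most recent event

    def ingest(self, image_id, caption, action, summary=None):
        self.captions[image_id] = caption
        if action == "ADD":
            # Start a new event; fall back to the caption if no summary is given.
            event_id = f"ev_{len(self.events) + 1}"
            self.events[event_id] = Event(event_id, summary or caption)
            self._current = event_id
        elif action in ("UPDATE", "NO_OP"):
            if self._current is None:
                raise ValueError(f"{action} requires an existing event")
            if action == "UPDATE":
                if summary is None:
                    raise ValueError("UPDATE requires a rewritten summary")
                self.events[self._current].summary = summary
        else:
            raise ValueError(f"unknown action: {action}")
        # In all three cases the image is attached to the current event.
        event = self.events[self._current]
        event.image_ids.append(image_id)
        self.image_to_event[image_id] = event.event_id
        return event.event_id

    def event_of(self, image_id):
        # One dictionary read to the event ID, one more to the event record.
        return self.events[self.image_to_event[image_id]]


memory = PyramidMemory()
memory.ingest("I_1", "c_1", "ADD", summary="summary of e_1")
memory.ingest("I_2", "c_2", "UPDATE", summary="rewritten summary of e_1")
memory.ingest("I_3", "c_3", "ADD", summary="summary of e_2")
memory.ingest("I_4", "c_4", "NO_OP")
event = memory.event_of("I_3")
```

Here `event.event_id` is `"ev_2"` and `event.image_ids` is `["I_3", "I_4"]`, matching the lookup in the example. Both dictionary reads are average-case constant time, independent of the number of images in the roll.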

How does this pyramid differ from a flat key‑value store of images?

In a flat store each image is isolated; the pyramid adds two abstraction layers that aggregate captions and events, enabling the agent to reason at the appropriate granularity without scanning all $N$ images.

When a new image arrives, the system decides whether it starts a new episode, extends the current one, or adds nothing new, using three explicit actions.

Why not simply cluster images by visual similarity instead of using these three actions?

Clustering ignores temporal continuity and user‑specific semantics; the three actions explicitly respect chronological order and allow the system to incorporate metadata (date, location) that pure visual similarity would miss.

The agent runs a ReAct loop that alternates between reasoning steps and calls to a small toolbox, each call consuming a modest token budget.

How is this tool‑driven loop different from a standard retrieval‑augmented generation pipeline?

Standard pipelines issue a single retrieval before generation; Camroll-agent interleaves reasoning and tool calls, allowing it to adaptively refine its search based on intermediate observations while respecting a strict tool budget.

Initialize the system prompt with the memory schema and tool descriptions.

Receive a user question.

Generate a thought; if the thought requires external information, issue a tool call (search/grep/list/get/view) with appropriate arguments.

Append the tool’s textual output to the interaction history and update the remaining tool budget.

Repeat thought‑generation and tool‑call steps until a final answer is produced or the step budget is exhausted.

ReAct loop for Camroll-agent (simplified).
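The loop above can be sketched in Python as follows. This is a simplified illustration under assumptions: the language model is replaced by a scripted `policy` callable, only toy versions of two of the five tools are provided, and all names (`run_react_loop`, `scripted_policy`, `toy_search`, `toy_get`, `CAPTIONS`) are hypothetical rather than taken from the paper.

```python
def run_react_loop(question, policy, tools, max_steps=8):
    """Alternate thoughts and tool calls until an answer or the step budget runs out."""
    history = [("question", question)]
    for _ in range(max_steps):
        decision = policy(history)  # stand-in for the LLM's next thought/action
        if decision["type"] == "answer":
            return decision["text"], history
        name, args = decision["tool"], decision.get("args", {})
        if name in tools:
            observation = tools[name](**args)
        else:
            observation = f"error: unknown tool {name}"
        # Append the textual trace so the next thought can condition on it.
        history.append(("thought", decision.get("thought", "")))
        history.append(("action", f"{name}({args})"))
        history.append(("observation", str(observation)))
    return None, history  # budget exhausted without a final answer


# Toy caption store and tools, for illustration only.
CAPTIONS = {
    "I_3": "plate of tacos at a food truck",
    "I_4": "friends eating tacos",
}


def toy_search(query):
    return [i for i, c in CAPTIONS.items() if query.lower() in c.lower()]


def toy_get(image_id):
    return CAPTIONS[image_id]


def scripted_policy(history):
    observations = [text for kind, text in history if kind == "observation"]
    if len(observations) == 0:
        return {"type": "tool", "thought": "find food photos",
                "tool": "search", "args": {"query": "tacos"}}
    if len(observations) == 1:
        return {"type": "tool", "thought": "read the first caption",
                "tool": "get", "args": {"image_id": "I_3"}}
    return {"type": "answer", "text": observations[-1]}


toy_tools = {"search": toy_search, "get": toy_get}
answer, history = run_react_loop("What did I eat?", scripted_policy, toy_tools)
no_answer, _ = run_react_loop("What did I eat?", scripted_policy, toy_tools, max_steps=1)
```

With the full budget the scripted run ends after two tool calls; with `max_steps=1` the loop stops before an answer is produced and returns `None`, which is how the sketch models an exhausted step budget.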

**Figure 3.** Hierarchical memory for personal camera rolls, organized from low-level visual pixels ($I$) to higher semantic abstractions (captions $C$, events $E$). Agent interactions are designed accordingly, ranging from expensive tools (view, get) to cheaper ones (search, grep, list).

Experimental Evaluation

Camroll-agent delivers top accuracy while using far fewer tokens than baselines.

Camroll‑agent achieves 70.5 % multiple‑choice accuracy, the highest among all evaluated methods.

Table 3 reports an accuracy (Acc) of 70.5 % for Camroll‑agent, beating the next best baseline by 2.5 percentage points.

**Table 1.** Comparison of different methods for memory building and retrieval performance.

**Figure 5.** ClaudeCode vs. camroll-agent tool call distributions.

Camroll‑agent outperforms flat retrieval baselines by leveraging event‑based semantic indexing.

Error Analysis and Tool Usage

Personal photo collections are massive; hierarchical memory makes them searchable.

Personal camera rolls contain thousands of images, making naïve visual processing infeasible. The hierarchical memory structure indexes photos by semantic captions and temporal events, enabling efficient retrieval and reasoning.

Our experiments show that hierarchical memory, iterative retrieval, and domain‑specific tool use are essential for long‑horizon visual reasoning. We did not train a dedicated end‑to‑end memory agent; the current system relies on modular components.

Broader impacts revolve around privacy: personal photo collections embed highly sensitive information (identities, locations, daily activities). Exchangeable Image File Format (EXIF) metadata—e.g., GPS coordinates and timestamps—can leak such details if not protected. Deployments must enforce user consent, controllable memory management, secure storage, and privacy‑preserving mechanisms.

The dataset spans 24 years, five continents, and roughly 25 countries. The in‑house subset (2023–2026) is about 1.6× denser than the older YFCC subset, reflecting the rapid accumulation of photos on modern smartphones.

Question analysis reveals that episodic queries are on average twice as long as semantic ones and that more than half of all questions require reasoning across multiple images. Answer analysis shows answers are short (median 2 tokens) but highly personalized—most answer tokens appear in only a single user’s roll.

Future work should explore learning‑based retrieval, joint training of memory and reasoning components, and stronger privacy‑preserving personalization techniques.

**Table 4.** Error analysis on incorrect questions.

**Figure 4.** Tool-call distributions across turns and question types.

**Table 7.** Evidence coverage across dataset subsets and memory types.

Dataset Composition and Schema

Supplementary data detailing dataset composition and question schemas.

**Table 9.** Composition of CAMROLL. The two subsets are complementary: the in-house subset captures contemporary smartphone behavior at full resolution with rich participant-authored event labels, while YFCC contributes longer per-user spans, real EXIF/GPS metadata, and a publicly redistributable license at lower resolution. *Encoded in the filename (YYYY-MM-DD HHMMSS.jpg); †encoded in the YFCC100M datetaken metadata field; ‡YFCC also contains a smaller fraction of early-smartphone captures (e.g., iPhone 4).

**Table.** Scoping constraints used to categorize questions.

**Table 11.** Condition-type schema. Conditions are the constraints in the question that scope the search (which photo / event / email to look at), separately from the question type (what attribute to extract). A single question may carry zero, one, or several conditions; counts therefore sum to more than 1,500. The condition vocabulary is intentionally distinct from the question-type vocabulary so the two slots are not conflated.
