Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

A unified taxonomy for video MLLMs that organizes perception, memory, and reasoning into a functional human-view framework.

How can we organize the rapidly evolving landscape of MLLM-based video understanding into a coherent framework of perception, memory, and reasoning?

Video understanding systems struggle to balance the high redundancy of long-form video with the need to capture sparse, decisive evidence for complex reasoning. The authors propose a "watch-remember-reason" taxonomy: a functional framework that decomposes video understanding into selective perception, long-term context retention, and evidence-grounded inference. This structure maps diverse existing methods—from token-level compression to agentic tool use—into a single pipeline, providing a roadmap for building scalable, faithful video intelligence.

Paper Primer

The paper frames video understanding as a progression of three functional abilities: watching (extracting task-relevant evidence), remembering (maintaining context over time), and reasoning (deriving conclusions from evidence). This approach moves the field away from isolated task-specific benchmarks toward a unified system design that mimics human cognitive processes.

The "watch-remember-reason" taxonomy effectively categorizes the frontier of video MLLM research.

The authors map representative methods across fine-grained temporal grounding, streaming memory, and agentic reasoning, and trace current bottlenecks primarily to the tension between evidence sparsity and computational redundancy. The framework integrates diverse techniques, including token merging, hierarchical memory, and reinforcement learning-based post-training, into a single coherent formulation.

Why is a new taxonomy needed when existing surveys already cover video-language models?

Existing surveys are typically organized around specific tasks or training paradigms, which obscures the functional relationship between perception, memory, and reasoning. This paper’s taxonomy integrates these components to address the specific challenges of long-video understanding, such as evidence faithfulness and long-range dependency management.

What is the core mechanism for "watching" in long videos?

Watching is realized through selective perception: models use query-aware frame selection, token-level compression, or adaptive resolution to filter out redundant visual information before it enters the model's context window.
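As a rough illustration of query-aware frame selection, the sketch below scores each frame embedding against a query embedding and keeps only the best-matching frames within a fixed budget. The function names (`cosine`, `select_frames`), the cosine-similarity scoring, and the idea that embeddings come from some shared vision-language encoder are assumptions made for this example, not a method taken from the survey.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(
    frame_embs: Sequence[Sequence[float]],
    query_emb: Sequence[float],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` frames most similar to the query.

    Indices come back in temporal order so that downstream components
    still see the kept frames as a chronological sequence.
    """
    budget = max(0, budget)
    scored = [(cosine(f, query_emb), i) for i, f in enumerate(frame_embs)]
    # Highest similarity first; ties are broken by the earlier frame index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return sorted(i for _, i in scored[:budget])
```

With four toy frames `[[1, 0], [0, 1], [0.9, 0.1], [0, 1]]`, the query `[1, 0]`, and a budget of 2, the function keeps frames 0 and 2. Only those frames would then be encoded and placed in the context window.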

Introduction to Video MLLMs

We define the watch‑remember‑reason taxonomy and introduce video MLLMs for long‑form video understanding.

Multimodal large language models (MLLMs) are rapidly reshaping video understanding, extending LLM pre‑training to process vision, audio, and text. Early work focused on short clips, but current systems must handle minutes‑long videos with sparse, distributed evidence and strict compute budgets. This shift creates three core challenges: selective perception, long‑range memory, and faithful reasoning.

We adopt a human‑view perspective that organizes video MLLM systems into three functional abilities—Watch, Remember, and Reason. These stages mirror how people watch a long video, keep salient events in memory, and integrate evidence to answer questions. The taxonomy unifies diverse methods and highlights the need for joint design of perception, memory, and reasoning.

A Video MLLM is a multimodal model that ingests raw video frames, audio tracks, and optional text, then leverages a large language model to generate language‑grounded outputs.

Our survey makes three contributions: (1) the watch‑remember‑reason taxonomy, (2) a comprehensive mapping of recent video MLLM techniques to this taxonomy, and (3) a systematic summary of training datasets, evaluation benchmarks, and application domains. We also identify open problems such as efficient long‑video processing, memory compression, and evidence‑grounded reasoning. Table 1 contrasts prior surveys and shows how our taxonomy uniquely integrates perception, memory, and reasoning.

The field is moving from short‑clip classification to long‑form, knowledge‑intensive video understanding.

The Watch-Remember-Reason Framework

We map video tasks onto Watch, Remember, and Reason to build a unified taxonomy.

The field lacks a single map for video tasks. We therefore organize video understanding into three cognitive stages—Watch, Remember, and Reason—mirroring human processing. This taxonomy underpins the rest of the survey.

We group every video‑understanding capability into one of three functional camps: Watch extracts perceptual evidence, Remember curates and stores it over time, and Reason draws conclusions from the accumulated knowledge.

Watch: the perceptual stage, which extracts visual and auditory cues, aligns modalities, and selects salient evidence for downstream processing.

Remember: the memory stage, which accumulates and compresses information over time, preserving long-range context while discarding redundancy.

Reason: the inference stage, which integrates perceptual evidence and memory to produce task-specific outputs, often with explicit grounding.

**Fig. 1:** Overview of our survey. **Left:** the survey pipeline. **Right:** our *Watch-Remember-Reason* taxonomy for MLLM-based video understanding. **Watch** (Sec. 3.1) covers fine-grained grounding, captioning, audio-visual perception, and efficient processing. **Remember** (Sec. 3.2) includes offline and streaming memory. **Reason** (Sec. 3.3) covers text-only reasoning and thinking with videos, with both agentic and non-agent approaches. Representative methods are listed under each leaf.

**Table 1.** Comparison of survey scopes under a unified taxonomy. TG&SG denotes temporal and spatial grounding. Cap denotes video captioning. Omni denotes joint understanding across vision, audio, and language. Efficiency denotes efficient video processing. Off-Mem denotes offline memory modeling. Streaming-Mem denotes online memory mechanisms. Text-R denotes textual reasoning. O3-R denotes o3-like video reasoning (thinking-with-videos). Subfields denotes coverage of domain-specific subfields. Train-Data denotes coverage of training datasets. Bench denotes coverage of evaluation benchmarks.

Perceptual Foundations: Watching

This section maps video perception methods into four distinct watching camps.

We organize video perception methods into four watching camps, each emphasizing a different aspect of visual processing.

Watch is the stage where a video MLLM converts raw frames and audio into structured visual tokens that downstream memory and reasoning components can consume.

Fine-grained Watching focuses on precise spatio-temporal grounding of events and objects within long videos.

Comprehensive Watching captures high-level semantics through captioning, summarization, and hierarchical description.

Audio-Visual Watching integrates audio and visual streams to produce coherent multimodal perception.

Efficient Watching reduces redundancy and improves scalability for long-form video understanding.

**Table 1.** Summary of existing video-language models categorized by their primary focus: Fine-grained Watching, Comprehensive Watching, Audio-Visual Watching, and Efficient Watching.

Techniques for Visual Perception

Three efficient watching strategies—frame selection, token compression, and model-level tricks—are compared.

Recent video‑LLM pipelines improve caption quality (ShareGPT4Video, Panda‑70M, Vript, LLaVA‑Video‑178K) and add controllable or intent‑oriented captioning (IF‑VidCap, AnyCap). Dense Video Captioning unifies event detection, timestamp prediction, and sentence generation, with early two‑stage pipelines (Krishna et al.) evolving into end‑to‑end transformers (Masked Transformer, PDVC) and token‑based time encoding (Vid2Seq).

Region‑level captioning extends this to spatially localized targets, exemplified by VideoRefer, PixelRefer, Omni‑RGPT, and flexible interfaces (DAM, CAT‑V, PAM). Audio‑Visual Watching now incorporates speech and environmental sounds, with omni‑modal models (Omni‑Captioner, OmniVinci, Omni‑R1, Ming‑Omni, Megrez‑Omni) and synchronization tricks such as TMRoPE and OmniAlignNet.

Efficient Watching tackles the memory bottleneck of long videos through three orthogonal camps: frame‑level selection, token‑level compression/merging, and model‑level processing tricks. Each camp balances query awareness, granularity, memory reduction, and computational overhead differently.

Frame-level selection: methods that filter or sample whole frames or clips before encoding, often guided by the user query.

Token-level compression and merging: techniques that merge or prune individual visual tokens, often exploiting temporal similarity.

Model-level processing: architectural modifications that skip or sparsify computation for redundant tokens, often via cache or attention tricks.
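To make the token-level camp concrete, the sketch below merges temporally adjacent tokens whose cosine similarity exceeds a threshold, replacing each run of near-duplicates with its running mean. The function name `merge_similar_tokens`, the 0.9 threshold, and the greedy left-to-right grouping are illustrative assumptions. Published token-merging methods use more elaborate matching schemes.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_tokens(
    tokens: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical default, tune per encoder
) -> Tuple[List[List[float]], List[int]]:
    """Greedily merge each token into the previous group if it is similar enough.

    `tokens` is assumed to be ordered in time (for example, the same spatial
    patch across consecutive frames). Returns the merged vectors and how many
    original tokens each one represents.
    """
    merged: List[List[float]] = []
    counts: List[int] = []
    for tok in tokens:
        if merged and cosine(merged[-1], tok) >= threshold:
            n = counts[-1]
            # Update the running mean of the current group.
            merged[-1] = [(m * n + t) / (n + 1) for m, t in zip(merged[-1], tok)]
            counts[-1] = n + 1
        else:
            merged.append([float(t) for t in tok])
            counts.append(1)
    return merged, counts
```

A static shot followed by a scene change collapses to two tokens instead of one per frame. The returned counts let a later stage weight merged tokens by how much video they stand for.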

Memory and Retrieval

Memory architectures for long‑video models are organized into three distinct camps.

The “Remember” stage supplies models with mechanisms to compress, store, and retrieve information from arbitrarily long video streams.

Memory lets a video model keep a concise representation of what it has already seen and pull the right pieces back when later reasoning needs them.

Agentic offline memory is constructed and queried by the model itself through multi-round reasoning and tool use.

Non-agentic offline memory is built by a fixed sequence of modules that extract, compress, and cluster visual information.

Streaming memory updates continuously as new video frames arrive, maintaining a rolling short-term memory (STM) and a hierarchical long-term memory (LTM).
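One way to picture the streaming case is a sliding window of recent frames whose evicted entries are averaged into a compact long-term bank. The sketch below is a simplified illustration under that assumption: the class name `StreamingMemory`, the window and chunk sizes, and mean pooling as the compression step are all hypothetical, and the long-term store here is flat rather than hierarchical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Frame = Tuple[float, List[float]]            # (timestamp in seconds, feature vector)
Summary = Tuple[float, float, List[float]]   # (start, end, mean feature)


@dataclass
class StreamingMemory:
    stm_size: int = 4   # frames kept at full detail (hypothetical default)
    chunk: int = 2      # evicted frames pooled into one long-term entry
    stm: Deque[Frame] = field(default_factory=deque)
    pending: List[Frame] = field(default_factory=list)
    ltm: List[Summary] = field(default_factory=list)

    def observe(self, t: float, feat: List[float]) -> None:
        """Add a new frame; evict the oldest frame once the window is full."""
        self.stm.append((t, feat))
        if len(self.stm) > self.stm_size:
            self.pending.append(self.stm.popleft())
        if len(self.pending) >= self.chunk:
            self._consolidate()

    def _consolidate(self) -> None:
        """Mean-pool the pending frames into a single long-term entry."""
        start, end = self.pending[0][0], self.pending[-1][0]
        dim = len(self.pending[0][1])
        n = len(self.pending)
        mean = [sum(f[d] for _, f in self.pending) / n for d in range(dim)]
        self.ltm.append((start, end, mean))
        self.pending.clear()
```

Each long-term entry keeps its start and end timestamps, so a later retrieval step can point back to the time span it summarizes even though the individual frames are gone.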

**Fig. 3:** Overview of methods related to "How to Remember?". Agentic offline memory constructs and updates external memory through LLM/VLM agents. Non-agentic offline memory builds structured short-term and long-term memory via event extraction, frame selection, token compression, and event clustering. Streaming memory maintains and retrieves memory online through sliding windows, recent memory, and long-term memory banks.

**Table.** Summary of video memory methods categorized by their architectural approach (Agentic, Non-Agent, and Streaming Memory), including the model name, publication venue, training paradigm, and key highlight.

Reasoning Over Video

We map the four reasoning camps and define the Reason stage that drives inference.

A central challenge is to let models reason over what they have watched and remembered.

Reason is the stage where the model turns perceptual observations and stored memories into conclusions, using logical operations and world knowledge.

How does Reason differ from a plain chain‑of‑thought prompt?

Plain chain‑of‑thought lets the model produce a single monologue, whereas Reason explicitly separates hypothesis generation, memory retrieval, and logical evaluation into distinct, controllable modules.

Agentic text-only reasoning: an MLLM agent orchestrates modular steps (clip summarization, adaptive search, memory retrieval, reflection, and answer verification) to solve a query.

Non-agent text-only reasoning: a single forward pass of the MLLM produces a textual chain-of-thought ending with the answer, without external tool calls.

Agentic thinking with videos: an MLLM agent interacts with video content through tools such as spatio-temporal zoom-in, producing augmented clips or frames for downstream reasoning.

Non-agent thinking with videos: a single grounded MLLM forward pass directly incorporates visual evidence (timestamps, boxes, captions) into a chain-of-thought answer.

**Fig. 4:** Overview of methods related to "How to Reason?". Agentic text-only reasoning methods decompose reasoning into modular steps such as clip summarization, adaptive search, memory retrieval, reflection, and answer verification. Non-agent text-only reasoning methods perform a single MLLM forward pass and produce a textual chain-of-thought with the final answer. Agentic thinking with videos methods actively interact with videos through tools like spatio-temporal zoom-in. Non-agent thinking with videos methods directly ground reasoning in visual evidence, such as timestamps, boxes, and captions, within a single grounded MLLM forward pass.

**Table.** Summary of video reasoning methods categorized by their approach (Agentic vs. Non-agent) and reasoning type (Text-only vs. Thinking with Videos).

Domain-Specific Video Understanding

We survey major video subfields and recent multimodal advances.

Egocentric video understanding shifts from passive third‑person observation to first‑person embodied engagement, demanding models that infer intentions and 4D spatio‑temporal dynamics from the wearer’s viewpoint.

Foundational models such as Anticipative Video Transformer, TimeSformer, and EgoVLP/v2 enable fine‑grained grounding, while EgoMask provides pixel‑level spatio‑temporal benchmarks and DMC3 introduces counterfactual contrastive learning.

RL-based approaches ST-Think and VLN-R1 advance 4D world modeling and vision-language navigation, and Ego-R1's Chain-of-Tool-Thought handles week-long contexts. The proactive systems VideoLLM-EyeWO and EgoSocial and the safety benchmark DVBench round out the subfield.

Sports video understanding must cope with rapid actions, frequent camera cuts, and domain‑specific rules, leading to datasets like SPORTU and Unisoccer and methods such as DeepSport’s evidence‑refinement loop and FineQuest’s knowledge‑graph grounding.

Instructional video research targets knowledge acquisition, aligning dense on-screen text with narration and extracting procedural structures. It is exemplified by the benchmarks Video-MMMU, Video-MMLU, InstructionBench, and DocVideoQA and by the systems NoteIt and InsTALL.

Medical video work progresses from task‑specific surgical phase recognition toward multimodal VLMs such as SurgVLP, SurgVISTA, MM‑OR, and specialized LLMs (Surgical‑LLaVA, SurgVLM, SurgVidLM, EndoChat, LLaVA‑Surg), while ultrasound streams are addressed by EchoCLIP, MMSummary, and Sonomate.

Movie and narrative video research moves from clip‑level QA (MovieQA, MovieNet, MAD, MoVQA, MovieChat) to story‑centric reasoning with benchmarks SFD, SCVBench, VRBench, SeriesBench, Cinéaste, MovieCORE, and narrative structuring approaches StoryCoT, PC‑DCoT, and ARC‑Chapter.

Evaluation Benchmarks

Video understanding now spans perception to cognition, organized into Watch, Remember, and Reason stages.

Earlier we noted that video understanding is moving from short‑clip classification toward long‑form, knowledge‑intensive reasoning. This section maps the emerging benchmark landscape onto the three cognitive stages.

Benchmarks that test holistic perception across short, medium, and long video durations, typically using multiple‑choice questions.

Benchmarks that require reasoning over multiple frames, capturing motion, order, and spatial relationships.

Benchmarks that combine temporal, numerical, and counterfactual reasoning, frequently providing chain‑of‑thought traces.

Benchmarks that evaluate models on videos lasting minutes to hours, testing retrieval, topic reasoning, and continuous interaction.

Benchmarks that probe expert‑level understanding in scientific, medical, or engineering video domains.

Benchmarks that require joint audio‑visual reasoning, often with generation or dialogue components.

**Table 6.** Representative Video Understanding Benchmarks (Section 5.2). Type: MCQ (Multi-Choice), OE (Open-Ended), Gen (Generation), Chat (Dialogue). Scale: Number of QA pairs, videos, or annotations.

**Table 6.** A comprehensive overview of representative benchmarks.

The field is moving from simple perception metrics toward rich cognitive benchmarks that evaluate Watch, Remember, and Reason capabilities.

Future Research Directions

Key research avenues to extend video MLLMs toward robust, long‑term, and interactive understanding.

Video MLLMs have moved from short‑clip classification toward long‑form, knowledge‑intensive reasoning, organized around the three cognitive stages—Watch, Remember, and Reason.

Spatial reasoning remains a bottleneck: models can describe a scene but often miss fine‑grained object locations, relationships, and consistent 3‑D layout across frames.

Emerging work tackles this gap by (1) adding dedicated visual encoders for sharper object‑level perception, (2) introducing explicit spatial representations that fuse multi‑view cues into a coherent global layout, and (3) applying structured reasoning—e.g., chain‑of‑thought prompting or step‑by‑step query decomposition—to infer geometry without architectural changes.

Temporal grounding must also expand beyond a single clip: real‑world use cases involve collections of videos, highlights, and edited segments, requiring the model to locate evidence across multiple timelines.

A promising direction treats multi‑video grounding as a set‑based retrieval followed by refinement: first retrieve candidate segments from any video, then fine‑tune start/end boundaries; edit‑aware cues such as cut points or segment IDs can serve as anchors, while uncertainty‑aware inspection lets the model ask for more evidence when confidence is low.
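The retrieve-then-refine idea can be sketched as follows. Coarse candidate segments from any video are ranked by a retrieval score, and each kept segment is then snapped to the contiguous run of high-scoring frames around its best frame. If no frame clears the threshold, the coarse span is returned with a low-confidence flag so a caller could request more evidence. All names (`refine`, `ground`), the per-frame score lists, and the threshold rule are assumptions for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[str, int, int, float]   # (video_id, start, end, retrieval score)
Grounded = Tuple[str, int, int, bool]   # (video_id, start, end, confident?)


def refine(
    start: int, end: int, frame_scores: Sequence[float], threshold: float
) -> Tuple[int, int, bool]:
    """Snap [start, end) to the run of frames >= threshold around the best frame.

    Assumes 0 <= start < end <= len(frame_scores).
    """
    peak = max(range(start, end), key=lambda i: frame_scores[i])
    if frame_scores[peak] < threshold:
        return start, end, False  # nothing confident: keep the coarse span
    lo = peak
    while lo > 0 and frame_scores[lo - 1] >= threshold:
        lo -= 1
    hi = peak + 1
    while hi < len(frame_scores) and frame_scores[hi] >= threshold:
        hi += 1
    return lo, hi, True


def ground(
    segments: Sequence[Segment],
    scores_by_video: Dict[str, Sequence[float]],
    k: int,
    threshold: float = 0.5,  # hypothetical default
) -> List[Grounded]:
    """Stage 1: keep the top-k segments across all videos. Stage 2: refine each."""
    top = sorted(segments, key=lambda s: -s[3])[:k]
    results: List[Grounded] = []
    for video_id, start, end, _ in top:
        lo, hi, ok = refine(start, end, scores_by_video[video_id], threshold)
        results.append((video_id, lo, hi, ok))
    return results
```

Treating the candidates as one pool means the ranking is not tied to any single timeline, and the confidence flag gives a natural hook for uncertainty-aware inspection.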

Standardizing evidence schemas—timestamps, bounding boxes, and grounded captions—will make training and evaluation more consistent across tasks.
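A shared evidence record might look like the dataclass below, which bundles a time span, an optional bounding box, and a grounded caption into one serializable object. The field names (`video_id`, `start_s`, `end_s`, `box_xyxy`, `caption`) and the normalized-coordinate convention are hypothetical choices for this sketch, not a schema proposed in the survey.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Evidence:
    video_id: str
    start_s: float                      # span start, in seconds
    end_s: float                        # span end, in seconds
    # Optional box as (x_min, y_min, x_max, y_max), assumed normalized to [0, 1].
    box_xyxy: Optional[Tuple[float, float, float, float]] = None
    caption: str = ""                   # short grounded description

    def __post_init__(self) -> None:
        if self.end_s < self.start_s:
            raise ValueError("end_s must not precede start_s")

    def to_json(self) -> str:
        """Serialize to JSON so training and evaluation code can share records."""
        return json.dumps(asdict(self))


ev = Evidence("v1", 3.0, 5.5, (0.1, 0.2, 0.5, 0.6), "a person opens the door")
record = json.loads(ev.to_json())
```

Validating the span at construction time catches malformed annotations early, and a single JSON form lets grounding, captioning, and QA tasks exchange evidence without per-task converters.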

Hour‑scale video understanding demands stronger, structured memory: a three‑tier design (short buffer for recent fine‑grained evidence, event memory for temporally bounded episodes, and a long‑term store for entities and relations) can preserve rare but decisive moments.

Learning when to write, update, or forget ensures the system retains important events while discarding redundancy; retrieval should return both a concise summary and the supporting time spans so the model can re‑examine evidence on demand.
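The write, forget, and retrieve behavior described above can be illustrated with a small event store. In this sketch a fixed salience score decides what is forgotten when capacity is exceeded, and retrieval returns both a summary string and the supporting time spans. The class names, the salience field, and keyword matching as the retrieval rule are stand-ins chosen for brevity. A real system would presumably learn these policies, and this covers only the event tier of the three-tier design.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    start_s: float
    end_s: float
    summary: str
    salience: float  # hypothetical importance score in [0, 1]


class EventMemory:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.events: List[Event] = []

    def write(self, event: Event) -> None:
        """Store an event; if over capacity, forget the least salient one."""
        self.events.append(event)
        if len(self.events) > self.capacity:
            weakest = min(self.events, key=lambda e: e.salience)
            self.events.remove(weakest)

    def retrieve(self, keyword: str) -> Tuple[str, List[Tuple[float, float]]]:
        """Return a joined summary plus the time spans that support it."""
        hits = [e for e in self.events if keyword.lower() in e.summary.lower()]
        summary = "; ".join(e.summary for e in hits)
        spans = [(e.start_s, e.end_s) for e in hits]
        return summary, spans
```

Because forgetting is driven by salience rather than age, a brief but important moment can outlive many routine ones, and the returned spans tell the model where to look if it needs to re-examine the original frames.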

Efficient, verifiable reasoning can be framed as a budgeted evidence search that jointly optimizes answer correctness, evidence alignment (temporal/spatial IoU), and evidence compactness, using verifier‑guided preference optimization or verifiable reinforcement learning.
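A scalar objective combining the three terms could take the form below: a correctness bonus, a temporal IoU term for evidence alignment, and a penalty that grows with the share of the frame budget consumed. The function names and the weights `w_iou` and `w_cost` are illustrative assumptions, not values from the survey. Spatial IoU is omitted for brevity.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans (0.0 if they do not overlap)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def evidence_reward(
    correct: bool,
    pred_span: Span,
    gt_span: Span,
    frames_used: int,
    budget: int,
    w_iou: float = 0.5,    # hypothetical weight on evidence alignment
    w_cost: float = 0.1,   # hypothetical weight on evidence compactness
) -> float:
    """Reward = correctness + w_iou * IoU - w_cost * (fraction of budget used)."""
    cost = min(frames_used / budget, 1.0) if budget > 0 else 1.0
    return float(correct) + w_iou * temporal_iou(pred_span, gt_span) - w_cost * cost
```

Under this shaping, a correct answer backed by a tight, well-aligned span and few inspected frames scores higher than the same answer reached by scanning the whole video, which is the behavior a verifier-guided or reinforcement-learning objective would aim to encourage.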

Streaming egocentric video adds continuous, first‑person viewpoints and rapid interaction; a stateful, goal‑driven memory that stores interaction episodes as structured records and proactively retrieves past evidence can support timely, safe interventions.

Evaluating such systems should go beyond static QA to measure timing, stability under updates, and safety of interventions, reflecting the real‑time demands of embodied assistants.

Training Datasets

Survey of large‑scale video‑MLLM training data across four task families.

We organise the publicly available video‑MLLM corpora into four task families—Video QA, Video Captioning, Video Temporal Grounding, and Long Video Memory—so that readers can see how supervision has evolved from short‑answer clips to multi‑round, tool‑augmented reasoning.

**Table 5.** Representative training datasets for video MLLMs (Sec. 5.1). "Scale" refers to the number of (video clip, text) pairs by default, and marked entries report the number of videos instead.

Across all families the community has moved from task‑specific, manually curated benchmarks toward unified instruction‑tuning corpora that blend multiple modalities and support chain‑of‑thought or tool‑use reasoning.
