OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Yifei Li, Pengyiang Liu, Yuhang Zang, Zhongyue Shi, Qi Fu, Hongye Hao, Jiwen Lu

OVO-S-Bench evaluates how multimodal LLMs perceive, remember, and reason about space in continuous egocentric video streams.

How do current multimodal LLMs perform on streaming spatial reasoning tasks that require maintaining context over long egocentric video sequences?

Multimodal agents must reason about spatial structure from continuous egocentric streams, yet existing benchmarks evaluate offline over full videos or target event memory rather than spatial persistence. The authors introduce OVO-S-Bench, a human-annotated benchmark of 1,680 questions across 348 videos that enforces a strict streaming protocol where models only access the prefix preceding each query. Across 38 systems, the best model trails human experts by 27 points, with allocentric mapping serving as the primary performance bottleneck.

Paper Primer

OVO-S-Bench organizes spatial intelligence into a four-level taxonomy: instantaneous egocentric perception (L1), spatiotemporal context tracking (L2), spatial simulation (L3), and allocentric mapping (L4). The benchmark requires models to maintain a persistent spatial state, as evidence for higher-level tasks often disappears from the current view long before the query timestamp.

Allocentric mapping is the dominant bottleneck for streaming spatial intelligence.

L4 is the lowest-scoring level for 28 of 34 systems, with accuracy dropping sharply once models must abstract a global map from the explored region. Even the top-performing Gemini-3.1-Pro trails human streaming performance by 27 points overall.

Specialized streaming architectures and spatial fine-tuning methods consistently underperform their own base backbones. These methods are typically optimized for narrative QA or discrete-frame tasks, failing to sustain the persistent spatial state required for continuous egocentric streams.

Why does this benchmark use a streaming protocol instead of allowing models to re-scan the full video?

Real-world agents like household robots or AR assistants must reason about space as it unfolds; allowing offline re-scanning masks the difficulty of maintaining persistent spatial memory over time.

Does chain-of-thought reasoning help models overcome these spatial limitations?

It is double-edged: while it aids cross-frame integration for L2 tasks, it degrades current-view perception (L1) and amplifies hallucinations when the reasoning is not grounded in the video stream.

Introduction to Streaming Spatial Intelligence

We expose the gap in static benchmarks and introduce OVO‑S‑Bench for streaming spatial intelligence.

Multimodal agents that act in the physical world must reason about space from a continuous egocentric video stream. Static image or short‑clip benchmarks cannot test the ability to retain, update, and query spatial structure over time, which is essential for robotics, AR, and autonomous driving.

Streaming Spatial Intelligence is the capability to maintain and reason about a scene’s layout while the camera moves, using only the past video prefix to answer queries whose evidence may no longer be in the current view.

How does Streaming Spatial Intelligence differ from traditional static‑image spatial reasoning?

Static reasoning can look at the entire image at once, while Streaming Spatial Intelligence must infer spatial relations from a growing history of frames, never seeing future content. It therefore requires memory, incremental updates, and query answering under a prefix‑only constraint.

Gemini‑3.1‑Pro trails human experts by 27 points on OVO‑S‑Bench.

Gemini‑3.1‑Pro scores 59.2 while humans achieve 86.6 overall.

**Figure 1.** Overview of OVO-S-Bench. The benchmark evaluates streaming spatial understanding across four levels, from instantaneous egocentric perception and spatiotemporal context tracking to generative spatial reasoning and global topological mapping. The right panel summarizes representative model behavior across task families.

The key shift is moving from static image benchmarks to continuous egocentric streaming for spatial understanding.

Benchmark Landscape and Taxonomy

Survey of prior benchmarks, embodied settings, and models relevant to streaming spatial intelligence.

Prior work on visual spatial understanding falls into three strands: static image benchmarks, video benchmarks that assume offline access, and emerging streaming benchmarks that enforce causal input.

Extends spatial evaluation to multi‑image inputs, requiring models to fuse several viewpoints before answering.

Embodied question answering that requires episodic memory and active exploration in a simulated environment.

Streaming video benchmark that enforces causal input and bounded memory, targeting event detection and counting.

**Table 1.** Comparison with spatial and streaming video benchmarks. Stream indicates a prefix-only protocol at query time; Avg. Query Time equals full video duration for offline benchmarks and the mean query timestamp (prefix length) for streaming ones (“N/A” if not open-sourced; OST-Bench is reported in frames as it releases only sampled frames). L1–L4 follow the taxonomy in Section 3; ✓ marks tasks at that level. OVO-S-Bench is the only entry with all four levels checked under a streaming protocol; per-benchmark discussion in Appendix F.

Benchmark Design and Protocol

Defines the streaming protocol, dataset splits, and model families evaluated on OVO‑S‑Bench.

The evaluation follows a strict streaming protocol: models only observe video frames up to the query moment, never the future.

The protocol guarantees that every model answers based solely on the video prefix available at the query timestamp, preventing any look‑ahead leakage.

What would happen if frames after $t_q$ were included in the input?

Including post‑query frames would leak future information, giving the model an unfair advantage and breaking the streaming‑only premise of the benchmark.

Locate the query timestamp $t_q$ in the source video.

Truncate the video at $t_q$ to obtain the prefix.

Uniformly sample 128 frames from the prefix.

Package the frames with the question and its multiple‑choice options.

For streaming‑architecture models, feed the video at the model’s published streaming rate and query the resulting compressed state.

Run the model (via API or locally) and extract its answer using a regular‑expression parser.

Compare the extracted answer to the ground‑truth option to compute accuracy.
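The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function names (`sample_prefix_timestamps`, `build_request`, `accuracy`) and the request dictionary layout are hypothetical, and endpoint-inclusive spacing is an assumption, since the summary only says the 128 frames are sampled uniformly from the prefix.

```python
from typing import Dict, List, Optional


def sample_prefix_timestamps(t_q: float, num_frames: int = 128) -> List[float]:
    """Return timestamps (seconds) spread uniformly over the prefix [0, t_q]."""
    if t_q < 0:
        raise ValueError("t_q must be non-negative")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if num_frames == 1:
        return [t_q]
    step = t_q / (num_frames - 1)
    # Clamp to t_q so floating-point drift can never select a post-query frame.
    return [min(i * step, t_q) for i in range(num_frames)]


def build_request(t_q: float, question: str, options: Dict[str, str],
                  num_frames: int = 128) -> Dict[str, object]:
    """Package prefix-only frame timestamps with the question and its options.

    The dictionary layout is hypothetical; a real pipeline would decode the
    frames at these timestamps and pass pixels to the model.
    """
    return {
        "frame_times": sample_prefix_timestamps(t_q, num_frames),
        "question": question,
        "options": options,
    }


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of items whose extracted answer matches the ground truth."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)
```

Because every timestamp is bounded by `t_q`, no frame after the query moment can enter the model input, which is the property the streaming protocol depends on.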

**Figure 2.** Representative OVO-S-Bench examples. Each card pairs a spatial question with visual evidence, illustrating the progression from current-view perception to allocentric mapping.

**Figure 3.** Taxonomy and benchmark statistics for OVO-S-Bench. The left panel gives the four-level spatial taxonomy, while the right panels report task-family counts, source distribution, and evidence-interval lengths by level.

All 38 systems are thus evaluated under a uniform streaming regime, enabling direct comparison of their ability to reason over continuous egocentric video streams.

Performance Benchmarks

Key performance gaps across models and reasoning levels are quantified.

The best proprietary model, Gemini‑3.1‑Pro, scores 59.2, far below human experts at 86.6 under the streaming protocol.

Human streaming accuracy 86.6 points vs. Gemini‑3.1‑Pro 59.2 points.

Closed‑source models lead open‑source by only 5.6 points overall.

Gemini‑3.1‑Pro 59.2 vs. Qwen3‑VL‑235B‑A22B 53.6.
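As a quick check on the headline numbers, both gaps follow directly from the overall scores reported above:

$$
86.6 - 59.2 = 27.4, \qquad 59.2 - 53.6 = 5.6
$$

The first difference is the human-versus-model gap, quoted elsewhere in rounded form as 27 points; the second is the gap between the best closed-source and best open-source systems.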

L4 (Allocentric Mapping) is the lowest‑scoring level for 28 of 34 systems, with an average 9.3 % drop from L1‑L3.

Average gap 9.3 % between L1‑L3 and L4; largest open‑source backbones lose 10.6 % (Qwen3‑VL‑235B‑A22B) and 13.8 % (InternVL‑3.5‑241B‑A28B).

**Table 2.** Performance comparison of various MLLM models across four categories: Instant Ego Perception, Context Tracking, Spatial Reasoning, and Allocentric Mapping.

Performance gaps between proprietary and open‑source models persist across all reasoning levels, especially in allocentric mapping.

Thinking Mode and Reasoning

We dissect how thinking mode, frame sampling, and specialized methods impact OVO‑S‑Bench performance.

The preceding results showed a large gap between current MLLMs and human spatial reasoning. This section analyzes which design choices explain that gap.

Thinking Mode augments a model with an explicit chain‑of‑thought (CoT) generation step before answering, aiming to let the model reason over the streamed visual evidence.

How does “Thinking Mode” differ from a generic chain‑of‑thought prompt?

In generic CoT prompts the model invents a reasoning trace without any guarantee it is tied to the streamed visual evidence. Thinking Mode explicitly conditions the trace on the frames seen so far, so the trace must be grounded in visible content rather than being a free‑form speculation.

**Table 3.** Thinking-mode versus non-thinking-mode variants on OVO-S-Bench. $\Delta$ is thinking minus non-thinking accuracy.

Gains do not scale with backbone size: mid‑size Qwen3.5 shows the strongest improvement, whereas larger Qwen3‑VL and InternVL‑3.5 hover near zero. Across abstraction levels, Thinking Mode consistently boosts L2 (+3.9 mean) but slightly hurts L1 (−1.0 mean).

**Table.** The table lists various policies, their prefix coverage strategies, and their types (streaming or offline). The policies include single@query, nearest-16f@4fps, uniform-128, log-decay-128, and oracle-evidence.

The one‑frame query policy (1@q) illustrates a level‑selective trade‑off: it yields a +7.8 boost on L1 for Qwen3.5‑27B but incurs a −4.1 drop on L4, echoing the perception‑memory tension.

**Table 5.** Specialized methods versus their base backbones on OVO-S-Bench. $\Delta$ is specialized minus base accuracy.

Retention analysis shows that Evidence Recall (ER) on L4 (0.14–0.42) is far lower than on L1 (0.60–0.76), yet the Pearson correlation between ER and correctness is essentially zero (r≈0). Thus, simply keeping more frames does not translate into better answers.

Token Compression and Sampling

Key ablation insights on sampling policies, scaling, and failure analysis.

This appendix reports the ablation studies that underpin the main results: frame‑sampling policies, backbone scaling, chain‑of‑thought failure taxonomy, and the annotation pipeline.

Token‑compression Diagnostics measures how much temporal information survives when the model compresses a long video into a fixed‑size token stream.

How does Token‑compression Diagnostics differ from a plain token‑count budget?

Token‑compression Diagnostics looks at *where* tokens are kept (temporal distribution) and *whether* the kept tokens still cover the annotated evidence. A simple token count only tells you how many frames are used, not if they are the right ones.
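To make the coverage idea concrete, here is a minimal sketch of an evidence-recall style metric. The exact definition used in the paper is not given in this summary, so the formula below is an assumption: the evidence interval is split into 1-second bins, and recall is the fraction of those bins that contain at least one retained frame. The function name `evidence_recall` is hypothetical.

```python
import math
from typing import Iterable


def evidence_recall(retained_times: Iterable[float],
                    evidence_start: float,
                    evidence_end: float) -> float:
    """Fraction of 1-second evidence bins that hold at least one retained frame.

    Assumed definition for illustration only; the benchmark's own ER formula
    may differ.
    """
    if evidence_end < evidence_start:
        raise ValueError("evidence interval must have start <= end")
    first_bin = math.floor(evidence_start)
    # Always keep at least one bin, even for a zero-length interval.
    last_bin_exclusive = max(math.ceil(evidence_end), first_bin + 1)
    evidence_bins = set(range(first_bin, last_bin_exclusive))
    retained_bins = {math.floor(t) for t in retained_times}
    return len(evidence_bins & retained_bins) / len(evidence_bins)
```

For example, with retained frames at 0.5 s, 2.2 s, 3.9 s, and 10.0 s and an evidence interval of 2.0 to 6.0 s, two of the four evidence bins are covered, so this sketch returns 0.5. A plain token count would report four retained frames in either case and say nothing about whether they fall on the evidence.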

**Figure 5.** Token-compression diagnostics. (a) Retained 1-s bin density vs. $t/t_q$ (dotted: uniform). (b) Evidence Recall (ER) by level. (c) ER vs. $P(\text{correct})$; legend shows per-method Pearson $r$.

Failure Mode Analysis

We catalogue the benchmark’s failure modes and how they were annotated.

Failure Mode Analysis groups the benchmark’s error cases into two tiers—process‑level and reasoning‑level—highlighting where models lose track of visual evidence or misuse their Chain‑of‑Thought reasoning.

The authors categorize every observed error into Tier 1 (the model fails to ground its answer in visual evidence) or Tier 2 (the model’s reasoning chain diverges from the visual facts).

**Figure 9.** Tier 1: Process-level failures. Two examples each for T1a (no-conclusion error, top) and T1b (non-visual error, bottom). Left and right columns show cases from different task families. Evidence frames are sampled at midpoints of the annotated evidence interval.

**Figure 10.** Tier 2: Reasoning-level failures. Two examples each for visual-content error (top), direction error (middle), and temporal-binding error (bottom). Each card shows the CoT excerpt and a bolded error analysis.

Benchmark Data Schema

Provides data schemas, benchmark comparisons, evaluation details, and illustrative examples.

The benchmark is released as a single JSONL file (`ovo_s_bench_l1_l4.jsonl`) containing 1,680 items; Table 11 enumerates each field (e.g., `video_path`, level, question) and its type.

An abbreviated example item shows the full structure, including the unique id, categorical tags, video location, taxonomy level, task type, question text, option map, timestamps, evidence intervals, and the correct answer.

Guideline updates are generated from recurring annotation disagreements (Section E.5.3) and pushed back to annotators; a versioned document records each change and the issue it resolves.

All items obey the prefix‑only protocol: evidence intervals always precede the query timestamp, and the three list fields (`query_times`, `evidence_times`, `answers`) are equal‑length and positionally aligned.
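These two invariants are easy to check mechanically. The sketch below assumes each entry of `evidence_times` is a single `[start, end]` pair in seconds; that layout is an assumption, as is the helper name `validate_item`. Only the three field names come from the schema description above.

```python
from typing import Any, Dict, List


def validate_item(item: Dict[str, Any]) -> List[str]:
    """Return a list of protocol violations for one benchmark item."""
    problems: List[str] = []
    query_times = item.get("query_times", [])
    evidence_times = item.get("evidence_times", [])
    answers = item.get("answers", [])

    # The three list fields must be equal-length and positionally aligned.
    if not (len(query_times) == len(evidence_times) == len(answers)):
        problems.append("list fields differ in length")
        return problems

    for i, (t_q, interval) in enumerate(zip(query_times, evidence_times)):
        start, end = interval  # assumed [start, end] layout
        if start > end:
            problems.append(f"query {i}: evidence interval is reversed")
        # Prefix-only protocol: evidence must not extend past the query time.
        if end > t_q:
            problems.append(f"query {i}: evidence ends after the query timestamp")
    return problems
```

Running this over every line of the JSONL file and asserting that the result is empty would catch any item whose evidence leaks past its query timestamp.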

Section F expands Table 1’s high‑level comparison of OVO‑S‑Bench against 14 prior spatial and streaming benchmarks, detailing what each benchmark covers, where it overlaps with our claims, and the remaining gaps.

EmbSpatial‑Bench provides 3.6 K single‑image MCQs on six egocentric relations; even the strongest LVLMs (GPT‑4V 36 %, Qwen‑VL‑Max 49 %) fall far short of human performance (≈90 %).

TopViewRS supplies 11.4 K top‑view MCQs across 7 Matterport3D scenes; it confirms a >50 % human–model gap that widens from recognition to spatial reasoning, but it gives the model a bird’s‑eye map, eliminating the need for allocentric reconstruction.

MMSI‑Bench offers 1 K multi‑image questions covering ten relation types; it reveals a ≈55‑point human–model gap and four failure modes, yet it lacks a streaming protocol and any L3/L4 tasks.

VSI‑Bench (5 K items) uses offline video with full re‑attendable evidence; it shows that CoT prompting harms spatial‑video QA and that the bottleneck is spatial reasoning, but it provides no L4 allocentric queries.

DISJOINT‑3DQA (5.4 K synthetic Q) isolates cross‑frame spatial memory by using anchors that are never co‑visible; performance degrades with spatial separation, yet the benchmark is synthetic, single‑source, and limited to L1/L2.

STI‑Bench (2 K questions) targets precise spatial‑temporal estimation; even proprietary models (Gemini‑2.5‑Pro 41.4 %) fail, but the tasks are offline, numerical, and lack streaming or L4 mapping.

MMSI‑Video‑Bench (1.1 K Q) extends MMSI‑Bench to video; it reports the largest human–AI gap on video spatial intelligence (≈58 pts) and shows that fine‑tuned models and CoT do not help, yet it remains offline and lacks L3/L4.

VSI‑SUPER presents two long‑horizon stress tests (up to 240 min); it demonstrates that brute‑force context expansion cannot close the streaming gap, but it only defines two needle‑in‑haystack tasks and omits L3/L4.

StreamingBench (4.5 K Q) introduced the prefix‑only protocol; it shows a ≈25‑point gap for the strongest model, yet spatial reasoning is only one of 18 tasks and L3/L4 are absent.

OVBench (≈7 K Q) formalizes Past/Current/Future temporal scopes; it reveals that sliding‑window offline models outperform native online ones, but its tasks focus on events rather than spatial structure.

OVO‑Bench (2.8 K Q) highlights streaming hallucination with a ≈30‑point human–model gap; it includes a single spatial task (STU) and lacks L3/L4 mapping.

OST‑Bench (10 K Q) targets online spatiotemporal understanding for embodied agents; it shows a sharp accuracy decline as exploration grows, yet it is limited to indoor agent‑state relations (L1/L2) and includes no allocentric mapping.

ODV‑Bench (12 K Q) evaluates autonomous‑driving streams; it confirms streaming spatial reasoning is critical for safety, but tasks stop at short‑horizon prediction and lack L3/L4.

Our synthesis (Section F.4) identifies three recurring gaps: (i) spatial benchmarks lack streaming, (ii) streaming benchmarks rarely include spatial structure, and (iii) no prior benchmark covers all four spatial abstraction levels, especially L4 allocentric mapping.

OVO‑S‑Bench addresses these gaps by enforcing per‑item prefix‑only queries, spanning indoor, outdoor, and 3‑D‑rendered domains, and stratifying tasks into L1–L4 levels with explicit allocentric supervision.

Text‑only shortcut analysis (Section F.5) shows a +5.8 pp advantage for OVO‑S‑Bench, comparable to the median across 12 peer spatial benchmarks, confirming that our blind‑review pipeline successfully suppresses language shortcuts.

Source‑robustness analysis (Section F.6) removes the dominant RoomTour3D source (48 % of items); rankings remain highly correlated (Spearman $\rho$ = 0.92, p < 10⁻⁴), indicating genuine cross‑domain difficulty rather than a single‑source artifact.
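For readers who want to reproduce this kind of robustness check, the sketch below computes Spearman's rank correlation between two score lists, for example per-model accuracy on the full benchmark and on the subset with one source removed. It is a plain-Python illustration with average ranks for ties; the function names are hypothetical and no benchmark data is included.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must be aligned and have at least two entries")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A value near 1 means the model ordering is nearly unchanged when the source is dropped, which is how the reported $\rho$ = 0.92 should be read.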

Section G details the evaluation pipeline: model inventory (G.1), uniform decoding (G.2), default 128‑frame sampling (G.3), streaming‑model ingestion rates (G.4), unified prompt template (G.5), policy‑specific input variants (G.6), deterministic answer extraction (G.7), and hardware/software stack (G.8).

Models are run with their published configurations; closed‑source APIs were accessed between 2026‑02 and 2026‑04, and all visual inputs are cached using a SHA‑1 content address to guarantee reproducibility.
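Content addressing of this kind can be illustrated with Python's standard `hashlib`. The sketch below is an assumption about how such a cache might look, not the authors' implementation: the class name `ContentAddressedCache` and the in-memory dictionary store are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def content_address(data: bytes) -> str:
    """Return the SHA-1 hex digest of the raw bytes, used as the cache key."""
    return hashlib.sha1(data).hexdigest()


class ContentAddressedCache:
    """In-memory cache keyed by the SHA-1 digest of each stored payload."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = content_address(data)
        # Identical bytes always map to the same key, so re-insertion is a no-op.
        self._store.setdefault(key, data)
        return key

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)
```

Because the key depends only on the bytes, two runs that sample the same frames resolve to the same cached inputs, regardless of file names or run order.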

Decoding uses greedy generation (T = 0) with a 1,024‑token cap for non‑thinking rows; thinking‑mode rows follow the model‑specific recipes listed in Table 14.

For each query at time $t_q$, the default visual input consists of $N=128$ uniformly sampled frames from the prefix $[0, t_q]$, resized to each model’s prescribed resolution (336–512 px) and respecting per‑frame token caps.

Streaming models ingest frames sequentially at their published rates (see Table 15); they produce a compressed state used for answer generation without imposing the uniform $N=128$ budget.

All models receive the same multiple‑choice prompt, except InternVL‑3.5‑thinking which adds a step‑by‑step reasoning instruction; the prompt explicitly requests only the answer letter.

Policy‑specific inputs vary only in the visual sampling policy (e.g., single@query, uniform‑32, log‑decay‑128); the textual prompt remains unchanged across policies.
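Two of the named policies can be sketched directly from their names. The interpretation is an assumption: `single@query` is read as the one frame at the query timestamp, and `nearest-16f@4fps` as the 16 frames closest to the query when the stream is sampled at 4 fps. The log-decay policy is left out because this summary does not specify its schedule. Function names are hypothetical.

```python
from typing import List


def single_at_query(t_q: float) -> List[float]:
    """single@query: only the frame at the query timestamp."""
    if t_q < 0:
        raise ValueError("t_q must be non-negative")
    return [t_q]


def nearest_frames(t_q: float, k: int = 16, fps: float = 4.0) -> List[float]:
    """nearest-16f@4fps (assumed reading): the last k frames before t_q at fps.

    Timestamps that would fall before the start of the video are dropped,
    so short prefixes yield fewer than k frames.
    """
    if t_q < 0 or k < 1 or fps <= 0:
        raise ValueError("need t_q >= 0, k >= 1, fps > 0")
    times = [t_q - i / fps for i in range(k)]
    return sorted(t for t in times if t >= 0.0)
```

Both functions return timestamps no later than `t_q`, so swapping the policy changes which part of the prefix is visible without breaking the prefix-only constraint.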

Answer extraction scans the model output in a tail‑first order, matching patterns such as “Answer: X” or a bare final letter; unmatched outputs are counted as incorrect.
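A tail-first extractor along these lines can be written with Python's `re` module. This is a sketch under assumptions, not the benchmark's actual parser: the option letters A to D, the exact patterns, and the function names are all hypothetical.

```python
import re
from typing import Optional


def extract_answer(output: str, letters: str = "ABCD") -> Optional[str]:
    """Return the last explicit 'Answer: X' letter, else a bare final letter."""
    cls = re.escape(letters)
    # Case-insensitive on the word "answer" only; the letter must be uppercase
    # and must not be the first character of a longer word.
    explicit = re.findall(rf"(?i:answer)\s*:\s*\(?([{cls}])\)?(?![A-Za-z])", output)
    if explicit:
        return explicit[-1]  # tail-first: the last match wins
    # Fall back to a standalone option letter at the very end of the output.
    bare = re.search(rf"(?<![A-Za-z])([{cls}])[\s.)]*$", output.strip())
    return bare.group(1) if bare else None


def is_correct(output: str, gold: str) -> bool:
    """Unmatched outputs yield None and are therefore counted as incorrect."""
    return extract_answer(output) == gold
```

Taking the last explicit match matters for thinking-mode outputs, where a model may state a tentative answer mid-trace and revise it before the end.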

Open‑source inference runs on NVIDIA H800/H200 GPUs via vLLM ≥ 0.11 when supported; closed‑source models use provider‑hosted OpenAI‑compatible endpoints, and all video decoding uses decord with an OpenCV fallback.

Future work will automate data construction, explore 3‑D‑aware memory and world‑model imagination, and broaden evaluation with open‑ended answers, richer human baselines, and interactive embodied tasks.

Section I presents representative examples for every level (L1–L4) and task family, pairing query timestamps, question text, options, and ground‑truth answers to illustrate the benchmark’s diversity.

**Figure.** Level 1.2 Local Spatial Relationships

**Figure 12:** Per-source accuracy (%) for the top-8 reported models. Sources ordered left-to-right by question count (descending).
