Reason, Then Re-Reason: Cross-View Revisiting Improves Spatial Reasoning

Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao

ReRe improves spatial reasoning by forcing MLLMs to verify initial hypotheses against synthesized novel-view videos.

How can we improve spatial reasoning in egocentric videos by forcing models to verify their initial hypotheses against synthesized novel viewpoints?

Egocentric videos provide limited, trajectory-constrained views, forcing Multimodal Large Language Models (MLLMs) to resolve 3D spatial ambiguities using unreliable semantic priors. The authors propose Reason, then Re-reason (ReRe), a training-free framework that prompts the model to form an initial spatial hypothesis, then verify it against a synthesized, allocentric novel-view video. This two-phase protocol allows models to detect and correct spatial hallucinations, significantly boosting performance on benchmarks like VSI-Bench and STI-Bench without requiring architectural changes or fine-tuning.

Paper Primer

ReRe treats spatial reasoning as a revisitable process rather than a single-turn inference task. The core move is a Geometry-to-Video pipeline: it uses monocular 3D reconstruction to synthesize an "oblique sweep" perspective, which provides the complementary visual evidence needed to disambiguate the scene.

ReRe enables open-source MLLMs to rival or surpass proprietary state-of-the-art models on spatial reasoning tasks.

On VSI-Bench, Qwen3-VL-4B performance increased by 5.8%, while on STI-Bench, InternVL2.5-8B achieved an average score of 34.8, outperforming GPT-4o (31.0). The gains hold across diverse architectures, and sample-level analysis shows a 2.52:1 ratio of positive-to-negative answer flips.

The revisiting protocol, rather than increased compute, is the primary driver of performance.

Ablation studies show that simply "thinking twice" on the original video degrades performance, while naive concatenation of views fails to resolve occlusions. The synergy between egocentric semantics and allocentric structural evidence is essential for verification.

Why is a single-turn inference paradigm structurally fragile for egocentric spatial reasoning?

Because the visual evidence is strictly conditioned on the camera trajectory, the model often lacks the necessary viewpoints to resolve 3D layouts, forcing it to rely on potentially incorrect semantic priors to fill in the gaps.

Why does the framework use an "oblique sweep" trajectory for the synthesized video?

The authors found that horizontal traverses fail to resolve occlusions, while top-down orbits deviate too far from the MLLM's pre-training distribution; the oblique angle balances structural coverage with recognizable visual features.

Spatial Reasoning via Re-reasoning

We expose why single‑turn egocentric reasoning fails and propose a revisiting framework.

Egocentric video streams give only the camera‑bound view of a scene, so the observable evidence is limited and often occluded. This makes spatial reasoning brittle: models must infer 3‑D layouts from sparse, viewpoint‑conditioned frames, which frequently leaves geometric ambiguities unresolved.

Existing approaches assume a single‑turn inference, forcing a Multimodal Large Language Model (MLLM) to produce a final answer from the limited video. When the evidence is insufficient, the model falls back on semantic priors rather than verifiable visual cues, leading to provisional and sometimes incorrect conclusions.

We propose Reason, then Re‑reason (ReRe), a training‑free inference framework that splits spatial reasoning into two phases. In the Reason Phase the MLLM forms an initial hypothesis from the original egocentric video; in the Re‑reason Phase it revisits that hypothesis after observing a synthesized novel‑view video, allowing verification or correction.

To supply the complementary evidence, we build a Geometry‑to‑Video pipeline that first predicts 3‑D structure from the egocentric video and then renders strategically chosen novel viewpoints. The selected views adopt an elevated, oblique perspective that spans the scene and reveals occluded regions, while remaining in a standard video format that the frozen MLLM can consume without architectural changes.

Our contributions are: (1) identifying the fragility of single‑turn egocentric reasoning and advocating a revisitable paradigm; (2) introducing the ReRe framework with a two‑phase protocol and a geometry‑driven view synthesis pipeline; (3) demonstrating substantial gains on VSI‑Bench and STI‑Bench across open‑source MLLMs; and (4) providing ablations that isolate the impact of cross‑view verification.

**Figure 1.** ReRe enables the model to revisit its initial hypothesis under a synthesized novel view, correcting spatial reasoning errors that single-turn inference misses. Each case shows the original egocentric video (top frames) and the synthesized novel view (bottom frames), along with the model's reasoning before (Reason Phase, blue) and after (Re-Reason Phase, red) revisiting. (a) Object Counting. The synthesized elevated view reveals a chair occluded by the desk, prompting the model to correct its initial count. (b) Route Planning. The expanded perspective exposes a previously unobserved target, enabling the model to revise its hallucinated command.

The key insight is that moving from single‑turn perception to multi‑turn verification enables models to correct spatial errors using synthesized allocentric views.

Context in Spatial Understanding

Key trends in visual spatial understanding and the gap our revisitable reasoning fills.

Early work on visual spatial understanding treated each image as an isolated snapshot, grounding objects and inferring layout from a fixed, often occluded perspective.

Multimodal Large Language Models (MLLMs) have shown strong performance on such static tasks, but their reliance on a single frame inherently caps spatial reasoning.

More recent research has moved toward video‑based spatial understanding, treating a video as a continuous trajectory of viewpoints whose frames must be aggregated into a coherent 3D representation.

This shift has spawned two streams: training‑based methods that fine‑tune on 3D‑grounded data, and training‑free approaches that reformat sequential spatial cues into formats directly consumable by MLLMs.

Both streams remain bound to a single‑turn inference paradigm, forcing the model to resolve geometric ambiguities from a pre‑recorded trajectory without any mechanism to verify its hypothesis against additional visual evidence.

Our work introduces a revisitable reasoning paradigm: after an initial hypothesis, the system synthesizes complementary allocentric views and forces the model to re‑evaluate the hypothesis against this new evidence.

Recent breakthroughs in monocular geometry prediction, exemplified by VGGT, demonstrate that dense 3D structure can be recovered from a single image at scale.

Prior approaches typically embed VGGT’s latent geometry into the MLLM’s feature space or use it as auxiliary supervision, which requires architectural changes and still operates in a single‑turn fashion.

In contrast, we treat the generative geometry output as explicit visual evidence, allowing a training‑free MLLM to directly verify its spatial reasoning by observing synthesized views.

The ReRe Framework

This section details ReRe's two‑phase inference procedure, which adds an allocentric view for verification.

Egocentric videos capture only a narrow, trajectory‑conditioned slice of a scene, leaving many spatial relations ambiguous.

The framework first asks the model to articulate a provisional hypothesis from the egocentric view, then asks it to re‑evaluate that hypothesis against a synthesized allocentric video.

How does ReRe differ from the standard single‑turn inference formulation?

Standard inference predicts $A$ directly from $V_{ego}$ and $Q$, assuming the view is sufficient. ReRe inserts an intermediate hypothesis $H$ and a second pass over a synthesized allocentric video $V_{exo}$, allowing the model to check its own reasoning against new geometric evidence before committing to $A^*$.
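The contrast can be written compactly. In this sketch, $\mathcal{M}$ (the frozen MLLM) and $\mathcal{G}$ (the Geometry‑to‑Video pipeline) are symbols introduced here for illustration, and the exact set of inputs to the second pass is an assumption; in particular, whether $V_{ego}$ is re‑supplied in the Re‑reason Phase is left open.

$$
\begin{aligned}
\text{Single-turn:}\quad & A = \mathcal{M}(V_{ego}, Q) \\
\text{Reason Phase:}\quad & H = \mathcal{M}(V_{ego}, Q) \\
\text{View synthesis:}\quad & V_{exo} = \mathcal{G}(V_{ego}) \\
\text{Re-reason Phase:}\quad & A^* = \mathcal{M}(V_{exo}, Q, H)
\end{aligned}
$$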

From $V_{ego}$ the geometry module predicts a 3D point cloud with the cup at $(x=0.2, y=0.0, z=0.5)$ meters.

Trajectory planning lays out an Oblique Sweep: a diagonal camera path that views the scene from the opposite corner.

View rendering produces $V_{exo}$ showing the cup from the far‑right side, where the cup appears on the right edge.

The Re‑reason Phase feeds $V_{exo}$ and $H$ to the MLLM, which detects the inconsistency and flips the answer to “no”.

The final answer $A^*$ is thus corrected to “no”.

The example shows how the allocentric view can expose a mistaken spatial assumption that the egocentric view alone could not reveal.

Instead of circling the scene at a fixed elevation, the trajectory sweeps diagonally through the scene center, exposing both horizontal and vertical variations in a single pass.

Why not simply traverse the scene horizontally at a fixed elevation, or orbit it from a bird’s‑eye view?

A horizontal traverse keeps the camera at a constant elevation, so occlusions persist and many surfaces remain unseen. A top‑down orbit reveals the layout but deviates too far from the MLLM's pre‑training distribution. The diagonal Oblique Sweep combines an elevated tilt with a path through the scene center, exposing surfaces that a flat traverse would miss while keeping visual features recognizable, which provides richer geometric evidence for verification.

At $\theta=0$, the camera is at $p_{start}=(-1,0,-1)$ looking toward $c$.

At $\theta=\pi/2$, the camera moves to $(0,0,0)$, directly above the scene center.

At $\theta=\pi$, the camera reaches $p_{end}=(1,0,1)$, the opposite corner.

Each step renders a frame; the three frames together reveal all four walls of the room.

The resulting video $V_{exo}$ contains viewpoints that together cover $360^\circ$ of the scene with only three frames.

The diagonal sweep achieves full coverage with fewer frames than a full circular orbit, illustrating its efficiency.
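The three‑step walkthrough above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' trajectory planner: the function name `oblique_sweep` and the linear mapping from $\theta \in [0, \pi]$ to positions between $p_{start}$ and $p_{end}$ are assumptions chosen only because they match the three sample points in the example. Camera orientation and elevation handling are omitted.

```python
import math

def oblique_sweep(p_start, p_end, num_frames):
    """Return (theta, position) pairs along a straight diagonal path.

    Assumption: theta runs linearly from 0 to pi while the camera moves
    linearly from p_start to p_end. The real planner may differ.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to define a sweep")
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)          # normalized progress in [0, 1]
        theta = math.pi * t               # sweep parameter in [0, pi]
        position = tuple(a + t * (b - a) for a, b in zip(p_start, p_end))
        poses.append((theta, position))
    return poses

# Three frames, matching the walkthrough: start corner, midpoint, end corner.
for theta, position in oblique_sweep((-1, 0, -1), (1, 0, 1), 3):
    print(f"theta={theta:.3f} position={position}")
```

Raising `num_frames` gives a denser sweep along the same path, which is the knob a frame‑budget trade‑off would turn.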

From a single egocentric video, a geometry predictor infers a 3D point cloud representing the scene layout.
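Once a point cloud is available, a novel view can be rendered by projecting the points into a virtual camera. The following is a minimal sketch of point‑based rasterization with a depth buffer, assuming a pinhole camera with its principal point at the image center and points already expressed in camera coordinates. The function name `rasterize_points` and its parameters are hypothetical, and the sketch is not the paper's renderer.

```python
def rasterize_points(points, focal, width, height):
    """Project colored 3D points to an image, keeping the nearest point per pixel.

    points: iterable of (x, y, z, color) in camera coordinates, z pointing forward.
    Returns a height x width grid; pixels hit by no point stay None.
    """
    depth = [[float("inf")] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for x, y, z, color in points:
        if z <= 0:
            continue  # point is behind the camera
        # Pinhole projection with the principal point at the image center.
        u = int(round(focal * x / z + width / 2))
        v = int(round(focal * y / z + height / 2))
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z       # z-buffer test: nearer point wins
            image[v][u] = color
    return image

# Two points on the same ray: the nearer one occludes the farther one.
demo = rasterize_points([(0, 0, 2, "far"), (0, 0, 1, "near")], 10, 4, 4)
print(demo[2][2])
```

The depth test is what lets a rendered view respect occlusion, so an object hidden in one viewpoint can become visible in another.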

**Figure 2.** Overview of the ReRe Framework. Given an egocentric video, our method operates in two phases: (1) Reason Phase, where the MLLM forms an initial hypothesis from the original view; and (2) Re-reason Phase, where the model verifies its hypothesis against a synthesized allocentric view ($V_{exo}$). The Geometry-to-Video pipeline generates $V_{exo}$ via trajectory planning and view rendering to provide complementary geometric evidence.

**Figure 3. Overview of the Geometry-to-Video Pipeline.** It consists of two stages: (1) Trajectory Planning, where we predict a 3D point cloud via VGGT and design a scene-spanning Oblique Sweep path; and (2) View Rendering, where we synthesize temporally coherent video frames $V_{exo}$ via point-based rasterization.

**Figure 4.** Visual Comparison of Allocentric Trajectory Designs. (a) Oblique Sweep (Ours) follows a diagonal path through the scene center with an elevated tilt. (b) Mid-level Traverse moves horizontally along the diameter at a fixed elevation. (c) Bird's-eye Orbit circles the scene center from a top-down perspective.

Evaluation and Benchmarking

ReRe consistently lifts spatial‑reasoning scores across diverse MLLM backbones.

ReRe improves average performance on VSI‑Bench by up to +8.5% across open‑source MLLMs.

Table 1 shows gains such as +5.8% for Qwen3‑VL‑4B and +8.0% for Qwen3‑VL‑2B, with the largest improvement reaching +8.5%.

VSI‑Bench evaluates fine‑grained spatial understanding in egocentric video by asking both multiple‑choice and numerical questions across eight capabilities.

How does VSI‑Bench differ from typical video‑QA benchmarks?

Typical benchmarks use coarse multiple‑choice questions and ignore fine‑grained spatial relations. VSI‑Bench adds numerical answer formats, requires multi‑object correspondence, and evaluates eight distinct spatial capabilities, making it a stricter test of spatial reasoning.

**Table 1.** Performance on VSI-BENCH. ReRe boosts spatial reasoning across diverse MLLMs. † denotes VSI-BENCH (tiny) subset.

**Table.** Performance comparison of various models across different tasks.

ReRe provides consistent gains across multiple MLLM architectures.

Limitations and Ablations

We examine how each ReRe component affects accuracy and latency.

The core premise—that a revisitable reasoning loop with an allocentric view improves spatial reasoning—remains the same, but we now ask what each piece contributes.

Qualitative Analysis and Efficiency

The appendix showcases qualitative corrections, the prompt format, and an efficiency analysis of ReRe.

**Figure 5.** Qualitative Results on VSI-BENCH. We visualize how ReRe resolves spatial ambiguities in (a)-(b) Object Counting, (c) Absolute Distance, and (d) Relative Direction.

The prompt template follows a two‑phase “Reason, Then Re‑reason” workflow. Phase 1 asks the model to form an initial hypothesis from the egocentric video, while Phase 2 supplies the newly synthesized view and asks the model to revise its answer. This structure forces the model to explicitly compare its original reasoning against fresh geometric evidence.
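The two‑phase workflow can be sketched as a thin wrapper around any model call. Everything named here is hypothetical: the prompt wording, the `rere` function, and the `(frames, prompt) -> text` model interface are illustrative assumptions, not the paper's actual template or API. The sketch also assumes that Phase 2 sees only the synthesized view plus the Phase 1 hypothesis.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: an MLLM is any callable mapping (frames, prompt) to text.
MLLM = Callable[[Sequence[str], str], str]

# Illustrative prompt wording only; the paper's exact template is not reproduced.
REASON_PROMPT = (
    "Question: {question}\n"
    "Watch the egocentric video and give an initial answer with brief reasoning."
)
REREASON_PROMPT = (
    "Question: {question}\n"
    "Initial hypothesis: {hypothesis}\n"
    "This video shows the same scene from a synthesized novel viewpoint. "
    "Check the hypothesis against it, then confirm or revise the answer."
)

def rere(model: MLLM, question: str,
         ego_frames: Sequence[str], exo_frames: Sequence[str]) -> Tuple[str, str]:
    """Run Reason, then Re-reason, and return (hypothesis, final_answer)."""
    # Phase 1: form a provisional hypothesis from the original view.
    hypothesis = model(ego_frames, REASON_PROMPT.format(question=question))
    # Phase 2: revisit that hypothesis against the synthesized view.
    final_answer = model(
        exo_frames,
        REREASON_PROMPT.format(question=question, hypothesis=hypothesis),
    )
    return hypothesis, final_answer
```

Because the model is passed in as a plain callable, the same wrapper works with any frozen backbone, which mirrors the training‑free design.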

Table 6 reports a 2.52:1 ratio of positive to negative flips among samples altered by ReRe, indicating that corrections substantially outweigh new errors. Positive flips (baseline wrong → ReRe correct) constitute 71.6% of changed samples and 11.21% of the entire dataset, whereas negative flips (baseline correct → ReRe wrong) account for only 28.4% and 4.44%, respectively.
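The reported ratio and shares follow directly from the two dataset‑level percentages, as the short check below shows. The helper `flip_stats` is a hypothetical illustration of how such flips could be counted from per‑sample correctness; it is not the authors' evaluation code.

```python
def flip_stats(baseline_correct, rere_correct):
    """Count positive (wrong -> correct) and negative (correct -> wrong) flips."""
    pairs = list(zip(baseline_correct, rere_correct))
    positive = sum(1 for before, after in pairs if not before and after)
    negative = sum(1 for before, after in pairs if before and not after)
    return positive, negative

# Dataset-level percentages quoted in the text above.
positive_pct = 11.21
negative_pct = 4.44

ratio = positive_pct / negative_pct
positive_share = positive_pct / (positive_pct + negative_pct)
negative_share = negative_pct / (positive_pct + negative_pct)

print(f"positive:negative = {ratio:.2f}:1")
print(f"share of changed samples: {positive_share:.1%} vs {negative_share:.1%}")
```

The two shares sum to one by construction, so the 71.6% and 28.4% figures and the 2.52:1 ratio are three views of the same pair of counts.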

Table 7 compares three settings on VSI‑Bench using the Qwen3‑VL‑8B MLLM. The “ReRe (100 frames)” variant achieves the highest VSI Avg score (35.8) while incurring about 11 s total runtime, whereas the “ReRe (20 frames)” variant reduces runtime to roughly 4 s with a modest drop to 34.2. The single‑turn baseline runs in ~1 s but lags behind both ReRe variants in spatial reasoning performance.
