Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
Astra enables VLMs to perform agentic spatial reasoning by actively querying an action-conditioned world simulator.
How can we improve the spatial reasoning of Vision-Language Models by allowing them to interact with a world simulator to "imagine" and test spatial hypotheses?
Vision-Language Models (VLMs) struggle with spatial reasoning when required to infer unobserved layouts or maintain consistency across viewpoints from limited egocentric images. Astra addresses this by treating spatial reasoning as an interactive evidence-acquisition process: the model learns to issue camera-motion queries to a world simulator, which generates spatially consistent novel-view observations for the model to reason over. This agentic approach improves the Qwen3-VL-8B backbone from 29.8 to 38.8 on MMSI-Bench, demonstrating that effective imagination requires both reliable simulation and a learned policy for selective tool use.
Paper Primer
The core mechanism, Astra-WM, is a Bagel-based world simulator fine-tuned with view consistency tuning to ensure generated images preserve scene identity and follow requested camera motions. Astra-VL, the reasoning policy, is trained via a two-phase reinforcement learning curriculum: the first phase teaches valid simulator interaction, while the second phase optimizes for selective imagination by rewarding tool use only when it improves reasoning accuracy over direct answering.
Astra significantly outperforms direct-answer baselines on multi-view spatial reasoning benchmarks.
Astra-VL improves the Qwen3-VL-8B backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube.
Spatially consistent world simulation is a prerequisite for effective agentic reasoning.
Forced tool-use with Astra-WM improves Gemini-3-Flash accuracy from 45.1 to 49.5 on MMSI-Bench, a 4.4-point gain, whereas off-the-shelf generation models provide negligible or negative gains.
Why is "forced" tool use insufficient for spatial reasoning?
Forced tool use requires the model to interact with the simulator mechanically but fails to teach it when additional evidence is actually needed, which viewpoint is informative, or how to ground the returned observation in the original context.
What is the primary bottleneck in current VLM spatial reasoning?
Current models are tied to fixed visual inputs and lack the ability to actively resolve spatial ambiguity by seeking additional evidence from alternative perspectives.
The Spatial Reasoning Gap
We identify the limits of static-image spatial reasoning in VLMs and introduce Astra, which lets them imagine new views via a world simulator.
Current Vision‑Language Models (VLMs) excel at visual reasoning but remain confined to the static images they receive. When only a few egocentric views are available, they cannot infer unobserved layout, maintain cross‑view consistency, or resolve spatial ambiguities that would become clear from a different viewpoint.
We frame this shortcoming as a need for “imagination”: a VLM should be able to request imagined visual evidence by interacting with a world simulator. Astra addresses this by coupling an RL‑trained VLM policy (Astra‑VL) with a Bagel‑based world simulator (Astra‑WM) that generates novel‑view observations conditioned on natural‑language camera‑motion commands.
Preliminary experiments show that both components are essential: Astra‑WM improves simulator‑augmented Gemini‑3‑Flash on MMSI‑Bench from 45.1 to 49.5, while Astra‑VL lifts the Qwen3‑VL backbone from 29.8 to 38.8 on MMSI‑Bench and from 36.8 to 42.7 on MindCube. These gains demonstrate that reliable imagined observations and a learned policy for when to invoke them are both required for effective spatial reasoning.
**Figure 1.** Reasoning trajectory of Astra. Astra tackles the challenging visual spatial reasoning task by agentically leveraging the world simulator within the reasoning process.
The shift from static observation to interactive simulation enables VLMs to acquire missing visual evidence and resolve spatial uncertainty.
The Astra Architecture
We detail how Astra links a world simulator with an agentic VLM policy to enable interactive spatial reasoning.
Static VLMs cannot query unseen viewpoints, so they often fail on spatial questions that require looking around. Astra solves this by giving the VLM a plug‑in world simulator it can call on demand. The simulator supplies imagined observations that are spatially consistent with the requested camera motion, letting the policy reason beyond the original images.
Astra‑WM takes the current set of context images, a reference image, and a natural‑language camera‑motion command, and produces a novel view that matches the requested motion while preserving the scene’s layout.
The generator $W$ encodes $I_1$ and $I_2$, attends to the reference $I_1$, and conditions on $u_t$.
It predicts a depth map for the forward motion and synthesizes a new image $\hat{I}_{t+1}$ that shows the scene from a point 1 m ahead of the original front view.
Object positions shift consistently: a chair that was 2 m away in $I_1$ now appears 1 m away in $\hat{I}_{t+1}$.
The generated view preserves the wall texture and lighting, confirming spatial consistency.
The example shows how view‑consistency tuning forces the simulator to respect both the commanded motion and the underlying scene geometry, rather than producing a generic plausible image.
How does Astra‑WM differ from a generic image‑to‑image generator?
Generic generators ignore the explicit camera command and may change object identities; Astra‑WM is conditioned on a concrete motion instruction and is trained to keep object identities and relative layout stable across the imagined viewpoint.
The policy maintains the full interaction history as a trajectory $T_t$ and at each step decides either to answer directly or to invoke the simulator for a new view that could clarify the spatial question.
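The answer-or-query loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the `Action` type, the `policy` and `simulator` callables, the use of strings as image placeholders, and the choice to return `None` when the tool budget is exhausted are all hypothetical. Only the cap of three tool rounds is taken from the reported rollout settings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical policy output: a final answer or a camera-motion query."""
    kind: str                 # "answer" or "query"
    text: str                 # answer text, or a natural-language motion command
    reference_index: int = 0  # which available image the motion is relative to


Step = Tuple[str, str]  # (motion command, imagined view placeholder)


def run_episode(
    question: str,
    images: List[str],
    policy: Callable[[str, List[str], List[Step]], Action],
    simulator: Callable[[List[str], int, str], str],
    max_tool_rounds: int = 3,
) -> Tuple[Optional[str], List[Step]]:
    """Run one rollout and return (answer, trajectory of simulator calls)."""
    trajectory: List[Step] = []
    views = list(images)  # original images plus any imagined views
    for _ in range(max_tool_rounds + 1):
        action = policy(question, views, trajectory)
        if action.kind == "answer":
            return action.text, trajectory
        if len(trajectory) >= max_tool_rounds:
            break  # tool budget spent and the policy still did not answer
        imagined = simulator(views, action.reference_index, action.text)
        trajectory.append((action.text, imagined))
        views.append(imagined)
    return None, trajectory
```

Keeping the whole trajectory visible to the policy at every step is what allows it to judge whether another imagined view is still worth requesting.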
Why not simply let the VLM generate an answer without ever calling the simulator?
Because many spatial questions require a viewpoint that is not present in the original images; without the simulator the policy would have to guess, leading to systematic errors on ambiguous queries.
The curriculum first teaches the policy to issue valid simulator queries, then encourages it to invoke the simulator only when the imagined view yields a measurable gain over answering directly.
What would happen if the policy kept invoking the simulator even when the gain $\Delta_i$ over answering directly is negative?
The negative‑gain penalty $-\beta\max(0,-\Delta_i)$ would reduce the total reward, causing the policy to learn to stop calling the simulator in situations where imagined views degrade answer quality.
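As a small numeric illustration of the penalty term $-\beta\max(0,-\Delta_i)$, the sketch below evaluates it for a few gain values. It assumes that $\Delta_i$ is the tool-assisted score minus the direct-answer score for sample $i$; that definition and the helper names are assumptions made for illustration, while $\beta=0.03$ is the reported value.

```python
def gain(score_with_tool: float, score_direct: float) -> float:
    """Assumed definition of the gain: tool-assisted score minus direct score."""
    return score_with_tool - score_direct


def negative_gain_penalty(delta: float, beta: float = 0.03) -> float:
    """Penalty term -beta * max(0, -delta); it is zero whenever delta >= 0."""
    return -beta * max(0.0, -delta)


# A tool call that turns a correct direct answer into a wrong one is penalized.
harmful = negative_gain_penalty(gain(0.0, 1.0))
# A tool call that helps, or changes nothing, incurs no penalty from this term.
helpful = negative_gain_penalty(gain(1.0, 0.0))
neutral = negative_gain_penalty(gain(1.0, 1.0))
```

Because the term is one-sided, it only discourages harmful calls; any incentive for helpful calls has to come from other parts of the reward.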
**Figure 4.** Pipeline for Constructing Training Data for the World Simulator
**Figure 2.** Astra consists of two components: Astra-VL and Astra-WM. The overview illustrates the input-output details of both models during training and inference, as well as the two-phase reinforcement learning training pipeline.
Evaluation and Performance
Astra achieves large accuracy gains on multi‑view spatial reasoning benchmarks.
Astra raises MMSI-Bench exact-match accuracy from 29.8 % to 38.8 % (+9.0 points) and MindCube from 36.8 % to 42.7 % (+5.9 points) under Agentic Tool-Use.
Table 1 reports the Agentic Tool‑Use results for Astra Qwen3‑VL‑8B compared to Direct Answer baselines.
Implementation follows the RL stage only: Astra-VL is fine-tuned from a Qwen3-VL-8B checkpoint, the vision tower stays frozen, and the policy is trained with bfloat16 FSDP, gradient checkpointing, and the Clip-Higher strategy ($\epsilon_{\text{low}}=0.2$, $\epsilon_{\text{high}}=0.28$). Each rollout permits up to three tool rounds and ten assistant turns, with $\lambda_{\text{fmt}}=0.5$, $\alpha=0.1$, $\beta=0.03$, $c=1$, and $\lambda_{\text{use}}=0.02$.
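To make the Clip-Higher setting concrete, the sketch below shows the usual asymmetric clipped surrogate for a single token, where the importance ratio is clipped to $[1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}]$. It assumes Astra-VL uses the standard PPO-style form of this objective; the function name and the per-token framing are illustrative, and only the two clipping values come from the text.

```python
def clip_higher_surrogate(
    ratio: float,
    advantage: float,
    eps_low: float = 0.2,
    eps_high: float = 0.28,
) -> float:
    """Per-token clipped surrogate with decoupled lower and upper bounds.

    ratio is pi_new(token) / pi_old(token); advantage is its estimated advantage.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the objective a pessimistic bound, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a larger upper bound than lower bound, tokens with positive advantage can have their probability raised further before the clip takes effect, which is generally motivated as a way to preserve exploration.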
MMSI‑Bench is a collection of 1,000 multi‑view spatial reasoning questions covering diverse spatial‑relation categories.
MindCube evaluates multi-view spatial reasoning inside a structured 3D environment, requiring the model to reason across synthesized camera poses.
**Figure.** A visual question answering task showing two images of a living room and a multiple-choice question regarding the relative camera position.
**Figure.** Two egocentric images of a laundry room showing different viewpoints, followed by a multiple-choice question regarding the spatial relationship between the two camera positions.
Astra consistently outperforms static baselines on multi‑view spatial reasoning.
Ablation Study
Astra lets a VLM query a world simulator to imagine new viewpoints, enabling spatial reasoning beyond static images.
We isolate three factors that drive Astra’s gains: simulator consistency, the two‑phase RL curriculum, and the inference‑time workflow mode.
Simulator Consistency guarantees that imagined views preserve both camera motion and scene layout, so the VLM can rely on the generated observations for downstream reasoning.
How does Simulator Consistency differ from a generic image generator like Bagel?
Bagel aims only for photorealism; it can change object positions or camera pose arbitrarily. Simulator Consistency explicitly aligns the generated view with the original camera pose and keeps the scene layout fixed, providing spatially trustworthy evidence.
The full two‑phase RL curriculum yields the highest overall MMSI‑Bench score.
Table 3 shows an overall score of 38.8, surpassing single‑stage and phase‑only variants.
**Table 3.** Effectiveness of the Two-Phase RL Curriculum on MMSI-Bench.
**Figure 3.** Inference-time workflow mode ablation of our Astra on MMSI-Bench.
Implementation and Reproducibility
Appendix provides data details, evaluation metrics, error analysis, and licensing.
We collect 11,292 scenes from ScanNet, ScanNet++, Matterport3D, ARKitScenes, and DL3DV, each represented as an RGB-D video with per-camera image, depth, and pose tuples. From these scenes we build a large-scale dataset of (context images $I_{\text{ctx}}$, motion prompt $p$, target image $I_{\text{tgt}}$) triples for novel-view synthesis.
Two constraints ensure useful training pairs: the coverage ratio of the target view must be at least 0.85, and the context cameras must differ by ≥ 1 m horizontally or vertically, and by ≥ 30° in yaw or pitch.
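The filter above could be expressed as a simple predicate. The sketch reads the stated conditions literally as a conjunction (enough coverage, enough translation, and enough rotation); that reading, the function name, and the choice to pass precomputed offsets rather than raw camera poses are assumptions made for illustration.

```python
def is_valid_pair(
    coverage: float,
    horizontal_m: float,
    vertical_m: float,
    yaw_deg: float,
    pitch_deg: float,
    min_coverage: float = 0.85,
    min_shift_m: float = 1.0,
    min_angle_deg: float = 30.0,
) -> bool:
    """Return True if a candidate pair meets the coverage and baseline thresholds."""
    enough_overlap = coverage >= min_coverage
    # "Horizontally or vertically": either offset reaching the threshold suffices.
    enough_translation = max(abs(horizontal_m), abs(vertical_m)) >= min_shift_m
    # "In yaw or pitch": either angle reaching the threshold suffices.
    enough_rotation = max(abs(yaw_deg), abs(pitch_deg)) >= min_angle_deg
    return enough_overlap and enough_translation and enough_rotation
```

For example, a pair with coverage 0.9, a 1.2 m horizontal offset, and a 45° yaw difference passes, while the same pair with coverage 0.8 is rejected.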
For each valid pair we pick one context camera as the source, compute the relative transform $\Delta T$, decompose it into $(dx, dy, dz, d\theta, d\phi)$, and translate the dominant component into a natural-language prompt. This pipeline yields 544,197 training samples.
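One way to picture the last step, turning the dominant motion component into a prompt, is sketched below. The source does not say how translations in metres are compared with rotations in degrees, which axis or sign conventions are used, or how prompts are worded, so the normalization scales, the direction phrases, and the output format here are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical scales used to compare metres against degrees.
SCALES: Dict[str, float] = {
    "dx": 1.0, "dy": 1.0, "dz": 1.0, "dtheta": 30.0, "dphi": 30.0,
}

# Hypothetical wording: (phrase if positive, phrase if negative, unit).
PHRASES: Dict[str, Tuple[str, str, str]] = {
    "dx": ("move right", "move left", "m"),
    "dy": ("move up", "move down", "m"),
    "dz": ("move forward", "move backward", "m"),
    "dtheta": ("turn left", "turn right", "degrees"),
    "dphi": ("tilt up", "tilt down", "degrees"),
}


def motion_prompt(components: Dict[str, float]) -> str:
    """Describe the largest normalized motion component in plain language."""
    dominant = max(components, key=lambda k: abs(components[k]) / SCALES[k])
    value = components[dominant]
    positive, negative, unit = PHRASES[dominant]
    phrase = positive if value >= 0 else negative
    return f"{phrase} by {abs(value):.1f} {unit}"


prompt = motion_prompt(
    {"dx": 0.2, "dy": 0.0, "dz": 1.5, "dtheta": 10.0, "dphi": 0.0}
)
```

Describing only the dominant component keeps prompts short, at the cost of dropping the smaller residual motion from the instruction.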
Evaluation uses 1,000 samples drawn uniformly (200 per dataset) from the test splits of DL3DV, ScanNet, ScanNet++, Matterport3D, and ARKitScenes, assessing both pose and content consistency of the generated views.
Pose consistency checks whether the generated RGB image matches the commanded camera motion by estimating depth, aligning to the source pose, and comparing the recovered relative transform to the ground‑truth.
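The final comparison step, recovered transform against ground truth, is commonly done with a rotation angle error and a translation distance, as sketched below. Whether the paper uses exactly these two error measures is an assumption, and the depth estimation and pose alignment steps are omitted; the function names are illustrative.

```python
import math
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def rotation_error_deg(r_est: Matrix3, r_gt: Matrix3) -> float:
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_gt^T @ R_est) equals the sum of element-wise products.
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_m(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between estimated and ground-truth translations."""
    return math.dist(t_est, t_gt)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
angle = rotation_error_deg(quarter_turn, identity)
shift = translation_error_m((0.0, 0.0, 1.0), (0.0, 0.0, 0.0))
```

Here `quarter_turn` is a 90° rotation about the third axis, so `angle` comes out at 90 degrees and `shift` at 1 m.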
Content consistency measures object‑level recall/precision and spatial topology agreement between generated and target images, using a VLM to list key categories and GroundingDINO to match detections.
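The object-level part of this metric can be illustrated with a deliberately simplified sketch that treats each image as a set of detected category labels. The actual pipeline relies on a VLM and GroundingDINO and also scores spatial topology, none of which is reproduced here; the function name and the set-based matching are assumptions.

```python
from typing import Iterable, Tuple


def category_precision_recall(
    generated: Iterable[str], target: Iterable[str]
) -> Tuple[float, float]:
    """Precision and recall of generated-image categories against the target."""
    gen, tgt = set(generated), set(target)
    matched = gen & tgt
    # Precision: share of generated categories that also appear in the target.
    precision = len(matched) / len(gen) if gen else 0.0
    # Recall: share of target categories that survive in the generated view.
    recall = len(matched) / len(tgt) if tgt else 0.0
    return precision, recall


precision, recall = category_precision_recall(
    ["chair", "table", "lamp"], ["chair", "table", "sofa", "door"]
)
```

In this example the hallucinated lamp lowers precision to 2/3, and the missing sofa and door lower recall to 1/2.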
We analyze four representative cases, one success and three failure modes, to understand how the agentic imagination interacts with the world simulator.
Case 1 shows the model requesting a new viewpoint that directly resolves an ambiguous spatial relation, allowing it to revise its hypothesis before answering.
Case 2 illustrates correct tool invocation but an uninformative camera motion that leaves the target object out of view, providing no useful evidence.
Case 3 highlights simulator errors: the generated observation deviates from the commanded pose or drops or hallucinates objects, breaking spatial consistency.
Case 4 reveals that even when the simulator returns a useful view, the model may ignore it, confuse image indices, or over‑trust the generated observation, leading to incorrect answers.
During training and inference we employ a unified prompt template that encodes the tool schema, action format, and observation handling for the agentic workflow.
Limitations include over‑use or under‑use of the simulator, generation of visually plausible but unhelpful views, and occasional confusion of original versus generated image indices; future work should improve selective imagination, information‑gain policies, and verification steps.
All raw scene data come from publicly released datasets: ScanNet, ScanNet++, Matterport3D, ARKitScenes, and DL3DV, each governed by its respective license (see references [51]–[52] for details).
Questions & answers
What is the main contribution of the Astra paper?
Astra introduces an agentic spatial reasoning framework in which a VLM policy (Astra-VL) learns to issue camera-motion queries to a world simulator (Astra-WM) that generates spatially consistent novel-view observations, allowing the model to actively resolve spatial ambiguity rather than relying solely on fixed input images.
What problem does Astra address?
Astra addresses the inability of current Vision-Language Models to infer unobserved scene layouts, maintain cross-view consistency, or resolve spatial ambiguities when only a few egocentric images are available, because these models are confined to the static images they receive.
Why is forced tool use insufficient for spatial reasoning?
Forced tool use causes the model to interact with the simulator mechanically without learning when additional evidence is actually needed, which viewpoint is informative, or how to ground the returned observation in the original context.
How does Astra-WM work and how does it differ from a generic image generator?
Astra-WM is a Bagel-based world simulator fine-tuned with view consistency tuning so that generated images preserve scene identity and follow requested camera motions; unlike generic generators such as Bagel, it is conditioned on a concrete natural-language motion instruction and keeps object identities and relative layout stable across the imagined viewpoint.
How is Astra-VL trained?
Astra-VL is trained via a two-phase reinforcement learning curriculum: the first phase teaches valid simulator interaction, and the second phase optimizes for selective imagination, penalizing unnecessary or harmful tool invocations through a negative-gain penalty term.
What is the negative-gain penalty and what does it accomplish?
The negative-gain penalty is formulated as -β·max(0, -Δi) and reduces total reward whenever a simulator call degrades answer quality, causing the policy to learn to stop invoking the simulator in situations where imagined views are unhelpful.
What benchmarks are used to evaluate Astra?
Astra is evaluated on MMSI-Bench and MindCube; the paper also evaluates the world simulator component on 1,000 held-out samples drawn uniformly (200 per dataset) from DL3DV, ScanNet, ScanNet++, Matterport3D, and ARKitScenes.
What are the key quantitative results reported for Astra?
Astra-VL lifts the Qwen3-VL-8B backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube; separately, replacing a generic simulator with Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5.
What dataset is used to train Astra-WM, and how large is it?
Astra-WM is trained on 544,197 (context image, motion prompt, target image) triples constructed from 11,292 scenes collected across ScanNet, ScanNet++, Matterport3D, ARKitScenes, and DL3DV, each represented as an RGB-D video with per-camera image, depth, and pose tuples.
What constraints are applied when building training pairs for Astra-WM?
Valid training pairs require the target view's coverage ratio to be at least 0.85, and the context cameras must differ by at least 1 m horizontally or vertically and by at least 30° in yaw or pitch.
How are pose consistency and content consistency measured for the world simulator?
Pose consistency checks whether the generated image matches the commanded camera motion by estimating depth, aligning to the source pose, and comparing the recovered relative transform to ground truth; content consistency measures object-level recall/precision and spatial topology agreement using a VLM to list key categories and GroundingDINO to match detections.
What are the key implementation details for training Astra-VL?
Astra-VL is fine-tuned from a Qwen3-VL-8B checkpoint with the vision tower frozen, using bfloat16 FSDP, gradient checkpointing, and the Clip-Higher strategy (ε_low=0.2, ε_high=0.28); each rollout permits up to three tool rounds and ten assistant turns, with hyperparameters λ_fmt=0.5, α=0.1, β=0.03.
What failure modes does the paper identify for Astra?
The paper analyzes four representative cases, of which the first is a success and the remaining three are failure modes: (1) successful viewpoint selection that resolves ambiguity, (2) correct tool invocation but an uninformative camera motion leaving the target out of view, (3) simulator errors where the generated view deviates from the commanded pose or hallucinates objects, and (4) the model ignoring a useful returned view, confusing image indices, or over-trusting the generated observation.
What are the stated limitations of Astra?
The paper acknowledges over-use or under-use of the simulator, generation of visually plausible but unhelpful views, and occasional confusion of original versus generated image indices; it identifies improving selective imagination, information-gain policies, and verification steps as future work.
How does Astra compare to simply letting the VLM answer without calling the simulator?
Without the simulator, the policy must guess about viewpoints not present in the original images, leading to systematic errors on ambiguous spatial queries; the paper shows that Astra-VL's agentic approach raises Qwen3-VL-8B from 29.8 to 38.8 on MMSI-Bench over the static baseline.
What ablation factors does the paper isolate?
The ablation study isolates three factors: simulator consistency (spatially trustworthy view generation), the two-phase RL curriculum, and the inference-time workflow mode.
What venue, authors, and date are associated with this paper?
The paper does not specify author names or a publication venue in the provided text; it is available at arxiv.org/abs/2606.06476, but the paper does not state a submission or publication date.
Key terms
- VLM (Vision-Language Model)
- A neural model trained to process and reason over both images and text jointly.
- Astra
- The full agentic spatial reasoning framework introduced in the paper, combining the Astra-VL policy and the Astra-WM world simulator.
- Astra-VL
- The reinforcement-learning-trained VLM policy component of Astra that decides when and how to query the world simulator for novel-view observations.
- Astra-WM
- The world simulator component of Astra, built on Bagel and fine-tuned with view consistency tuning to generate spatially consistent novel-view images from natural-language camera-motion commands.
- Bagel
- A pre-existing image generation model used as the base architecture for Astra-WM, which the paper notes aims for photorealism but does not inherently preserve camera pose or scene layout.
- MMSI-Bench
- A benchmark used in the paper to evaluate multi-view spatial reasoning performance of VLMs.
- MindCube
- A second spatial reasoning benchmark used in the paper to evaluate Astra-VL's performance.
- Novel-view synthesis
- The task of generating a realistic image of a scene from a camera position or angle not present in the original input images.
- View consistency tuning
- A fine-tuning procedure applied to Astra-WM to ensure that generated images preserve scene identity and accurately reflect the requested camera motion.
- Two-phase RL curriculum
- A reinforcement learning training schedule for Astra-VL in which the first phase teaches valid simulator interaction and the second phase optimizes for selective, beneficial use of the simulator.
- Negative-gain penalty
- A reward term (-β·max(0, -Δi)) that penalizes the policy whenever invoking the simulator reduces answer quality, discouraging unnecessary tool calls.
- Clip-Higher strategy
- A reinforcement learning clipping approach used during Astra-VL training, parameterized by a lower clipping bound ε_low=0.2 and a higher clipping bound ε_high=0.28.
- Pose consistency
- A metric that checks whether a generated image matches the commanded camera motion by estimating depth, aligning to the source pose, and comparing the recovered relative transform to ground truth.
- Content consistency
- A metric that measures object-level recall/precision and spatial topology agreement between a generated image and the target image using a VLM and GroundingDINO.
- GroundingDINO
- An object detection model used in the paper to match detected objects between generated and target images when computing content consistency.
- FSDP (Fully Sharded Data Parallel)
- A distributed training strategy that shards model parameters, gradients, and optimizer states across devices to enable training of large models.
- Egocentric images
- Images captured from a first-person or agent-centered viewpoint, typically providing a limited field of view of the surrounding scene.
- Coverage ratio
- A threshold metric used in dataset construction requiring that at least 85% of the target view is visible from the source context, ensuring useful training pairs.
- ScanNet / ScanNet++ / Matterport3D / ARKitScenes / DL3DV
- Five publicly released RGB-D scene datasets used to build the training and evaluation data for Astra-WM.