SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong

SPATIALWORLD is a unified, simulator-agnostic benchmark for evaluating the interactive spatial reasoning of multimodal agents in 3D environments.

How can we move beyond static image-based benchmarks to rigorously evaluate the active, multi-step spatial reasoning capabilities of multimodal agents in 3D environments?

Current spatial reasoning benchmarks rely on static image analysis or simulator-specific pipelines, failing to capture how agents must actively explore and update their beliefs in dynamic, partially observable 3D spaces. The authors introduce SPATIALWORLD, a unified evaluation framework that abstracts eight heterogeneous simulators into a shared, text-based interface where agents solve tasks using only egocentric RGB observations. Across 15 state-of-the-art models, performance remains low: the strongest model, GPT-5, achieves an average task success rate of only 17.4%, revealing a persistent gap in long-horizon planning and interactive spatial intelligence.

Paper Primer

SPATIALWORLD forces agents to solve 760 human-annotated tasks by treating the environment as a partially observable Markov decision process. The core move is the creation of a standardized, simulator-agnostic I/O bottleneck: agents receive only raw egocentric screenshots and must output high-level text commands, effectively decoupling the agent's reasoning engine from the underlying physics engine.

Current multimodal agents lack robust interactive spatial reasoning.

Evaluation of 15 advanced models across 760 tasks in eight diverse 3D environments. The top-performing model, GPT-5, achieved a task success rate of only 17.4%, with the leading open-source model, Qwen-3.5, reaching 14.1%.

Task success does not correlate with execution efficiency.

Comparison of Task Success Rate (TSR) and Step Efficiency (SE) metrics. Models with similar success rates show divergent efficiency, indicating that some agents rely on redundant trial-and-error rather than efficient planning.

Why is a unified, simulator-agnostic protocol necessary for this evaluation?

Existing benchmarks are often coupled to specific simulators or sensor assumptions, making it impossible to determine if an agent's success reflects general spatial reasoning or merely adaptation to a particular environment's quirks.

How does this benchmark differ from standard Visual Question Answering (VQA) datasets?

Unlike static VQA, which tests scene recognition, SPATIALWORLD requires closed-loop interaction where agents must actively navigate and manipulate the environment to gather visual evidence over time.

The benchmark's reliance on terminal-state verification—rather than trajectory matching—allows for diverse, open-ended reasoning paths, making it a rigorous testbed for autonomous agents.

Researchers should shift focus from static perception to active exploration; SPATIALWORLD provides the standardized testbed needed to diagnose specific bottlenecks in long-horizon planning and spatial grounding.

Introduction: The Need for Active Spatial Benchmarking

SpatialWorld redefines evaluation by demanding active, closed‑loop reasoning in diverse 3D settings.

Current spatial benchmarks treat agents like static observers, asking a single question about a frozen scene. This passive setup cannot reveal whether a model can explore, update its belief, and execute a multi‑step plan in a partially observable world.

To measure true embodied spatial reasoning we need three ingredients: (1) agents see only egocentric RGB frames, (2) they issue high‑level commands through a plain‑text action space, and (3) the same interface works across any 3D simulator.

The result is a unified, simulator‑agnostic loop that feeds raw visual observations to a multimodal model, receives text‑based actions, and verifies the resulting state.

SPATIALWORLD aggregates 760 tasks covering daily routines, work, entertainment, travel, social collaboration, and 3D games, providing a broad testbed for active spatial understanding.

**Figure 1.** SpatialWorld is a scalable, general-purpose evaluation framework for multimodal agents, supporting end-to-end task solving and structured plan generation. It unifies diverse 3D backends under a standardized observation-action interface, enabling rigorous assessment of interactive spatial reasoning via reproducible benchmarks and automated efficiency metrics.

**Figure 10.** Why GPT-5 currently outperforms GPT-5.4. GPT-5 achieves higher shared-task TSR in most physical environments, while GPT-5.4 exhibits a stronger tendency toward premature termination. The step-count plots further show that GPT-5 typically spends more actions both when it succeeds and when it fails, consistent with a slower but more persistent search strategy.

The field must shift from static VQA toward active, embodied evaluation to truly gauge spatial reasoning.

The SpatialWorld Benchmark Protocol

Defines the experimental setup, variables, and metrics for evaluating active spatial reasoning.

The benchmark protocol operationalizes the four design principles by specifying what varies, what stays fixed, and which outcomes are recorded.

$TSR$ measures the proportion of evaluation episodes in which the agent achieves the prescribed goal.

How does $TSR$ differ from a simple accuracy score on a static image QA benchmark?

$TSR$ evaluates an agent’s ability to act in a dynamic environment until a goal state is reached, whereas static accuracy only checks a single prediction against a fixed label.

$SE$ quantifies how economically an agent traverses the environment to accomplish a task.

Why isn’t $SE$ simply the inverse of trajectory length?

Because $SE$ normalizes by the optimal (shortest‑possible) path length for the same start‑goal pair, making it comparable across tasks of varying difficulty.

Select a set of environments spanning the unified cross‑platform interface.

For each environment, sample a collection of goal‑directed tasks.

Initialize the agent with no privileged state (pure egocentric vision).

Run the agent in closed‑loop mode, allowing it to issue actions until termination.

Record a binary success flag and the traversed path length for each episode.

Compute $TSR$ as the fraction of successful episodes and $SE$ as the average optimal‑to‑actual path‑length ratio.

Aggregate results across environments to obtain the final benchmark scores.

The shortest path from the start position to the door is 5 m. The agent moves forward 2 m, turns left and moves 1 m, then proceeds straight 3 m to the door.

Total traversed distance = 2 m + 1 m + 3 m = 6 m.

Success flag = 1 (door opened).

$SE = 5\,\text{m} / 6\,\text{m} \approx 0.83$; $TSR = 1/1 = 100\%$ for this single episode.

This example shows that an agent can succeed (high $TSR$) while still being inefficient (sub‑optimal $SE$), highlighting why both metrics are needed.
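The worked example can be reproduced with a few lines of Python. This is a minimal sketch, not the benchmark's evaluation code: the `Episode` record and both function names are hypothetical, and averaging SE over successful episodes only is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    success: bool          # binary flag from the terminal-state check
    actual_length: float   # path length the agent actually traversed (m)
    optimal_length: float  # shortest-possible length for the same start-goal pair (m)


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the goal was reached."""
    return sum(e.success for e in episodes) / len(episodes)


def efficiency(episodes: List[Episode]) -> Optional[float]:
    """Mean optimal-to-actual ratio over successful episodes (assumed aggregation)."""
    ratios = [e.optimal_length / e.actual_length for e in episodes if e.success]
    return sum(ratios) / len(ratios) if ratios else None


# The single episode from the example: 6 m traversed, 5 m optimal, door opened.
episodes = [Episode(success=True, actual_length=6.0, optimal_length=5.0)]
tsr = task_success_rate(episodes)
se = efficiency(episodes)
print(f"TSR = {tsr:.0%}, SE = {se:.2f}")  # TSR = 100%, SE = 0.83
```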

System Architecture and Data Pipeline

Describes the modular pipeline, observation, and unified action design of SpatialWorld.

The SpatialWorld architecture isolates environment execution from agent reasoning via a five‑component closed‑loop pipeline, with strict observation and action interfaces that standardize interaction across heterogeneous simulators.

The pipeline stitches together environment, verification, observation, action, and agent modules so that each step receives a clean observation, emits a high‑level decision, and feeds back deterministic success signals.

How does this closed‑loop differ from a typical reinforcement‑learning loop?

Standard RL loops treat the environment as a black box and rely on dense reward signals; the SpatialWorld loop replaces dense rewards with a single deterministic verification step and forces the agent to act on a vision‑only observation, removing any privileged state access.

All simulators expose a single symbolic action set $\mathcal{A}$ that groups navigation, viewpoint/posture, interaction, and task‑control primitives under a common API.

Why not expose simulator‑specific actions directly?

Direct exposure would tie the policy to a single engine, preventing transfer; the unified set abstracts away low‑level differences while preserving the expressive power needed for embodied tasks.
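One way to picture this abstraction is a lookup from (backend, primitive) pairs to translator functions. The sketch below is illustrative only: `UnifiedAction`, `TRANSLATORS`, `translate`, the backend names `backend_a` and `backend_b`, and every command string are hypothetical and do not correspond to any real simulator API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class UnifiedAction:
    name: str                    # symbolic primitive, e.g. "Move"
    args: Tuple[str, ...] = ()   # small parameter tuple


# (backend, primitive) -> translator producing a backend-specific command string.
# Backend names and command formats are invented for illustration.
TRANSLATORS: Dict[Tuple[str, str], Callable[[UnifiedAction], str]] = {
    ("backend_a", "Move"): lambda a: f"step(direction={a.args[0]!r}, size={a.args[1]!r})",
    ("backend_b", "Move"): lambda a: f"advance_route({a.args[1]!r})",
    ("backend_a", "Interact"): lambda a: f"apply({a.args[0]!r}, target={a.args[1]!r})",
}


def translate(backend: str, action: UnifiedAction) -> str:
    """Map a unified action to the command the given backend would execute."""
    try:
        translator = TRANSLATORS[(backend, action.name)]
    except KeyError:
        raise ValueError(f"{action.name} is not supported by {backend}") from None
    return translator(action)


# The same symbolic action yields different backend commands.
move = UnifiedAction("Move", ("forward", "Medium"))
cmd_a = translate("backend_a", move)
cmd_b = translate("backend_b", move)
```

The agent only ever emits `UnifiedAction` values, so swapping the backend changes the translator table and nothing on the policy side.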

Environment Interface loads the initial state (either a saved snapshot or an action list).

Verification checks that the loaded state satisfies task preconditions.

Observation Interface renders an egocentric RGB frame at native resolution.

Agent Module consumes the frame and outputs a symbolic action $a \in \mathcal{A}$.

Action Interface maps $a$ to the backend’s command API and executes it.

Verification re‑evaluates the terminal condition; if false, loop back to Observation.

**Figure 3.** The Observation and Action Interfaces. (a) Flexible environment initialization via direct state loading or action-list execution. (b) A unified interface providing standardized egocentric RGB observations. (c) A structured, unified action space $\mathcal{A}$. (d) Action-to-code mapping that translates unified actions into environment-specific commands, enabling cross-simulator deployment.

Together, the closed‑loop pipeline, vision‑only observation, and unified action space define a reproducible, simulator‑agnostic testbed for embodied spatial reasoning.
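The observe, act, verify cycle described above can be sketched as a short loop. Everything here is a stand-in: `ToyCorridor` replaces a real 3D backend, frames are strings rather than RGB images, and the action strings `"Move(forward)"` and `"Wait"` are hypothetical syntax chosen for this example.

```python
from typing import Callable, Tuple


class ToyCorridor:
    """Stand-in backend: the task is solved once the agent is `goal` cells ahead."""

    def __init__(self, goal: int):
        self.goal = goal
        self.position = 0

    def render(self) -> str:
        # A real backend would return an egocentric RGB frame.
        return f"frame@{self.position}"

    def execute(self, action: str) -> None:
        # Action interface: only one primitive changes state in this toy world.
        if action == "Move(forward)":
            self.position += 1

    def verify(self) -> bool:
        # Terminal-state check: only the resulting state matters, not the path.
        return self.position == self.goal


def run_episode(env: ToyCorridor, agent: Callable[[str, str], str],
                task: str, max_steps: int) -> Tuple[bool, int]:
    """Observe, act, verify; repeat until success or the step budget runs out."""
    steps = 0
    while steps < max_steps:
        frame = env.render()          # observation interface
        action = agent(task, frame)   # agent module: text in, text action out
        env.execute(action)           # action interface
        steps += 1
        if env.verify():              # verification after every step
            return True, steps
    return False, steps


walker = lambda task, frame: "Move(forward)"
idler = lambda task, frame: "Wait"
solved = run_episode(ToyCorridor(goal=3), walker, "reach the door", max_steps=16)
stuck = run_episode(ToyCorridor(goal=3), idler, "reach the door", max_steps=16)
```

Here `solved` is `(True, 3)` and `stuck` is `(False, 16)`: the agent never sees `position`, only the rendered frame and the task text.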

Experimental Setup

Experimental setup details the models, inputs, benchmark scale, and evaluation protocol.

The step budget is set dynamically to $2g + 10$ instead of a fixed limit, where $g$ is the golden action count annotated for each task.
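As a quick illustration of how the budget scales with task length, the helper below evaluates $2g + 10$ for a few golden action counts. The function name `step_budget` and the sample values of $g$ are chosen here for illustration and do not come from the benchmark.

```python
def step_budget(golden_actions: int) -> int:
    """Per-task step limit: twice the golden action count plus ten."""
    if golden_actions < 0:
        raise ValueError("golden action count must be non-negative")
    return 2 * golden_actions + 10


# Short tasks keep a floor of slack; long tasks get proportionally more room.
budgets = {g: step_budget(g) for g in (3, 10, 25)}
print(budgets)  # {3: 16, 10: 30, 25: 60}
```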

We benchmark fifteen state-of-the-art MLLMs, including the Qwen, GLM, Kimi, Gemini, GPT, and Seed series, using their official APIs or open-weight checkpoints without any task-specific fine‑tuning. Each model receives an egocentric RGB screenshot and a natural‑language task description at every step, with no privileged state information. Evaluation runs on the full SPATIALWORLD suite of 760 tasks across eight simulated environments, reporting Task Success Rate (TSR) and Step Efficiency (SE) aggregated over all trajectories.

Performance Results and Analysis

MLLMs still fall far short of human spatial ability, with modest gains across domains.

The SpatialWorld benchmark evaluates embodied agents with a closed‑loop protocol, measuring both task success (TSR) and path efficiency (SE). This section reports how current MLLMs perform across those axes.

State‑of‑the‑art MLLMs achieve at most 14.4 % Physical Overall TSR, far below human‑level spatial competence.

Table 3 shows GPT‑5 leading with 14.4 % while the next best, Qwen‑3.5‑397B‑A17B, reaches 12.2 %.

GPT‑5.4 attains higher Step Efficiency than Kimi‑K2.5 despite comparable TSR.

Table 4 reports SE = 0.569 for GPT‑5.4 versus 0.486 for Kimi‑K2.5, while their Physical Overall TSRs are 9.2 % and 6.6 % respectively.

**Table 3.** Performance Evaluation. Main-benchmark TSR (%) across task categories for 15 evaluated models. Bold and underlined entries denote the best and second-best per column. Physical categories follow the benchmark scenario taxonomy; digital corresponds to the 3D game suite. Physical Overall is the weighted average of Daily, Work, Entertain., Travel, and Social categories.

**Figure 5.** Indoor and outdoor physical domains. Overall TSR across indoor and outdoor physical environments, with environment-level bars for the top-five models in each domain.

**Table 4.** Efficiency Evaluation. Main-benchmark SE across task categories. Bold and underlined denote the best and second-best per column. Physical Overall is the weighted mean over successful valid physical trajectories; - indicates no successful trajectory.

**Figure 6.** **Complexity profile.** Task counts, mean TSR, and mean SE across the three parallel complexity modes in the physical benchmark.

**Table 7.** Performance of Game. TSR (%) by game families. Bold and underlined entries denote the best/second-best values in each column. B3D denotes Block3D, and M3D denotes Maze3D.

Ablation Studies

Ablation results reveal model‑specific optimal temperature, history window, and action parameterization.

We evaluate three inference‑time knobs—sampling temperature, history‑window length, and action‑parameterization—to expose model‑specific sensitivities.

Sampling temperature $\tau\!=\!1.0$ yields the highest TSR for every model except Gemini‑3‑Flash.

Figure 7a shows a clear peak at $\tau\!=\!1.0$ for every evaluated model except Gemini‑3‑Flash, which attains its best TSR at $\tau\!=\!0.7$.

A history window of $w\!=\!30$ frames maximizes TSR for the majority of models.

Figure 7b illustrates monotonic improvement up to $w\!=\!30$, after which performance plateaus or slightly declines.

Action‑parameterization shows no universal winner: $\Delta$TSR (continuous − discrete) is positive for some models (e.g., Gemini‑3‑Flash) and negative for others (e.g., Gemini‑2.5), so we adopt discrete actions for the benchmark.

Observation Sensitivity Analysis

Ablation studies on resolution and environment diversity show stable observation sensitivity and broad spatial coverage.

Observation Sensitivity measures how a model’s spatial reasoning degrades when the visual input is altered, e.g., by lowering resolution.

How does Observation Sensitivity differ from a standard accuracy metric?

Accuracy aggregates all sources of error, while Observation Sensitivity isolates the impact of the visual input alone, revealing whether spatial reasoning is brittle to observation degradation.

The Environment Suite aggregates eight distinct 3D backends behind a unified API, letting agents interact with a common observation and action space while each backend preserves its native scale, dynamics, and object affordances.

Failure Mode Analysis

This section outlines the benchmark’s practical constraints, resource costs, and observed failure patterns.

Existing spatial benchmarks evaluate agents passively; SpatialWorld instead closes the loop by requiring agents to act, observe, and adapt within a modular 3‑D environment.

GPT‑5.4 often terminates early, under‑travels, and exits before completing the second subgoal, whereas GPT‑5 takes extra navigation steps, proceeds more slowly, but ultimately reaches both required waypoints.

Human annotators construct reference trajectories during benchmark creation; these trajectories serve as validation artifacts to confirm task feasibility rather than as a separate leaderboard.

Like most embodied‑AI benchmarks, SPATIALWORLD runs in simulated, near‑photorealistic environments; extending evaluation to physical robots remains future work, and the current 760 handcrafted tasks are modest compared with automatically generated datasets.

SPATIALWORLD is intended as a diagnostic tool to surface failure modes of multimodal agents; however, improved spatial reasoning could be misused for surveillance or unintended physical manipulation, so the community should develop safety guidelines.

Evaluating proprietary models via APIs and open‑source models on an 8× NVIDIA H200 GPU server consumed roughly 5 000 GPU‑hours in total.

The authors used an OpenAI LLM (GPT‑5) solely for grammar, phrasing, and layout edits; it did not contribute to research design, implementation, or analysis, and all substantive work was performed by the authors.

Several failure modes emerged: agents often suffered spatial disorientation, leading to premature termination after repeated ineffective moves; they also exhibited object hallucination, repeatedly attempting actions on nonexistent objects, causing action loops.

**Figure 9.** Observation Sensitivity Analysis under the Same Viewpoint with Varying Resolutions. We progressively increase the resolution ratio along the x-axis, reaching the highest clarity at 1.0.

Related Work and Conclusion

We contextualize prior multimodal agents, simulators, and spatial reasoning benchmarks, then recap our key findings.

Related work clusters around three themes: multimodal agents, 3D simulation platforms, and spatial‑reasoning evaluation.

Multimodal agents combine unified perception of text, images, and video with multi‑step planning to accomplish complex tasks via tool use or direct actions. Early efforts split into two streams: strengthening foundation models for richer representations and building agentic pipelines that enable iterative planning in interactive worlds.

Numerous 3D simulators support spatial reasoning and embodied decision‑making. AI2‑THOR offers near‑photorealistic indoor scenes with object affordances; Habitat provides a modular, sensor‑configurable framework for navigation and embodied QA; CARLA and MetaDrive simulate urban traffic with realistic vehicle dynamics. Recent extensions broaden scenario diversity, yet most platforms remain domain‑specific and expose heterogeneous task definitions.

Spatial reasoning for multimodal agents entails grounding goals in egocentric observations, maintaining an evolving belief about object locations, and planning under partial observability. Prior benchmarks largely rely on static 2D VQA or fixed‑camera video, while newer suites probe 3D and video streams. Existing multi‑turn benchmarks either restrict agents to 2D screens or adopt disparate embodied interfaces, making cross‑domain comparison difficult. SPATIALWORLD addresses this gap with a closed‑loop protocol that evaluates agents on egocentric RGB input and high‑level instructions, measuring success via terminal‑state verifiers.

In conclusion, SPATIALWORLD provides a unified testbed that isolates active egocentric exploration and decision‑making under partial observability. Evaluations of fifteen leading multimodal language models reveal a stark weakness: while static perception is strong, dynamic physical interaction yields low task success rates, poor spatial efficiency, and high variance across domains. These results highlight the need to shift research toward robust interactive spatial reasoning.

**Figure 8.** (a) Social collaboration TSR on Multi-AI2THOR and Multi-ProcTHOR; the dark marker denotes the pooled social score. (b) Resolution-scale probe on AI2THOR. (c) Camera-field-of-view probe on AI2THOR.

Benchmark Construction Details

Supplementary benchmark details and the full comparison table for SPATIALWORLD.

This appendix expands the benchmark construction and evaluation described earlier. It supplies a complete benchmark comparison table, task‑specific evaluation criteria, and fine‑grained performance breakdowns across indoor/outdoor settings and digital game families.

Table 5 contrasts SPATIALWORLD with prior ImageQA, VideoQA, and embodied‑agent benchmarks along five dimensions: unified cross‑platform interface, interactive environment, first‑person observation, vision‑only input, and language‑form output.

**Table 5.** Detailed spatial benchmark comparison. Extended version of Table 1, including the full set of representative ImageQA, VideoQA, and embodied-agent benchmarks used to motivate the benchmark construction.

Beyond the table, the appendix provides per‑environment performance numbers for indoor versus outdoor scenes and for each digital game family, enabling researchers to pinpoint strengths and weaknesses of individual agents.

Task-Specific Evaluation Adaptations

Supplementary evaluation details for task‑specific scores and environment breakdowns.

In the Snake3D environment a binary success flag is too coarse, so performance is reported as a scale‑normalized discrete score. The score is computed by dividing the agent’s achieved snake length by the spatial edge length of the arena, yielding a finer‑grained progress measure.
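A minimal sketch of this normalization is shown below. The function name `snake_score` is hypothetical, and the cap at 100 % follows the description of Table 7 later in this appendix; the exact rounding or discretization used by the benchmark is not specified here.

```python
def snake_score(snake_length: int, edge_length: int) -> float:
    """Snake length divided by arena edge length, as a percentage capped at 100."""
    if edge_length <= 0:
        raise ValueError("edge length must be positive")
    return min(snake_length / edge_length, 1.0) * 100.0


half = snake_score(4, 8)     # 50.0: the snake spans half the arena edge
capped = snake_score(12, 8)  # 100.0: longer snakes do not exceed the cap
```

Dividing by the edge length makes scores comparable across arenas of different sizes, which a raw length would not be.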

Table 6 splits the benchmark into indoor (AI2THOR, ProcTHOR, VirtualHome) and outdoor (CARLA, EmbodiedCity) domains, omitting multi‑agent settings which are covered elsewhere. This granularity shows that GPT‑5 and Qwen‑3.5‑397B‑A17B excel on indoor tasks that require precise object grounding, while GPT‑5 and Gemini‑3‑Flash lead outdoors where long‑range navigation is critical.

**Table 6.** **Indoor-outdoor.** The TSR (%) of the single-agent physical benchmark across indoor and outdoor environments. Bold and underlined entries denote the best and second-best values in each column, respectively. Multi-agent environments are excluded here and analyzed separately in Figure 8a. The overall columns pool the environments within each domain and are located at the right edge of each domain group.

Table 7 reports per‑game‑family TSR across five digital games, with Snake scores normalized by edge length and capped at 100 %. Gemini‑3.1‑Pro attains the highest overall TSR (39 %) driven by strong Block3D (40 %) and Snake (90 %) results, while Gemini‑3‑Flash dominates Rubik’s Cube (50 %). Conversely, Qwen3‑VL‑235B‑Thinking shines on Maze (70 %) and Maze3D (32 %), illustrating that topological traversal tasks favor different architectural strengths.

Across all environments the consistently low scores on Rubik’s Cube and Block3D expose a fundamental bottleneck: current agents struggle with multi‑step spatial manipulation and complex state transformations. This pattern suggests that future work should prioritize explicit geometric reasoning capabilities to close the gap.

Human Annotation Pipeline

Appendix details the annotation pipeline, unified action space, and supplemental model analyses.

The annotation effort for SpatialWorld follows a strict three‑stage workflow. First, annotators author a natural‑language instruction and set the initial scene configuration. Second, they independently execute the task in the simulator, recording the terminal state and a reference action trajectory. Third, a cross‑check validates feasibility, instruction clarity, and script correctness.

**Figure 2.** Data construction pipeline of SpatialWorld. We first collect a series of environments, have annotators learn tutorials and write instructions, define success conditions, and then calibrate the data through automated execution validation in virtual environments and human cross-validation.

**Table.** Examples of task instructions and their corresponding success conditions in a simulated environment.

The unified action interface abstracts low‑level simulator commands into high‑level text primitives that MLLMs can emit directly. It defines four canonical groups—Navigation, Viewpoint & Posture, Interaction, and Task‑Control—each with a small set of parameters. The special value 0 denotes an explicit no‑motion or wait decision, useful for scenarios such as pausing at a traffic light.

**Table.** Unified action space for MLLM agents.

GPT‑5 solves 78 of the 609 shared tasks (12.8 %) while GPT‑5.4 solves 40 (6.6 %). The gap (GPT‑5 minus GPT‑5.4) is largest in VirtualHome (+28.9 points) and turns negative in EmbodiedCity (‑3.8 points), where GPT‑5.4 is ahead. GPT‑5’s advantage stems from a more persistent termination policy: it continues exploring until verification succeeds, whereas GPT‑5.4 often issues premature DONE or FAIL decisions.

Qualitative inspection of 100 failed trajectories reveals three dominant failure modes: Spatial Disorientation, Object Hallucination, and Premature Termination. Each mode maps to specific unified actions—e.g., “Move” failures for disorientation or “Interact” failures for hallucination. Understanding these patterns guides future improvements to spatial reasoning and action selection.

Environment Action Specifications

Qualitative failure analysis and environment action specifications.

Move(...,0) is defined as a no‑op wait. Step sizes vary by environment: AI2‑THOR, ProcTHOR, and VirtualHome use Small = 0.25 m, Medium = 0.5 m, Large = 1 m; EmbodiedCity uses 0.5 / 2.0 / 5.0 m; CARLA uses coarse route‑progress steps (4 / 10 m for vehicles, 3 / 10 m for pedestrians). Continuous ProcTHOR also supports exact numeric meters.
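These step sizes can be held in a small lookup table. The sketch below is an illustration, not the benchmark's configuration: the names `STEP_SIZES_M` and `move_distance` are hypothetical, mapping EmbodiedCity's three values onto the Small/Medium/Large labels in order is an assumption, and CARLA is omitted because its route-progress steps do not map cleanly onto three labels.

```python
from typing import Dict, Union

_THOR_LIKE = {"Small": 0.25, "Medium": 0.5, "Large": 1.0}

# Step sizes in metres per backend, taken from the paragraph above.
STEP_SIZES_M: Dict[str, Dict[str, float]] = {
    "AI2-THOR": dict(_THOR_LIKE),
    "ProcTHOR": dict(_THOR_LIKE),
    "VirtualHome": dict(_THOR_LIKE),
    # Assumed ordering: 0.5 / 2.0 / 5.0 m correspond to Small / Medium / Large.
    "EmbodiedCity": {"Small": 0.5, "Medium": 2.0, "Large": 5.0},
}


def move_distance(backend: str, size: Union[str, int]) -> float:
    """Distance in metres for a Move with the given size; 0 is the no-op wait."""
    if size == 0:
        return 0.0
    return STEP_SIZES_M[backend][size]


indoor_medium = move_distance("AI2-THOR", "Medium")       # 0.5
outdoor_large = move_distance("EmbodiedCity", "Large")    # 5.0
wait = move_distance("ProcTHOR", 0)                       # 0.0
```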

Viewpoint and body‑stance granularity depend on the backend. Defaults are 90° rotation and 30° tilt in AI2‑THOR/ProcTHOR, 30°/90° in VirtualHome, and 5°/15°/45° in EmbodiedCity. Angles are freely tunable in continuous settings.

The framework subsumes grasping, placement, and manipulation under an object‑centric interface. AI2‑THOR and ProcTHOR expose the broadest set of persistent state changes (e.g., cook, slice, fill); VirtualHome implements a smaller subset (Grab, SwitchOn, Drink). Backend‑specific names are merely wrappers.

EndTask triggers evaluator verification of the terminal goal; this mechanism is exposed only in CARLA. The Communicate channel is active exclusively in collaborative multi‑agent tasks and uses structured output tags.

Action Loop is defined as the agent cycling through the same ineffective actions until the step budget is exhausted. We select four representative bad cases—spatial disorientation, object hallucination, premature termination, and action loop—for detailed qualitative analysis.
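A simple heuristic for flagging action loops in logged trajectories is sketched below. It is not the authors' analysis procedure: `ends_in_action_loop`, its thresholds, and the action strings are hypothetical, and the check only looks for a repeating action pattern at the end of a budget-exhausting trajectory without confirming that the actions were actually ineffective.

```python
from typing import Sequence


def ends_in_action_loop(actions: Sequence[str], step_budget: int,
                        max_period: int = 4, min_repeats: int = 3) -> bool:
    """True if the budget was exhausted and the tail repeats a short action cycle."""
    if len(actions) < step_budget:
        return False  # the budget was not exhausted, so this is not an action loop
    for period in range(1, max_period + 1):
        window = period * min_repeats
        if len(actions) < window:
            continue
        tail = actions[-window:]
        # The tail is a loop if every element matches the first cycle of length `period`.
        if all(tail[i] == tail[i % period] for i in range(window)):
            return True
    return False


# One forward move followed by three left/right rotation cycles.
trajectory = ["Move(forward)"] + ["Rotate(left)", "Rotate(right)"] * 3
looped = ends_in_action_loop(trajectory, step_budget=7)      # True
unfinished = ends_in_action_loop(trajectory, step_budget=20) # False
```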

Figures 11‑14 present failure cases for state‑of‑the‑art models across AI2‑THOR, VirtualHome, CARLA, and ProcTHOR. GPT‑5 exhibits spatial disorientation and premature termination, Gemini‑3.1‑Pro shows object hallucination and action loop, and Qwen‑3.5‑397B‑A17B also suffers from action loops. Table 10 contrasts GPT‑5 with GPT‑5.4, highlighting that the latter often terminates early while the former continues probing until verification succeeds.

**Table 10.** Representative GPT-5 vs. GPT-5.4 disagreement cases. These examples illustrate the recurring pattern that GPT-5.4 often terminates after a short or partial trajectory, while GPT-5 spends more actions and eventually satisfies the verifier.
