SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

SpatialClaw uses a persistent Python kernel as an action interface to enable flexible, multi-step spatial reasoning.

How can an agentic action interface improve spatial reasoning in vision-language models beyond standard code generation or structured tool-calling?

Vision-language models struggle with complex 3D spatial reasoning because they lack a way to iteratively compose and verify geometric evidence from pixels. SpatialClaw replaces rigid tool-call menus or one-shot code generation with a persistent Python kernel, allowing the agent to write, inspect, and revise code across multiple steps. This approach achieves 59.9% average accuracy across 20 benchmarks, outperforming existing spatial agents by 11.2 points without task-specific tuning.

Paper Primer

Existing spatial agents are bottlenecked by their action interfaces: single-pass code execution forces the model to commit to a strategy before seeing intermediate results, while structured tool-calls limit the agent to a predefined menu of operations. SpatialClaw treats code as an orchestration space: the agent writes one executable Python cell per step, and the kernel preserves all variables—such as masks, depth maps, and point clouds—for use in subsequent reasoning steps.

SpatialClaw significantly improves performance on complex spatial reasoning tasks compared to structured tool-call interfaces.

Across 20 benchmarks, SpatialClaw achieved an average accuracy of 59.9%, a +11.2 point gain over the next best spatial agent (SpaceTools).

The agent spontaneously adapts its tool composition to the question type, using scientific libraries like NumPy and SciPy to perform geometric computations that were not explicitly anticipated by the system designers. This flexibility is most effective in dynamic 4D video reasoning and multi-view inference, where the agent must chain geometric operations across frames and viewpoints.

Why is a persistent Python kernel superior to a structured tool-call interface for spatial reasoning?

Structured tool-calls limit the agent to predefined command schemas, making it difficult to compose perception outputs in ways not anticipated at design time. A persistent kernel allows the agent to treat perception outputs as ordinary variables, enabling iterative inspection, revision, and custom composition using standard scientific libraries.

Does SpatialClaw require model-specific tuning or reinforcement learning to achieve these gains?

No; SpatialClaw is a training-free framework. It uses a unified system prompt that encodes general principles of spatial reasoning, allowing it to transfer across different VLM backbones (ranging from 27B to 397B parameters) without modification.

For agentic spatial reasoning, the design of the action interface is as critical as the model backbone; providing an expressive, stateful environment for code execution allows agents to solve complex geometric problems that fixed-API systems cannot compose.

The Spatial Reasoning Challenge

We expose how static tool interfaces limit VLM spatial reasoning and introduce SpatialClaw’s persistent code loop.

Vision-Language Models (VLMs) excel at joint image‑text tasks but struggle with spatial reasoning—determining where objects are, how they move, and how they relate in three dimensions. This shortfall stems from static action interfaces that either force a one‑shot program or restrict the agent to a fixed set of tool calls, preventing inspection of intermediate visual evidence and dynamic composition of perception results. By treating code as a persistent, multi‑turn interface, SpatialClaw enables agents to iteratively plan, execute, and revise their analysis, directly addressing the brittleness of static VLM reasoning.

An Action Interface specifies how an agent invokes perception tools and how their outputs are represented for subsequent reasoning steps.

**Figure 1.** **SpatialClaw improves spatial reasoning across the board.** Per-benchmark accuracy on 20 spatial reasoning benchmarks (Gemma 4-31B backbone), split into two panels by task category. Each axis is individually rescaled so SpatialClaw traces the constant-radius ring. Baselines are SpaceTools-Toolshed (Chen et al., 2026), pySpatial (Luo et al., 2026), and a no-tool backbone.

**Figure 2.** Comparison of different approaches for solving a spatial reasoning task. (a) Single-pass code generation, (b) Structured tool-call sequence, and (c) SpatialClaw (Ours), which incorporates intermediate verification and iterative refinement.

Static VLM reasoning cannot adapt to intermediate visual evidence, forcing brittle one‑shot analysis.

The SpatialClaw Framework

SpatialClaw enables agents to iteratively generate and run code, inspecting visual evidence at each step.

Prior spatial agents rely on either a single‑pass code execution or a fixed set of structured tool calls. Both approaches force the agent to commit to a full analysis before any intermediate visual evidence is observed. This limitation motivates a third interface that treats code generation itself as a flexible, iterative action.

SpatialClaw treats code generation as the action interface, letting an agent write one Python cell at a time, run it in a persistent kernel, and condition the next cell on the resulting variables, images, and text feedback.

How does this differ from the single‑pass code approach used in earlier agents?

In the single‑pass setting the entire program is written before any output is seen, so the agent cannot react to intermediate masks or errors. SpatialClaw instead writes one cell, observes the kernel state (variables, images, errors), and then decides the next cell, enabling dynamic refinement.

The workspace is a long‑lived Python environment where every object created by a code cell—arrays, masks, plots, or error messages—remains available to later cells, allowing the agent to build up and revise spatial analyses step by step.

A variable created in one cell, such as a depth map, can be inspected, visualized, and reused in later cells without recomputation.
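The persistence described here can be illustrated with a minimal sketch. It is not SpatialClaw's kernel (the source describes an IPython kernel); it only uses one shared dictionary with Python's built-in `exec` to show how state left by one cell stays visible to the next. `ToyKernel`, the cell contents, and the small `depth` list are invented for illustration.

```python
import io
from contextlib import redirect_stdout


class ToyKernel:
    """Hypothetical stand-in for a persistent kernel: one namespace shared by all cells."""

    def __init__(self):
        self.namespace = {}

    def run_cell(self, source: str) -> str:
        """Execute one cell in the shared namespace and return what it printed."""
        buffer = io.StringIO()
        with redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


kernel = ToyKernel()
# Cell 1: create an (invented) 2x3 depth map, in metres.
kernel.run_cell("depth = [[1.0, 2.0, 4.0], [1.5, 2.5, 3.5]]")
# Cell 2: inspect the variable left behind by cell 1.
out2 = kernel.run_cell("print(len(depth), len(depth[0]))")
# Cell 3: reuse it without recomputing anything.
out3 = kernel.run_cell("nearest = min(min(row) for row in depth); print(nearest)")
```

A one-off executor would create a fresh namespace for every call, so cells 2 and 3 would fail with a `NameError` on `depth`.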

Planning: the agent receives the question and tool documentation, then produces a high‑level analysis plan.

Code Generation: the agent writes a single Python cell that implements the next sub‑task.

Code Execution: the cell runs in the persistent kernel, updating variables and possibly emitting images.

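Feedback Assembly: the system captures stdout, variable summaries, and any images registered via show(), and appends them to the model context.

Answer Submission: when the agent decides it has enough evidence, it calls ReturnAnswer(); the loop also stops once the step count reaches the predefined maximum $N_{\text{max}}$.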

**Figure 3.** Agentic loop for iterative code execution. SpatialClaw wraps a persistent kernel in a five-stage loop. A planner receives the question and tool documentation but not the images, and produces an analysis plan. The main agent generates a Python cell executed in the persistent kernel. Feedback comprising stdout, variable summaries, and images registered via show() is appended to the model context. The loop continues until the agent submits an answer with ReturnAnswer() or the step count reaches the predefined maximum $N_{\text{max}}$.

Performance and Benchmarks

SpatialClaw consistently outperforms baselines across all spatial reasoning benchmarks.

The SpatialClaw agent operates inside a persistent kernel and follows a deterministic five‑stage loop: planning, code generation, execution, feedback assembly, and answer submission. Each stage is governed by a standard agent control structure, while the system prompt supplies the spatial reasoning discipline.

The kernel is a long‑lived Python environment that retains variables, images, and intermediate results across successive code cells.

How does a Persistent Kernel differ from a regular one‑off code execution?

In a regular execution the process starts fresh each time, discarding prior variables. The Persistent Kernel keeps the entire session alive, so later steps can directly reuse earlier masks, plots, or numeric results without re‑running the same code.

The loop proceeds as follows: the planner (Stage I) writes a high‑level analysis plan; the main VLM (Stage II) translates the next sub‑task into Python; the static checker (Stage III) validates the cell before execution; the kernel runs the cell and returns outputs; Stage IV packages those outputs as feedback for the next iteration; finally, Stage V ends the process when ReturnAnswer() yields a properly formatted response or when the step limit $N_{\text{max}}$ is reached.
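The control flow of this loop can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: `run_episode`, `policy`, and `AnswerSubmitted` are hypothetical names, the planner stage is omitted, the built-in `compile` stands in for the static checker, and a scripted list of cells stands in for the VLM. Only `ReturnAnswer()` and the step limit are taken from the description.

```python
import io
import traceback
from contextlib import redirect_stdout


class AnswerSubmitted(Exception):
    """Raised by the toy ReturnAnswer() to end the episode."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


def run_episode(policy, n_max=5):
    """Run cells from `policy` until an answer is submitted or n_max steps pass.

    `policy(history)` is a hypothetical stand-in for the VLM: it maps the
    feedback collected so far to the source code of the next cell.
    """
    def return_answer(value):
        raise AnswerSubmitted(value)

    namespace = {"ReturnAnswer": return_answer}  # persists across steps
    history = []
    for step in range(n_max):
        cell = policy(history)
        try:
            compile(cell, f"<cell {step}>", "exec")  # cheap pre-execution check
        except SyntaxError as err:
            history.append(f"rejected: {err.msg}")
            continue
        buffer = io.StringIO()
        try:
            with redirect_stdout(buffer):
                exec(cell, namespace)
            history.append(buffer.getvalue())
        except AnswerSubmitted as done:
            return done.value, history
        except Exception:
            # Runtime errors become observations instead of ending the episode.
            history.append(traceback.format_exc(limit=0))
    return None, history


# Scripted policy: the second cell is broken on purpose, the third recovers.
script = ["x = 3", "print(x +)", "print(x * 2)", "ReturnAnswer(x * 2)"]
answer, history = run_episode(lambda h: script[len(h)])
```

The rejected cell consumes a step but leaves `x` intact, so the next cell can continue from the same state.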

SpatialClaw outperforms all baselines on every benchmark.

Table 1 reports an average score of 59.9 for SpatialClaw, exceeding the next‑best method by a clear margin across 20 spatial reasoning tasks.

SpatialClaw’s consistent outperformance across benchmarks demonstrates the value of a persistent, iterative execution loop for spatial reasoning.

Analysis of Reasoning Gains

SpatialClaw beats all baselines, especially on video and multi‑view tasks.

SpatialClaw outperforms all baselines, with an average gain of 11.2 percentage points over the strongest competitor (SpaceTools).

Table 3 shows SpatialClaw’s average margin of +11.2 percentage points over SpaceTools across the 20 benchmarks.

**Figure 4.** Pairwise win/loss margin of SpatialClaw over baselines across 13 meta-categories. SpatialClaw outperforms both (a) Structured tool-call and (b) Single-pass Code in 11/13 categories. The largest gains concentrate in categories that demand multi-step geometric composition.

**Figure 6.** Attribution of SpatialClaw's wins over structured tool-call via LLM-as-judge. Over half of the gains are driven by code composition, 19.5% by control flow, and 28.3% are interface-neutral wins on perceptual tasks unaffected by the action interface.

Failure Modes and Insights

SpatialClaw iteratively runs code to inspect and refine visual evidence, boosting spatial reasoning.

SpatialClaw replaces static tool‑calling with a persistent, iterative code‑execution loop. This lets the agent inspect visual evidence and refine its plan across steps.

The agent repeatedly runs short code snippets, each building on the state left by the previous one, allowing it to adjust its reasoning as new visual information becomes available.

Ablation (I) removes all utility wrappers (e.g., tools.Mask, tools.Geometry) while retaining core perception tools (SAM3/DA3) and the scientific libraries (numpy, scipy). Despite this reduction, performance matches that of the full SpatialClaw system, showing that the persistent kernel with scientific primitives can largely compensate for missing utilities. Ablation (II) drops the perception tools, leaving only the code‑as‑action interface, and still delivers a +2.7 % gain over the no‑tool baseline, isolating the interface’s contribution.

The agent spontaneously adapts its tool composition to the question type. Heat‑map analysis of primitive usage across 13 meta‑categories reveals that distance‑type questions heavily invoke KD‑tree search and norm operations, while direction‑type questions rely on dot‑product calculations.
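The two usage patterns named here can be made concrete with a short sketch of the kind of cell an agent might write. It assumes object point clouds are already available as `(N, 3)` NumPy arrays; the function names and the sample coordinates are illustrative and do not come from the paper. `scipy.spatial.KDTree` and `numpy.dot` are the real library calls.

```python
import numpy as np
from scipy.spatial import KDTree


def min_distance(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Smallest Euclidean distance from any point in A to any point in B."""
    tree = KDTree(points_b)
    nearest, _ = tree.query(points_a, k=1)  # nearest-neighbour distance per point in A
    return float(nearest.min())


def is_in_front(camera_forward: np.ndarray, camera_pos: np.ndarray,
                target: np.ndarray) -> bool:
    """True if the target lies in the half-space the camera is facing."""
    offset = target - camera_pos
    return float(np.dot(camera_forward, offset)) > 0.0


# Invented point clouds (metres) for a distance-type question.
chair = np.array([[0.0, 0.0, 2.0], [0.2, 0.0, 2.1]])
table = np.array([[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]])
gap = min_distance(chair, table)

# Direction-type question: is the chair in front of a camera looking along +z?
front = is_in_front(np.array([0.0, 0.0, 1.0]), np.zeros(3), chair[0])
```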

SpatialClaw’s gains are largest precisely where chained geometric computation across frames and viewpoints is required. Compared with Structured tool‑call and Single‑pass Code baselines, SpatialClaw shows a net advantage in 11 of 13 categories, with the biggest lifts (+6–9 percentage points) in Camera motion, Multi‑view/viewpoint reasoning, and Relative direction.

**Figure 5.** Composition adapts to the question type. Primitive usage frequency across meta-categories.

**Figure 7.** Failure-mode breakdown of incorrect agent sessions. Each session is classified by an LLM-as-Judge (Gemini-3.1-Pro (Team et al., 2023)) into one of 11 fine-grained failure categories.

Related Work and Background

Supplementary details on related work, evaluation, system design, and tool APIs.

This supplement expands on prior work, evaluation setup, system components, and tool specifications that underpin the SpatialClaw agent.

One line of work fine-tunes VLMs with 3D supervision or embeds geometry modules. These models achieve fast inference but require retraining when perception components change.

A second line uses LLMs to compose calls to specialist vision modules, either synthesizing a full program in one pass or dispatching calls from structured tool menus.

A third line builds agents that combine LLM planning with geometric constraints, 3D scene graphs, or iterative view selection, yet these agents typically lack turn-by-turn code generation.

Finally, code-as-action frameworks have the LLM emit executable Python code instead of JSON actions, which improves flexibility for general-purpose agents.

The evaluation protocol follows the 20 spatial reasoning benchmarks from the main results, grouping them into five categories and applying a uniform per‑sample scoring scheme.

Additional analysis attributes the gains over the structured tool-call baseline to code composition (over half), control flow (19.5%), and interface-neutral factors (28.3%). Geometric reasoning errors dominate failure modes, with perception hallucinations as a secondary source.

Two baselines share the full agent loop and perception tools; they differ only in the per‑step action format.

Single‑pass Code Generation collapses the iterative loop into one turn, forcing the agent to produce a complete Python cell without observing any intermediate tool output.

Structured Tool‑Call Sequence retains the multi‑step loop but restricts each step to a single JSON‑encoded tool invocation, disallowing arbitrary Python expressions.

Agent System Design details the two serving roles, input preprocessing, persistent kernel semantics, security sandbox, per‑frame contracts, and comprehensive error handling.

System Configuration runs the language model via vLLM and hosts perception services (Depth Anything 3, SAM3) behind lightweight HTTP endpoints, enabling independent scaling.

Input images are resized to a 768‑pixel long edge and capped at 32 frames; longer videos are uniformly sampled.
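A small sketch of this preprocessing, using the 768-pixel and 32-frame limits stated above. The exact sampling and rounding rules are assumptions: this version includes the first and last frame when sampling, and scales every image (up or down) so its long edge is exactly 768. Both function names are hypothetical.

```python
def sample_frame_indices(num_frames: int, max_frames: int = 32) -> list[int]:
    """Pick at most max_frames indices spread evenly over [0, num_frames - 1]."""
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    if num_frames <= max_frames:
        return list(range(num_frames))
    if max_frames == 1:
        return [0]
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


def resize_to_long_edge(width: int, height: int, long_edge: int = 768) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals long_edge, keeping aspect ratio."""
    scale = long_edge / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


short_clip = sample_frame_indices(10)    # short videos keep every frame
long_clip = sample_frame_indices(100)    # long videos are subsampled to 32 frames
new_size = resize_to_long_edge(1920, 1080)
```

The returned indices are absolute positions in the original video, which is what lets later steps convert them back to timestamps.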

The IPython kernel persists across steps, preserving variables such as masks, reconstructions, and intermediate results for reuse.

The sandbox statically analyses ASTs to reject unsafe patterns (file I/O, exec, imports) and returns rejection reasons as feedback.
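A static check of this kind can be sketched with Python's standard `ast` module. The denylist and the message format below are assumptions for illustration; the source only says that file I/O, `exec`, and imports are rejected and that the reasons are returned as feedback. A denylist this small is easy to bypass, so it shows the mechanism and is not a usable sandbox.

```python
import ast

# Hypothetical denylist; the real rule set is not given in the source.
BLOCKED_CALLS = {"exec", "eval", "open", "compile", "__import__"}


def check_cell(source: str) -> list[str]:
    """Return human-readable rejection reasons; an empty list means the cell passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"line {err.lineno}: syntax error: {err.msg}"]
    reasons = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            reasons.append(f"line {node.lineno}: import statements are not allowed")
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id in BLOCKED_CALLS):
            reasons.append(f"line {node.lineno}: call to {node.func.id}() is not allowed")
    return reasons


ok = check_cell("d = ((1 - 4) ** 2 + (2 - 6) ** 2) ** 0.5")
bad = check_cell("import os\ndata = open('/etc/passwd').read()")
```

Because the cell is only parsed and never run, a rejected cell has no side effects, and the reasons can be handed back to the agent as ordinary feedback.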

Per‑frame containers record absolute frame indices; mismatched indices raise immediate exceptions during composition.
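The contract can be illustrated with a minimal container that carries its frame indices and refuses to combine data from different frames. `PerFrame` and `combine` are hypothetical names written for this sketch; they are not the paper's classes, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerFrame:
    """Hypothetical container: one value per absolute frame index."""
    frame_indices: tuple[int, ...]
    values: tuple

    def __post_init__(self):
        if len(self.frame_indices) != len(self.values):
            raise ValueError("frame_indices and values must have the same length")


def combine(a: PerFrame, b: PerFrame, fn) -> PerFrame:
    """Apply fn pairwise, refusing to mix data from different frames."""
    if a.frame_indices != b.frame_indices:
        raise ValueError(
            f"frame index mismatch: {a.frame_indices} vs {b.frame_indices}")
    return PerFrame(a.frame_indices,
                    tuple(fn(x, y) for x, y in zip(a.values, b.values)))


# Invented per-frame mask areas (pixels) and depths (metres) on frames 0, 8, 16.
areas = PerFrame((0, 8, 16), (120, 140, 90))
depths = PerFrame((0, 8, 16), (2.0, 2.5, 3.0))
scaled = combine(areas, depths, lambda area, z: area * z * z)
```

Failing loudly at composition time turns a silent misalignment into an error message the agent can read and correct in the next step.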

Error handling routes runtime exceptions, format violations, sandbox rejections, and timeouts back to the agent as observations, allowing in‑episode revision.

Prompt Details enumerate four VLM roles, response format, visual access primitives, tool signatures, coordinate conventions, robust‑computation principles, and budget constraints.

Tool API Reference expands the brief tool descriptions into full signatures and behavior for Reconstruct, SAM3, Geometry, Mask, Time, Graph, and Draw.

tools.Reconstruct wraps Depth Anything 3, returning per‑frame depth, intrinsics, extrinsics, and point clouds aligned with SAM3 masks.
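To show how depth, intrinsics, and extrinsics relate to a point cloud, here is a sketch of standard pinhole back-projection in NumPy. It is not the `tools.Reconstruct` implementation. It assumes depth is measured along the optical axis, a 3x3 intrinsics matrix `K` with no skew, and a 4x4 camera-to-world extrinsic matrix; the source does not state which conventions the tool uses.

```python
import numpy as np


def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift an (H, W) depth map to an (H, W, 3) array of camera-frame points."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def to_world(points_cam: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Apply a 4x4 camera-to-world transform to (..., 3) points."""
    return points_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]


# Invented 3x5 depth map at a constant 2 m, principal point at pixel (2, 1).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 5), 2.0)
points = backproject(depth, K)

# Camera shifted 1 m along world x, no rotation.
pose = np.eye(4)
pose[0, 3] = 1.0
world_points = to_world(points, pose)
```

Indexing `points` with a boolean segmentation mask of the same height and width then yields the 3D points belonging to one object.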

tools.SAM3 provides image‑ and video‑mode segmentation APIs, returning PerFrameMask objects with absolute frame indices.

tools.Geometry offers common numeric primitives such as Euclidean distance, angle computation, projection, and RANSAC ground‑plane fitting.
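As an illustration of the last primitive, RANSAC plane fitting can be sketched in plain NumPy. This is a generic textbook version, not the `tools.Geometry` code: the function name, iteration count, inlier threshold, and the final least-squares refinement are all assumptions, and it expects at least one non-degenerate sample.

```python
import numpy as np


def fit_plane_ransac(points, n_iters=200, threshold=0.02, seed=0):
    """Return (unit normal n, offset d, inlier mask) for the plane n.p + d = 0."""
    rng = np.random.default_rng(seed)
    best_mask, best_count = None, -1
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:  # degenerate (collinear) sample
            continue
        normal = normal / norm
        mask = np.abs((points - p0) @ normal) < threshold
        if mask.sum() > best_count:
            best_mask, best_count = mask, int(mask.sum())
    # Refine on the inliers: the normal is the direction of least variance.
    inlier_points = points[best_mask]
    centroid = inlier_points.mean(axis=0)
    _, _, vt = np.linalg.svd(inlier_points - centroid, full_matrices=False)
    normal = vt[-1]
    return normal, -float(normal @ centroid), best_mask


# Synthetic scene: a noisy floor at y = 0 plus clutter well above it.
rng = np.random.default_rng(1)
ground = np.column_stack([rng.uniform(-1, 1, 200),
                          rng.normal(0.0, 0.002, 200),
                          rng.uniform(0, 3, 200)])
clutter = rng.uniform(-1, 1, (20, 3)) + np.array([0.0, 1.5, 0.0])
normal, offset, inliers = fit_plane_ransac(np.vstack([ground, clutter]))
```

The sign of the returned normal is arbitrary, so a caller that needs "up" should flip it to point toward the camera or toward the bulk of the scene.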

tools.Mask supplies mask statistics (centroid, area, IoU) over boolean masks returned by SAM3.
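These three statistics are simple to state in NumPy. The sketch below uses hypothetical function names and treats a mask as an `(H, W)` boolean array; how `tools.Mask` handles empty masks or orders centroid coordinates is not specified in the source, so those choices are assumptions.

```python
import numpy as np


def mask_area(mask: np.ndarray) -> int:
    """Number of foreground pixels."""
    return int(mask.sum())


def mask_centroid(mask: np.ndarray):
    """(x, y) centroid in pixel coordinates, or None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two same-shape masks (0.0 if both are empty)."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


# Two invented 2x3 rectangles that overlap in one column.
a = np.zeros((4, 6), dtype=bool)
a[1:3, 1:4] = True
b = np.zeros((4, 6), dtype=bool)
b[1:3, 3:6] = True

area_a = mask_area(a)
centroid_a = mask_centroid(a)
iou_ab = mask_iou(a, b)
```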

tools.Time converts between frame indices and seconds based on metadata frame rate.
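The conversion itself is a single division or multiplication. The sketch below uses hypothetical function names and assumes a constant frame rate and rounding to the nearest frame, neither of which is confirmed by the source.

```python
def frame_to_seconds(frame_index: int, fps: float) -> float:
    """Timestamp of a frame, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_index / fps


def seconds_to_frame(seconds: float, fps: float) -> int:
    """Nearest frame index for a timestamp, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round(seconds * fps)


def elapsed_seconds(frame_a: int, frame_b: int, fps: float) -> float:
    """Time between two absolute frame indices."""
    return frame_to_seconds(frame_b, fps) - frame_to_seconds(frame_a, fps)


t = frame_to_seconds(90, 30.0)
f = seconds_to_frame(2.5, 24.0)
dt = elapsed_seconds(30, 90, 30.0)
```

Working from absolute frame indices matters for subsampled videos: the elapsed time between two sampled frames depends on their positions in the original video, not on their positions in the sampled list.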

tools.Graph renders line plots from numeric sequences and attaches a concise textual summary.

tools.Draw creates PIL overlays for bounding boxes, lines, and points on images.

Limitations stem from perception quality; broader impact highlights training‑free spatial reasoning extensions for deployed VLMs.

**Table 6.** Evaluation backbones for SpatialClaw.
