SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

SpatialClaw uses a persistent Python kernel as an action interface to enable flexible, multi-step spatial reasoning.

How can an agentic action interface improve spatial reasoning in vision-language models beyond standard code generation or structured tool-calling?

Vision-language models struggle with complex 3D spatial reasoning because they lack a way to iteratively compose and verify geometric evidence from pixels. SpatialClaw replaces rigid tool-call menus or one-shot code generation with a persistent Python kernel, allowing the agent to write, inspect, and revise code across multiple steps. This approach achieves 59.9% average accuracy across 20 benchmarks, outperforming existing spatial agents by 11.2 points without task-specific tuning.

Paper Primer

Existing spatial agents are bottlenecked by their action interfaces: single-pass code execution forces the model to commit to a strategy before seeing intermediate results, while structured tool-calls limit the agent to a predefined menu of operations. SpatialClaw treats code as an orchestration space: the agent writes one executable Python cell per step, and the kernel preserves all variables—such as masks, depth maps, and point clouds—for use in subsequent reasoning steps.

SpatialClaw significantly improves performance on complex spatial reasoning tasks compared to structured tool-call interfaces.

Across 20 benchmarks, SpatialClaw achieved an average accuracy of 59.9%, a +11.2 point gain over the next best spatial agent (SpaceTools).

The agent spontaneously adapts its tool composition to the question type, using scientific libraries like NumPy and SciPy to perform geometric computations that were not explicitly anticipated by the system designers. This flexibility is most effective in dynamic 4D video reasoning and multi-view inference, where the agent must chain geometric operations across frames and viewpoints.

Why is a persistent Python kernel superior to a structured tool-call interface for spatial reasoning?

Structured tool-calls limit the agent to predefined command schemas, making it difficult to compose perception outputs in ways not anticipated at design time. A persistent kernel allows the agent to treat perception outputs as ordinary variables, enabling iterative inspection, revision, and custom composition using standard scientific libraries.

Does SpatialClaw require model-specific tuning or reinforcement learning to achieve these gains?

No; SpatialClaw is a training-free framework. It uses a unified system prompt that encodes general principles of spatial reasoning, allowing it to transfer across different VLM backbones (ranging from 27B to 397B parameters) without modification.

For agentic spatial reasoning, the design of the action interface is as critical as the model backbone; providing an expressive, stateful environment for code execution allows agents to solve complex geometric problems that fixed-API systems cannot compose.

The Spatial Reasoning Challenge

We expose how static tool interfaces limit VLM spatial reasoning and introduce SpatialClaw’s persistent code loop.

Vision-Language Models (VLMs) excel at joint image‑text tasks but struggle with spatial reasoning—determining where objects are, how they move, and how they relate in three dimensions. This shortfall stems from static action interfaces that either force a one‑shot program or restrict the agent to a fixed set of tool calls, preventing inspection of intermediate visual evidence and dynamic composition of perception results. By treating code as a persistent, multi‑turn interface, SpatialClaw enables agents to iteratively plan, execute, and revise their analysis, directly addressing the brittleness of static VLM reasoning.

An Action Interface specifies how an agent invokes perception tools and how their outputs are represented for subsequent reasoning steps.

**Figure 1.** **SpatialClaw improves spatial reasoning across the board.** Per-benchmark accuracy on 20 spatial reasoning benchmarks (Gemma 4-31B backbone), split into two panels by task category. Each axis is individually rescaled so SpatialClaw traces the constant-radius ring. Baselines are SpaceTools-Toolshed (Chen et al., 2026), pySpatial (Luo et al., 2026), and a no-tool backbone.

**Figure 2.** Comparison of different approaches for solving a spatial reasoning task. (a) Single-pass code generation, (b) Structured tool-call sequence, and (c) SpatialClaw (Ours), which incorporates intermediate verification and iterative refinement.

Static VLM reasoning cannot adapt to intermediate visual evidence, forcing brittle one‑shot analysis.

The SpatialClaw Framework

SpatialClaw enables agents to iteratively generate and run code, inspecting visual evidence at each step.

Prior spatial agents rely on either a single‑pass code execution or a fixed set of structured tool calls. Both approaches force the agent to commit to a full analysis before any intermediate visual evidence is observed. This limitation motivates a third interface that treats code generation itself as a flexible, iterative action.

SpatialClaw treats code generation as the action interface, letting an agent write one Python cell at a time, run it in a persistent kernel, and condition the next cell on the resulting variables, images, and text feedback.

How does this differ from the single‑pass code approach used in earlier agents?

In the single‑pass setting the entire program is written before any output is seen, so the agent cannot react to intermediate masks or errors. SpatialClaw instead writes one cell, observes the kernel state (variables, images, errors), and then decides the next cell, enabling dynamic refinement.

The workspace is a long‑lived Python environment where every object created by a code cell—arrays, masks, plots, or error messages—remains available to later cells, allowing the agent to build up and revise spatial analyses step by step.

Step 1:

Step 2:

Step 3:

Step 4:

This walk‑through shows how a variable created in one cell (the depth map) can be inspected, visualized, and reused in later cells without re‑computing.
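The bodies of the four steps are not preserved in this text, so the following is only a minimal sketch of what such a walk-through could look like. It assumes a synthetic depth map in place of a real perception output, and a plain dictionary driven by `exec` in place of the actual kernel. The names `namespace` and `run_cell` are hypothetical.

```python
import numpy as np

# A plain dict stands in for the persistent kernel's namespace (an assumption;
# the system described here uses an IPython kernel).
namespace = {"np": np}

def run_cell(source: str) -> None:
    """Execute one cell; every name it defines stays in `namespace`."""
    exec(source, namespace)

# Step 1: create a synthetic 4x4 depth map (stand-in for a perception output).
run_cell("depth = np.arange(16, dtype=float).reshape(4, 4) + 1.0")

# Step 2: inspect the stored variable from a later cell without recomputing it.
run_cell("depth_range = (float(depth.min()), float(depth.max()))")

# Step 3: derive a mask of near pixels from the same depth map.
run_cell("near_mask = depth < 5.0")

# Step 4: reuse both earlier variables for a final measurement.
run_cell("mean_near_depth = float(depth[near_mask].mean())")

print(namespace["depth_range"], int(namespace["near_mask"].sum()))
```

Each call to `run_cell` sees everything earlier calls defined, which is the property the walk-through relies on.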

Planning: the agent receives the question and tool documentation, then produces a high‑level analysis plan.

Code Generation: the agent writes a single Python cell that implements the next sub‑task.

Code Execution: the cell runs in the persistent kernel, updating variables and possibly emitting images.

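Feedback Assembly: the system captures stdout, variable summaries, and any images registered via show(), and appends them to the model context.

Answer Submission: when the agent decides it has enough evidence, it calls ReturnAnswer(); the loop also stops once the step count reaches the predefined maximum $N_{max}$.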

**Figure 3.** Agentic loop for iterative code execution. SpatialClaw wraps a persistent kernel in a five-stage loop. A planner receives the question and tool documentation but not the images, and produces an analysis plan. The main agent generates a Python cell executed in the persistent kernel. Feedback comprising stdout, variable summaries, and images registered via show() is appended to the model context. The loop continues until the agent submits an answer with ReturnAnswer() or the step count has reached the predefined maximum $N_{max}$.

Performance and Benchmarks

SpatialClaw consistently outperforms baselines across all spatial reasoning benchmarks.

The SpatialClaw agent operates inside a persistent kernel and follows a deterministic five‑stage loop: planning, code generation, execution, feedback assembly, and answer submission. Each stage is governed by a standard agent control structure, while the system prompt supplies the spatial reasoning discipline.

The kernel is a long‑lived Python environment that retains variables, images, and intermediate results across successive code cells.

How does a Persistent Kernel differ from a regular one‑off code execution?

In a regular execution the process starts fresh each time, discarding prior variables. The Persistent Kernel keeps the entire session alive, so later steps can directly reuse earlier masks, plots, or numeric results without re‑running the same code.

The loop proceeds as follows: the planner (Stage I) writes a high‑level analysis plan; the main VLM (Stage II) translates the next sub‑task into Python; the static checker (Stage III) validates the cell before execution; the kernel runs the cell and returns outputs; Stage IV packages those outputs as feedback for the next iteration; finally, Stage V ends the process when ReturnAnswer() yields a properly formatted response or when the step limit $N_{\text{max}}$ is reached.
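The control flow of this loop can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: a scripted list of cells replaces the VLM, planning and static checking are omitted, and `run_episode`, `make_kernel`, and `AnswerSubmitted` are hypothetical names. Only `ReturnAnswer` and the step limit come from the text.

```python
import contextlib
import io

class AnswerSubmitted(Exception):
    """Raised by the ReturnAnswer stand-in to end the episode."""

def make_kernel():
    """Create a persistent namespace that exposes a ReturnAnswer stand-in."""
    def ReturnAnswer(value):
        raise AnswerSubmitted(value)
    return {"ReturnAnswer": ReturnAnswer}

def run_episode(cells, n_max=5):
    """Run scripted cells in one namespace until an answer or n_max steps."""
    kernel = make_kernel()
    feedback_log = []
    for step, source in enumerate(cells[:n_max], start=1):
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(source, kernel)
        except AnswerSubmitted as done:
            return done.args[0], feedback_log
        except Exception as err:
            # Errors become observations instead of ending the episode.
            feedback_log.append(f"step {step} error: {type(err).__name__}: {err}")
            continue
        feedback_log.append(f"step {step} stdout: {stdout.getvalue().strip()}")
    return None, feedback_log

scripted_cells = [
    "widths = [2.0, 3.5]",
    "print(undefined_name)",      # fails, but the episode continues
    "print(max(widths))",         # reuses state from step 1
    "ReturnAnswer(max(widths))",
]
answer, log = run_episode(scripted_cells)
```

The failing second cell is recorded as feedback and the episode continues, so the later cells can still use `widths` from the first one.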

SpatialClaw outperforms all baselines on every benchmark.

Table 1 reports an average accuracy of 59.9% for SpatialClaw across the 20 spatial reasoning benchmarks, 11.2 points above the next-best spatial agent (SpaceTools).

SpatialClaw’s consistent outperformance across benchmarks demonstrates the value of a persistent, iterative execution loop for spatial reasoning.

Analysis of Reasoning Gains

SpatialClaw beats all baselines, especially on video and multi‑view tasks.

SpatialClaw outperforms all baselines, achieving an average gain of 11.2 percentage points over the strongest competitor (SpaceTools).

Table 3 shows SpatialClaw's average margin of +11.2 percentage points over SpaceTools across the 20 benchmarks.

**Figure 4.** Pairwise win/loss margin of SpatialClaw over baselines across 13 meta-categories. SpatialClaw outperforms both (a) Structured tool-call and (b) Single-pass Code in 11/13 categories. The largest gains concentrate in categories that demand multi-step geometric composition.

**Figure 6.** Attribution of SpatialClaw's wins over structured tool-call via LLM-as-judge. Over half of the gains are driven by code composition, 19.5% by control flow, and 28.3% are interface-neutral wins on perceptual tasks unaffected by the action interface.

Failure Modes and Insights

SpatialClaw iteratively runs code to inspect and refine visual evidence, boosting spatial reasoning.

SpatialClaw replaces static tool‑calling with a persistent, iterative code‑execution loop. This lets the agent inspect visual evidence and refine its plan across steps.

The agent repeatedly runs short code snippets, each building on the state left by the previous one, allowing it to adjust its reasoning as new visual information becomes available.

Ablation (I) removes all utility wrappers (e.g., tools.Mask, tools.Geometry) while retaining the core perception tools, SAM3 and Depth Anything 3 (DA3), and the scientific libraries (NumPy, SciPy). Despite this reduction, performance matches that of the full SpatialClaw system, showing that the persistent kernel with scientific primitives can largely compensate for missing utilities. Ablation (II) drops the perception tools, leaving only the code-as-action interface, and still delivers a +2.7% gain over the no-tool baseline, isolating the interface's contribution.

The agent spontaneously adapts its tool composition to the question type. Heat‑map analysis of primitive usage across 13 meta‑categories reveals that distance‑type questions heavily invoke KD‑tree search and norm operations, while direction‑type questions rely on dot‑product calculations.
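To make the two primitive families concrete, here is a small sketch of a KD-tree distance query and a dot-product direction test. The point clouds and camera pose are synthetic assumptions, and the variable names are illustrative; only `scipy.spatial.KDTree` and the NumPy calls are real library APIs.

```python
import numpy as np
from scipy.spatial import KDTree

# Two synthetic object point clouds (assumed data, in metres).
chair = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
table = np.array([[3.0, 0.0, 0.0], [4.0, 0.0, 0.0]])

# Distance-type question: smallest nearest-neighbour gap between the clouds.
tree = KDTree(table)
nearest_dists, _ = tree.query(chair)
min_gap = float(nearest_dists.min())

# Direction-type question: is the table in front of a camera at the origin
# that looks along +x? A positive dot product means "in front".
camera_forward = np.array([1.0, 0.0, 0.0])
to_table = table.mean(axis=0)
to_table_unit = to_table / np.linalg.norm(to_table)
facing_score = float(camera_forward @ to_table_unit)
```

The KD-tree avoids comparing every pair of points, which matters once the clouds come from dense depth maps instead of a handful of synthetic points.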

SpatialClaw’s gains are largest precisely where chained geometric computation across frames and viewpoints is required. Compared with Structured tool‑call and Single‑pass Code baselines, SpatialClaw shows a net advantage in 11 of 13 categories, with the biggest lifts (+6–9 percentage points) in Camera motion, Multi‑view/viewpoint reasoning, and Relative direction.

**Figure 5.** Composition adapts to the question type. Primitive usage frequency across meta-categories.

**Figure 7.** Failure-mode breakdown of incorrect agent sessions. Each session is classified by an LLM-as-Judge (Gemini-3.1-Pro (Team et al., 2023)) into one of 11 fine-grained failure categories.

Related Work and Background

Supplementary details on related work, evaluation, system design, and tool APIs.

This supplement expands on prior work, evaluation setup, system components, and tool specifications that underpin the SpatialClaw agent.

One line of prior work fine-tunes VLMs with 3D supervision or embeds geometry modules. These models achieve fast inference but require retraining when perception components change.

A second line uses LLMs that compose calls to specialist vision modules, either synthesizing a full program in one pass or dispatching calls from a structured tool menu.

Spatial agents combine LLM planning with geometric constraints, 3D scene graphs, or iterative view selection, yet typically lack turn-by-turn code generation.

Code-as-action frameworks have the LLM emit executable Python code instead of JSON actions, improving flexibility for general-purpose agents.

The evaluation protocol follows the 20 spatial reasoning benchmarks from the main results, grouping them into five categories and applying a uniform per‑sample scoring scheme.

Additional analysis attributes over half of the gains to code composition, 19.5% to control flow, and 28.3% to interface-neutral wins on perceptual tasks. Geometric reasoning errors dominate failure modes, with perception hallucinations as a secondary source.

Two baselines share the full agent loop and perception tools; they differ only in the per‑step action format.

Single‑pass Code Generation collapses the iterative loop into one turn, forcing the agent to produce a complete Python cell without observing any intermediate tool output.

Structured Tool‑Call Sequence retains the multi‑step loop but restricts each step to a single JSON‑encoded tool invocation, disallowing arbitrary Python expressions.
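The difference in per-step action format can be illustrated with a toy example. The sketch below assumes a hypothetical two-entry tool menu (`distance_1d`, `midpoint_1d`) and a hypothetical `dispatch` function; none of these names come from the paper. It only shows why a single JSON invocation expresses less than a Python cell.

```python
import json

# Hypothetical tool registry standing in for a structured tool menu.
TOOLS = {
    "distance_1d": lambda a, b: abs(a - b),
    "midpoint_1d": lambda a, b: (a + b) / 2,
}

def dispatch(tool_call_json: str):
    """Run exactly one registered tool from a JSON-encoded invocation."""
    call = json.loads(tool_call_json)
    if call["tool"] not in TOOLS:
        raise ValueError(f"unknown tool: {call['tool']}")
    return TOOLS[call["tool"]](**call["args"])

# Structured step: one menu entry per step, nothing else is expressible.
structured_result = dispatch('{"tool": "distance_1d", "args": {"a": 2.0, "b": 5.0}}')

# Code step: the same tools composed with ordinary Python in a single cell.
positions = [2.0, 5.0, 9.0]
center = TOOLS["midpoint_1d"](positions[0], positions[-1])
code_result = max(TOOLS["distance_1d"](p, center) for p in positions)
```

Reproducing `code_result` through `dispatch` would take one call per tool invocation, with the surrounding loop and `max` left to the model's own reasoning.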

Agent System Design details the two serving roles, input preprocessing, persistent kernel semantics, security sandbox, per‑frame contracts, and comprehensive error handling.

System Configuration runs the language model via vLLM and hosts perception services (Depth Anything 3, SAM3) behind lightweight HTTP endpoints, enabling independent scaling.

Input images are resized to a 768‑pixel long edge and capped at 32 frames; longer videos are uniformly sampled.
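The preprocessing arithmetic can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the sketch scales every image so its long edge is exactly 768 pixels (the text does not say whether smaller images are upscaled), and it samples frames uniformly with both endpoints included.

```python
def resize_to_long_edge(width: int, height: int, long_edge: int = 768):
    """Return (width, height) scaled so the longer side equals long_edge."""
    scale = long_edge / max(width, height)
    return round(width * scale), round(height * scale)

def sample_frame_indices(num_frames: int, max_frames: int = 32):
    """Return at most max_frames indices spread uniformly over the video."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]

new_size = resize_to_long_edge(1920, 1080)
indices = sample_frame_indices(100)
```

Keeping the returned indices absolute, not renumbering them from zero, lets later steps refer back to positions in the original video.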

The IPython kernel persists across steps, preserving variables such as masks, reconstructions, and intermediate results for reuse.

The sandbox statically analyses ASTs to reject unsafe patterns (file I/O, exec, imports) and returns rejection reasons as feedback.
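A static check of this kind can be sketched with Python's standard `ast` module. The rule set below is a simplified assumption (it rejects all imports and a few named built-in calls) and `check_cell` is a hypothetical name; the actual sandbox rules are not given in this text.

```python
import ast

# Assumed deny-list for illustration only.
FORBIDDEN_CALLS = {"exec", "eval", "open", "__import__"}

def check_cell(source: str):
    """Return a list of rejection reasons; an empty list means the cell passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    reasons = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            reasons.append(f"line {node.lineno}: import statements are not allowed")
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id in FORBIDDEN_CALLS):
            reasons.append(f"line {node.lineno}: call to {node.func.id}() is not allowed")
    return reasons

safe = check_cell("area = 3 * 4")
unsafe = check_cell("import os\ndata = open('x.txt').read()")
```

Because the cell is only parsed and never run, the reasons can be returned to the agent as text without any side effects. A name-based deny-list like this one is easy to evade (for example through aliasing), so it should be read as an illustration of the mechanism, not as a complete security boundary.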

Per‑frame containers record absolute frame indices; mismatched indices raise immediate exceptions during composition.
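One way to picture this contract is a container that refuses to combine values from different frames. The `PerFrame` class and its `combine` method below are hypothetical and only sketch the idea of failing fast on mismatched absolute frame indices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PerFrame:
    """Values keyed by absolute frame index (hypothetical container)."""
    values: dict  # maps absolute frame index -> value

    def combine(self, other: "PerFrame", fn):
        """Apply fn frame by frame; raise if the frame indices differ."""
        if self.values.keys() != other.values.keys():
            raise ValueError(
                f"frame index mismatch: {sorted(self.values)} vs {sorted(other.values)}"
            )
        return PerFrame({i: fn(self.values[i], other.values[i]) for i in self.values})

depth_a = PerFrame({0: 2.0, 8: 2.5})
depth_b = PerFrame({0: 3.0, 8: 4.5})
gap = depth_a.combine(depth_b, lambda a, b: b - a)

try:
    depth_a.combine(PerFrame({0: 3.0, 16: 4.5}), lambda a, b: b - a)
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)
```

Raising at composition time turns a silent misalignment (comparing frame 8 with frame 16) into an explicit error the agent can read and correct.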

Error handling routes runtime exceptions, format violations, sandbox rejections, and timeouts back to the agent as observations, allowing in‑episode revision.

Prompt Details enumerate four VLM roles, response format, visual access primitives, tool signatures, coordinate conventions, robust‑computation principles, and budget constraints.

Tool API Reference expands the brief tool descriptions into full signatures and behavior for Reconstruct, SAM3, Geometry, Mask, Time, Graph, and Draw.

tools.Reconstruct wraps Depth Anything 3, returning per‑frame depth, intrinsics, extrinsics, and point clouds aligned with SAM3 masks.
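The link between a depth map, intrinsics, and a point cloud can be sketched with the standard pinhole camera model. This is not the tools.Reconstruct implementation, whose internals are not described here; `unproject_depth` is a hypothetical helper and the intrinsics matrix is made up for the example.

```python
import numpy as np

def unproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) depth map to (H, W, 3) camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns and rows
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Assumed intrinsics: focal length 100 px, principal point at pixel (1, 1).
K = np.array([[100.0, 0.0, 1.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
depth = np.full((3, 3), 2.0)
points = unproject_depth(depth, K)

# Selecting an object's points with a boolean mask of the same height and width.
mask = np.zeros((3, 3), dtype=bool)
mask[1, 1] = True
object_points = points[mask]  # shape (N, 3)
```

Because the point array keeps the image layout, a segmentation mask with the same height and width can index it directly, which is what "aligned with masks" amounts to in this sketch.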

tools.SAM3 provides image‑ and video‑mode segmentation APIs, returning PerFrameMask objects with absolute frame indices.

tools.Geometry offers common numeric primitives such as Euclidean distance, angle computation, projection, and RANSAC ground‑plane fitting.
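As an illustration of the last primitive, here is a minimal RANSAC plane fit in NumPy. It is a sketch under stated assumptions, not the tools.Geometry code: the function name, iteration count, and inlier threshold are arbitrary choices, the data are synthetic, and no least-squares refinement on the inliers is performed.

```python
import numpy as np

def fit_plane_ransac(points, n_iters=200, threshold=0.02, seed=0):
    """Return (unit normal, offset d) with normal @ p + d ~ 0 for inliers."""
    rng = np.random.default_rng(seed)
    best_count, best_model = -1, None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:  # skip degenerate (collinear) samples
            continue
        normal = normal / norm
        d = -(normal @ sample[0])
        count = int((np.abs(points @ normal + d) < threshold).sum())
        if count > best_count:
            best_count, best_model = count, (normal, d)
    return best_model

# Synthetic scene: a flat ground at z = 0 plus clutter floating above it.
rng = np.random.default_rng(1)
ground = np.column_stack(
    [rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200), np.zeros(200)]
)
clutter = rng.uniform(-1, 1, (40, 3)) + np.array([0.0, 0.0, 1.5])
normal, d = fit_plane_ransac(np.vstack([ground, clutter]))
```

The recovered normal may point up or down, so downstream code that needs a consistent "up" direction would have to fix the sign itself.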

tools.Mask supplies mask statistics (centroid, area, IoU) over boolean masks returned by SAM3.
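These three statistics are short to write directly in NumPy, which is consistent with the ablation finding that the wrappers can be dropped. The functions below are illustrative equivalents with hypothetical names, not the tools.Mask API.

```python
import numpy as np

def mask_area(mask: np.ndarray) -> int:
    """Number of True pixels."""
    return int(mask.sum())

def mask_centroid(mask: np.ndarray):
    """Mean (row, col) of True pixels; None for an empty mask."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return float(rows.mean()), float(cols.mean())

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

# Two overlapping 2x2 squares on a 4x4 grid.
a = np.zeros((4, 4), dtype=bool)
a[0:2, 0:2] = True
b = np.zeros((4, 4), dtype=bool)
b[1:3, 1:3] = True
```

Here `mask_centroid` returns coordinates in (row, col) order; code that mixes it with (x, y) pixel conventions would need to swap the pair.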

tools.Time converts between frame indices and seconds based on metadata frame rate.
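The conversion itself is simple arithmetic. The sketch below assumes a constant frame rate and uses hypothetical function names; it is not the tools.Time signature.

```python
def frame_to_seconds(frame_index: int, fps: float) -> float:
    """Timestamp of an absolute frame index at a constant frame rate."""
    return frame_index / fps

def seconds_to_frame(seconds: float, fps: float) -> int:
    """Nearest absolute frame index for a timestamp."""
    return round(seconds * fps)
```

For example, at 30 frames per second frame 90 corresponds to 3.0 seconds, and at 24 frames per second the timestamp 2.5 s maps to frame 60.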

tools.Graph renders line plots from numeric sequences and attaches a concise textual summary.

tools.Draw creates PIL overlays for bounding boxes, lines, and points on images.

The main limitations stem from perception quality. The broader-impact discussion highlights training-free extension of spatial reasoning to deployed VLMs.

**Table 6.** Evaluation backbones for SpatialClaw.

Questions & answers

What is SpatialClaw and what is its main contribution?

SpatialClaw is a training-free agentic framework for 3D spatial reasoning that introduces a persistent Python kernel as its action interface, allowing a vision-language model to write one executable code cell per step, observe intermediate results, and revise its approach across multiple steps rather than committing to a full program upfront.

What problem does SpatialClaw address?

SpatialClaw addresses the bottleneck created by static action interfaces in existing spatial agents: single-pass code execution forces the model to commit to a full strategy before seeing any intermediate visual evidence, while structured tool-call menus restrict the agent to a predefined set of operations that cannot be flexibly composed.

Why do vision-language models struggle with spatial reasoning?

VLMs struggle with complex 3D spatial reasoning because they lack a mechanism to iteratively compose and verify geometric evidence from pixels; static action interfaces prevent them from adapting their analysis in response to intermediate visual outputs.

How does SpatialClaw's persistent kernel differ from single-pass code execution?

In single-pass execution the entire program is written before any output is seen, so the agent cannot react to intermediate masks or errors. With SpatialClaw's persistent kernel, the agent writes one cell, observes the kernel state (variables, images, errors), and then decides the next cell, enabling dynamic refinement across steps.

Why is a persistent Python kernel superior to a structured tool-call interface for spatial reasoning?

Structured tool-calls limit the agent to predefined command schemas, making it difficult to compose perception outputs in ways not anticipated at design time. A persistent kernel allows the agent to treat perception outputs as ordinary Python variables, enabling iterative inspection, revision, and custom composition using standard scientific libraries such as NumPy and SciPy.

What is the five-stage loop that governs SpatialClaw's operation?

SpatialClaw follows a deterministic five-stage loop: (I) planning, where a planner writes a high-level analysis plan; (II) code generation, where the main VLM translates the next sub-task into one Python cell; (III) code execution, where a static checker validates the cell and the persistent kernel runs it; (IV) feedback assembly, where stdout, variable summaries, and images registered via show() are packaged for the next iteration; and (V) answer submission, which ends the process when ReturnAnswer() yields a properly formatted response or the step limit $N_{max}$ is reached.

What benchmarks and evaluation setup does SpatialClaw use?

SpatialClaw is evaluated on 20 spatial reasoning benchmarks grouped into five categories, using a uniform per-sample scoring scheme. The paper does not name all 20 benchmarks individually but groups results into 13 meta-categories for analysis.

What are SpatialClaw's key quantitative results?

SpatialClaw achieves 59.9% average accuracy across 20 benchmarks, outperforming existing spatial agents by 11.2 percentage points. Its largest gains (+6–9 percentage points) occur in Camera motion, Multi-view/viewpoint reasoning, and Relative direction categories, and it shows a net advantage in 11 of 13 meta-categories over both Structured Tool-Call and Single-pass Code baselines.

Does SpatialClaw require task-specific tuning or reinforcement learning?

No; SpatialClaw is a training-free framework that uses a unified system prompt encoding general principles of spatial reasoning, allowing it to transfer across different VLM backbones ranging from 27B to 397B parameters without modification.

What do the ablation studies reveal about SpatialClaw's performance drivers?

Ablation (I) removes all utility wrappers (e.g., tools.Mask, tools.Geometry) while retaining core perception tools and scientific libraries, and performance matches the full system, showing the persistent kernel with scientific primitives can largely compensate for missing utilities. Ablation (II) drops the perception tools, leaving only the code-as-action interface, and still delivers a +2.7% gain over the no-tool baseline, which isolates the contribution of the interface itself.

How does additional analysis attribute SpatialClaw's performance gains?

The paper attributes over half of SpatialClaw's wins over the structured tool-call baseline to code composition, 19.5% to control flow, and 28.3% to interface-neutral wins on perceptual tasks that the action interface does not affect.

What are the main failure modes of SpatialClaw?

Geometric reasoning errors dominate SpatialClaw's failure modes, with perception hallucinations as a secondary source of failures.

What perception tools and scientific libraries does SpatialClaw use?

SpatialClaw uses Depth Anything 3 (wrapped as tools.Reconstruct) for depth estimation and point clouds, SAM3 (tools.SAM3) for image and video segmentation, and standard scientific libraries including NumPy and SciPy for geometric computations. Additional utilities include tools.Geometry, tools.Mask, tools.Time, tools.Graph, and tools.Draw.

How does SpatialClaw handle security and error management in code execution?

A sandbox statically analyses ASTs to reject unsafe patterns such as file I/O, exec, and unauthorized imports, returning rejection reasons as feedback to the agent. Runtime exceptions, format violations, sandbox rejections, and timeouts are all routed back to the agent as observations, allowing in-episode revision.

How does SpatialClaw adapt its tool usage to different question types?

The agent spontaneously adapts its tool composition to the question type: heat-map analysis across 13 meta-categories shows that distance-type questions heavily invoke KD-tree search and norm operations, while direction-type questions rely on dot-product calculations.

What are the input preprocessing constraints in SpatialClaw?

Input images are resized to a 768-pixel long edge and capped at 32 frames; longer videos are uniformly sampled to fit within this limit.

What are the limitations of SpatialClaw?

The paper states that SpatialClaw's limitations stem from perception quality, meaning errors in depth estimation or segmentation propagate into downstream geometric reasoning. The paper does not report limitations related to computational cost or latency.

How does SpatialClaw compare to its two main baselines?

The two baselines, Single-pass Code Generation and Structured Tool-Call Sequence, share SpatialClaw's full agent loop and perception tools and differ only in the per-step action format. Single-pass collapses the loop into one turn, while Structured Tool-Call restricts each step to a single JSON-encoded tool invocation. SpatialClaw shows a net advantage over both in 11 of 13 meta-categories.

What venue, authors, and date are associated with SpatialClaw?

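The authors are Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, and Min-Hung Chen. The publication venue is not specified. The paper is available on arXiv at arxiv.org/abs/2606.13673, and no submission or publication date is stated beyond what the arXiv identifier implies.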

Key terms

SpatialClaw: A training-free agentic framework for 3D spatial reasoning that uses a persistent Python kernel as its action interface, enabling iterative code writing, inspection, and revision across multiple steps.
persistent Python kernel: A code execution environment that remains alive across multiple agent steps, preserving all variables, masks, and intermediate results so later steps can reuse earlier computations without re-running code.
structured tool-call interface: An action interface that restricts an agent to selecting from a predefined menu of JSON-encoded commands, preventing arbitrary code composition or use of operations not anticipated at design time.
single-pass code generation: An approach where the agent writes a complete Python program in one turn before observing any intermediate output, preventing it from adapting its strategy based on intermediate visual evidence.
VLM (Vision-Language Model): A neural network model trained to process and reason jointly over both image and text inputs, used here as the backbone for spatial reasoning agents.
Depth Anything 3 (DA3): A perception model used by SpatialClaw to estimate per-frame depth, camera intrinsics, extrinsics, and point clouds from images or video frames.
SAM3: A segmentation model used by SpatialClaw that provides image- and video-mode segmentation APIs, returning per-frame mask objects with absolute frame indices.
tools.Reconstruct: A SpatialClaw utility wrapper around Depth Anything 3 that returns per-frame depth maps, camera intrinsics, extrinsics, and point clouds aligned with SAM3 segmentation masks.
tools.Geometry: A SpatialClaw utility offering common numeric geometric primitives such as Euclidean distance, angle computation, projection, and RANSAC ground-plane fitting.
tools.Mask: A SpatialClaw utility that computes mask statistics—including centroid, area, and IoU—over boolean masks returned by SAM3.
PerFrameMask: A data structure returned by SAM3 that records segmentation masks along with their absolute frame indices to prevent mismatched composition across video frames.
AST (Abstract Syntax Tree) sandbox: A static code analysis mechanism in SpatialClaw that inspects the structure of generated Python code before execution to reject unsafe patterns such as file I/O or unauthorized imports.
KD-tree search: A spatial data structure and associated search algorithm used by SpatialClaw agents to efficiently find nearest neighbors in point clouds, particularly for distance-type spatial reasoning questions.
RANSAC ground-plane fitting: A robust algorithm that estimates the equation of a ground plane from noisy 3D point data by iteratively fitting a model while rejecting outliers.
vLLM: A high-throughput inference serving framework used by SpatialClaw to run the language model backbone efficiently.
4D video reasoning: Spatial reasoning that involves understanding objects and their geometric relationships across both three spatial dimensions and time, requiring chained computations across video frames.
multi-view inference: Spatial reasoning that requires integrating geometric information from multiple camera viewpoints to answer questions about object positions or relationships.
training-free framework: A system that achieves its capabilities using only a fixed system prompt and existing pretrained model weights, without any additional fine-tuning, reinforcement learning, or task-specific training.
