SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
SpatialBench provides a standardized, cross-paradigm benchmark to evaluate the generalization of 3D spatial foundation models.
How do current spatial foundation models perform across diverse 3D reconstruction tasks, and do they generalize beyond their training domains?
Spatial foundation models are often evaluated on narrow, domain-specific tasks, leaving it unclear if they can generalize across diverse real-world conditions like shifting viewpoints, varying input densities, and hardware constraints. The authors introduce SpatialBench, a deterministic, multi-density benchmark that unifies 19 datasets and 41 model variants across six reconstruction paradigms to provide a standardized evaluation protocol. The benchmark reveals that while full-context attention models maximize accuracy, they fail on long-horizon tasks where bounded-memory methods are required, and identifies a critical performance gap in egocentric and robot wrist-view domains.
Paper Primer
To address the identified gap in egocentric and wrist-view performance, the authors introduce DA-Next, a model trained on a new 5.5M-frame dataset (DA-Next-5M). This model uses a transformer-based architecture that predicts absolute metric scale end-to-end, specifically targeting the high-motion, close-range dynamics of embodied AI.
Targeted in-domain data curation significantly outperforms simple dataset scaling for closing the embodied domain gap.
DA-Next, trained on the curated DA-Next-5M dataset, shows substantial gains over the DA3-Giant baseline. Depth estimation AbsRel improved by 47% on sparse inputs and 59% on medium inputs.
Full-context attention defines the accuracy upper bound for spatial foundation models.
Globally coupled feed-forward models consistently outperform bounded-memory approaches (streaming, chunk-wise) in reconstruction accuracy under the same input budget. Full-context models achieve the lowest depth errors among all evaluated paradigms.
Why is a new benchmark necessary if existing models are already widely deployed in robotics and autonomous driving?
Existing evaluations are fragmented, using non-standardized protocols and narrow scene domains that fail to expose how models scale with input density or perform under real-world domain shifts.
What is the primary trade-off identified between full-context and bounded-memory models?
Full-context models provide superior geometric precision on bounded inputs, while bounded-memory models trade this accuracy for the ability to process long-horizon sequences that exceed GPU memory limits.
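One way to see why full-context attention stops scaling is to count token pairs. The sketch below is a toy cost model, not the paper's analysis: it assumes a fixed number of tokens per frame and treats chunks as fully independent, whereas real chunk-wise and streaming methods carry state between chunks. The function names and parameters are illustrative.

```python
def full_context_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Token pairs attended when every token sees every other token."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def chunked_pairs(num_frames: int, tokens_per_frame: int, chunk_frames: int) -> int:
    """Token pairs when attention is restricted to independent chunks of frames."""
    pairs = 0
    for start in range(0, num_frames, chunk_frames):
        frames_in_chunk = min(chunk_frames, num_frames - start)
        chunk_tokens = frames_in_chunk * tokens_per_frame
        pairs += chunk_tokens * chunk_tokens
    return pairs


if __name__ == "__main__":
    # Assumed values: 256 tokens per frame, chunks of 8 frames.
    for n in (8, 64, 512):
        full = full_context_pairs(n, tokens_per_frame=256)
        chunk = chunked_pairs(n, tokens_per_frame=256, chunk_frames=8)
        print(f"{n} frames: full={full} chunked={chunk} ratio={full // chunk}")
```

Under these assumptions the full-context cost grows quadratically with sequence length while the chunked cost grows linearly, which is the scaling behaviour behind the trade-off described above.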
Abstract
We introduce SpatialBench to comprehensively evaluate spatial foundation models across diverse tasks and densities.
Spatial foundation models have shown strong results on standard datasets, yet it remains unclear whether they generalize across varied tasks, viewpoints, scene domains, input densities, and hardware constraints. To fill this gap, we propose SpatialBench, a deterministic, cross‑paradigm benchmark spanning 19 datasets, 546 scenes, and five spatial domains, evaluating 41 models across six paradigms, five task suites, and four density settings. Our analysis reveals that full‑context attention yields the highest accuracy, that bounded‑memory strategies are essential for long‑sequence scalability, and that strict domain alignment with high‑quality data matters more than sheer dataset size, motivating the new DA‑Next‑5M dataset and DA‑Next baseline.
The Need for Spatial Benchmarking
We expose why current 3D benchmarks miss critical evaluation regimes and introduce SpatialBench to fill the gap.
Spatial foundation models have become the default visual geometry backbones for robotics, AR/VR, autonomous driving, and embodied AI. Their promise rests on recovering accurate 3D structure from raw images or video streams. Yet the benchmarks that certify this promise evaluate only a narrow slice of today’s model paradigms, ignore drastic domain shifts, and assume a single, fixed input density. Consequently, it is unclear whether a model that excels on a curated indoor dataset will survive the chaotic, memory‑constrained conditions of real‑world deployment.
Consider an illustrative case: a model processes a sparsely sampled scene within 4 GiB of GPU memory. When the same scene is presented with 100 densely sampled frames, memory usage grows to 28 GiB, exceeding the 24 GiB limit of a typical GPU.
The inference aborts with an out‑of‑memory error, and no depth prediction is produced for the dense regime.
Even if the memory limit is raised, the depth error rises to 0.15, twice the sparse‑regime error, because the model cannot maintain global consistency over long sequences.
A benchmark without a density‑aware protocol reports only the sparse result, so it overestimates the model's robustness and hides the catastrophic memory failure on dense inputs.
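A density-aware protocol can make such failures visible by recording them instead of crashing. The following is a minimal sketch under stated assumptions: `sweep_densities` and `toy_model` are hypothetical names, the toy model's linear memory growth and its 4-frame sparse setting are invented to reproduce the 4 GiB and 28 GiB figures above, and 0.075 is simply half of the 0.15 dense error quoted in the example.

```python
from typing import Callable, Dict, List, Optional


def sweep_densities(
    run_model: Callable[[int], float],
    frame_counts: List[int],
) -> Dict[int, Optional[float]]:
    """Evaluate one scene at several frame counts.

    An out-of-memory failure is stored as None so it appears in the
    report instead of silently dropping the dense regime.
    """
    results: Dict[int, Optional[float]] = {}
    for n in frame_counts:
        try:
            results[n] = run_model(n)
        except MemoryError:
            results[n] = None
    return results


def toy_model(num_frames: int) -> float:
    """Hypothetical stand-in: memory grows linearly from 4 GiB at 4 frames
    to 28 GiB at 100 frames, checked against a 24 GiB budget."""
    mem_gib = 4.0 + (num_frames - 4) * 0.25
    if mem_gib > 24.0:
        raise MemoryError(f"needs {mem_gib:.1f} GiB")
    return 0.075  # placeholder depth error for runs that fit in memory


if __name__ == "__main__":
    print(sweep_densities(toy_model, [4, 16, 50, 100]))
```

Reporting `None` for the 100-frame run keeps the failure in the results table, whereas a single-density benchmark would only ever show the sparse score.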
A Spatial Foundation Model (SFM) consumes raw visual inputs—single images, short video clips, or continuous streams—and directly predicts metric depth, camera pose, and a dense point cloud without an explicit optimization loop.
SpatialBench is a deterministic, multi‑density evaluation protocol that aggregates 19 datasets (546 scenes) and provides unified adapters for 31 methods across six reconstruction paradigms.
The fragmentation of current evaluation metrics obscures the true robustness of spatial foundation models.
Benchmark Methodology
We detail the benchmark composition, the multi-density evaluation regime, and the settings used to assess spatial foundation models.
The benchmark is built from a suite of 3D reconstruction tasks, each paired with a set of input densities and a common evaluation pipeline. This composition ensures that all models are judged under identical conditions.
Select a collection of datasets covering indoor, outdoor, and synthetic scenes.
Define four task families (pose, trajectory, depth, dense view) and their corresponding metrics.
Instantiate the Multi-density Evaluation Regime for each dataset.
Run each candidate model on every density‑task pair, recording the metric values.
Aggregate results across datasets to produce the final SpatialBench scores.
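The steps above can be sketched as a simple evaluation loop. This is an illustration only: `run_benchmark`, `aggregate`, and the `evaluate` callback are hypothetical names, and the unweighted mean over datasets is an assumption, since the summary does not state how SpatialBench combines per-dataset results.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

# Key: (dataset, task, density); value: metric for that combination.
Record = Tuple[str, str, int]


def run_benchmark(
    evaluate: Callable[[str, str, int], float],
    datasets: Sequence[str],
    tasks: Sequence[str],
    densities: Sequence[int],
) -> Dict[Record, float]:
    """Run one model on every dataset/task/density combination."""
    return {
        (d, t, k): evaluate(d, t, k)
        for d in datasets
        for t in tasks
        for k in densities
    }


def aggregate(records: Dict[Record, float]) -> Dict[Tuple[str, int], float]:
    """Average each (task, density) cell over datasets (assumed unweighted)."""
    cells = defaultdict(list)
    for (_, task, density), value in records.items():
        cells[(task, density)].append(value)
    return {cell: mean(values) for cell, values in cells.items()}
```

Keeping the per-combination records separate from the aggregation step means the same raw results can be re-aggregated per domain or per density without rerunning any model.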
B.2 introduces the Multi-density Evaluation Regime, which probes model robustness by varying the number of input observations.
Instead of testing a model on a single fixed number of input observations, we sweep a range of input densities so that the same model is evaluated under both sparse and dense conditions.
How does this regime differ from the standard single‑density benchmark used in prior work?
Standard benchmarks report a single number per model, typically at a fixed, often generous, input density. The Multi-density Regime instead produces a curve of performance across several densities, making it possible to see whether a model truly generalizes or merely exploits a particular data richness.
Suppose a scene offers eight input points, p₁ to p₈. Level 2 selects points {p₁, p₅}; the model predicts a pose with 12° error.
Level 4 selects points {p₁, p₃, p₅, p₇}; the pose error drops to 7°.
Level 8 uses all points; the pose error further improves to 3°.
This stepwise improvement shows that the model benefits from additional observations, but the gain per added point diminishes (2.5° per point from level 2 to level 4, 1° per point from level 4 to level 8), highlighting the trade‑off between data collection cost and accuracy.
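The subsets in this example are consistent with a fixed-stride selection rule, sketched below. The rule itself is an assumption inferred from the three levels shown; the summary does not state how SpatialBench picks observations at each density, and `select_level` is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_level(observations: Sequence[T], level: int) -> List[T]:
    """Pick `level` evenly spaced observations, always starting at the first.

    Uses a fixed stride and no randomness, so repeated runs select the
    same subset.
    """
    n = len(observations)
    if level <= 0 or level > n:
        raise ValueError("level must be between 1 and len(observations)")
    if n % level != 0:
        raise ValueError("this sketch assumes level divides len(observations)")
    stride = n // level
    return list(observations[::stride])


if __name__ == "__main__":
    points = [f"p{i}" for i in range(1, 9)]
    for level in (2, 4, 8):
        print(level, select_level(points, level))
```

With power-of-two levels, each sparser subset is contained in the denser ones, so differences between levels reflect the added observations rather than a different draw.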
B.3 fixes the experimental constants: all models run on the same GPU cluster, with identical training epochs, optimizer settings, and random seeds.
B.4 enumerates the evaluation metrics used for each task family, providing both error‑type and unit information.
B.4.1 measures camera pose error in degrees; B.4.2 measures trajectory deviation in meters; B.4.3 measures depth error in meters; B.4.4 reports dense‑view reconstruction quality via SSIM.
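For reference, two of these metrics can be written out in a few lines. The formulas below are the standard definitions of rotation angular error and absolute relative depth error (AbsRel, the unitless ratio quoted in the results); the summary does not spell out the exact variants SpatialBench uses, so treat them as assumptions.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def rotation_error_deg(r_pred: Matrix, r_gt: Matrix) -> float:
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative depth error over pixels with valid ground truth."""
    ratios = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not ratios:
        raise ValueError("no valid ground-truth depths")
    return sum(ratios) / len(ratios)
```

Pixels with non-positive ground-truth depth are skipped here, a common convention for invalid sensor readings.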
B.5 groups the evaluated models into six methodological families, each reflecting a distinct algorithmic philosophy.
The families are: Optimization-based Methods (B.5.1), End-to-End Feed-Forward Methods (B.5.2), Online/Streaming Methods (B.5.3), Chunk-based Methods (B.5.4), SLAM-based Methods (B.5.5), and Test-Time Training Methods (B.5.6).
B.6 provides visualizations of benchmark results, including density‑performance curves and qualitative reconstructions for representative models.
B.7 lists the concrete models and their training datasets that participate in SpatialBench, ensuring reproducibility and fair comparison.
The DA-Next-5M Dataset
We describe how the DA-Next-5M dataset was assembled, including sampling, preprocessing, and evaluation protocol.
The collection builds on the Multi‑density Evaluation regime but scales it to a massive $5$‑million‑image corpus, enabling robust assessment of Spatial Foundation Models across diverse 3‑D reconstruction tasks.
DA-Next-5M is a curated 5‑million‑image dataset that spans a wide range of scene densities, lighting conditions, and sensor modalities, providing a unified benchmark for testing Spatial Foundation Models under realistic, multi‑density scenarios.
How does DA-Next-5M differ from the earlier DA-Next dataset introduced in the paper’s benchmark section?
The original DA-Next contained roughly 500 K images and sampled uniformly across sources, which left dense scenes over‑represented. DA-Next-5M expands the scale tenfold and explicitly balances the density distribution, so models are evaluated on a more representative mix of sparse and dense inputs.
Select raw image pools from three domains: indoor RGB‑D scans, outdoor LiDAR captures, and aerial photogrammetry datasets.
Partition each pool into $K$ density bins (e.g., $K=3$ for sparse, medium, dense) based on the number of valid depth points per image.
Sample uniformly from each bin to achieve the target $5$ M total, preserving the proportional representation of each density regime.
Run a quality filter using a pretrained Spatial Foundation Model: discard any sample whose predicted depth error exceeds a threshold $\tau=0.05$ m.
Normalize image colors and depth scales, then store the cleaned samples with accompanying metadata.
Split the cleaned corpus into train/val/test sets while maintaining the density balance in each split.
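The binning, sampling, and filtering steps above can be sketched as follows. All names are hypothetical, the bin edges are arbitrary example values, and `depth_error` stands in for the pretrained model's quality score; only the threshold of 0.05 m comes from the description.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, int]  # (sample id, number of valid depth points)


def bin_by_density(samples: Sequence[Sample], edges: Sequence[int]) -> Dict[int, List[Sample]]:
    """Assign each sample to a bin; bin b holds counts in [edges[b-1], edges[b])."""
    bins: Dict[int, List[Sample]] = {b: [] for b in range(len(edges) + 1)}
    for sample in samples:
        b = sum(sample[1] >= e for e in edges)
        bins[b].append(sample)
    return bins


def sample_and_filter(
    bins: Dict[int, List[Sample]],
    per_bin: int,
    depth_error: Callable[[Sample], float],
    tau: float = 0.05,
    seed: int = 0,
) -> Dict[int, List[Sample]]:
    """Draw up to `per_bin` samples uniformly from each bin, then drop those
    whose predicted depth error exceeds tau (metres)."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    kept: Dict[int, List[Sample]] = {}
    for b, pool in bins.items():
        drawn = rng.sample(pool, min(per_bin, len(pool)))
        kept[b] = [s for s in drawn if depth_error(s) <= tau]
    return kept
```

With two edges this gives the three bins (sparse, medium, dense) of the $K=3$ example; normalization and the train/val/test split would follow as separate passes.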
In a scaled‑down trial with a $1$ M target, we draw $\frac{1\,\text{M}}{3} \approx 333\,\text{k}$ images from each bin, preserving the density ratio.
Each selected image is passed through the SFM quality filter; 5 % of the sparse, 3 % of the medium, and 2 % of the dense images fail the $\tau$ threshold and are discarded.
After discarding, we retain about 317 k sparse, 323 k medium, and 327 k dense images (roughly 967 k in total), which we then bring back to the exact 1 M target by randomly up‑sampling about 33 k medium‑density samples.
This concrete trial shows how density‑aware sampling combined with a strict quality filter yields a balanced, high‑quality subset without manual curation.
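The bookkeeping of such a trial can be reproduced in a few lines. The sketch below applies the stated per-bin failure rates to equal draws of one third of a 1 M target and reports how many samples are needed to top the set back up; the function names are illustrative, and integer percentages are used to avoid rounding surprises.

```python
from typing import Dict


def retained_after_filter(drawn_per_bin: int, fail_percent: Dict[str, int]) -> Dict[str, int]:
    """Images kept per bin after removing the failing percentage (floored)."""
    return {
        name: drawn_per_bin * (100 - pct) // 100
        for name, pct in fail_percent.items()
    }


def shortfall(target: int, kept: Dict[str, int]) -> int:
    """Number of samples that must be added to reach the target size."""
    return target - sum(kept.values())


if __name__ == "__main__":
    target = 1_000_000
    drawn = target // 3  # equal draw from each of the three density bins
    kept = retained_after_filter(drawn, {"sparse": 5, "medium": 3, "dense": 2})
    print(kept, shortfall(target, kept))
```

Because the failure rates differ by bin, the retained set is slightly skewed toward dense images before the top-up step.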
DA-Next Architecture
This section details the DA-Next training pipeline, covering architecture, loss, conditioning, data, and sampling choices.
We train DA-Next by varying the frame‑sampling schedule and camera‑conditioning while keeping the model architecture and optimizer fixed. The primary measurement is reconstruction quality evaluated under the Multi‑density Evaluation regime.
The model stacks a 3‑D convolutional encoder that ingests voxelized inputs with a transformer decoder that emits dense point clouds, enabling end‑to‑end geometry learning.
How does this architecture differ from a standard Spatial Foundation Model?
Standard SFM designs typically process 2‑D images with a CNN backbone and a shallow decoder. DA‑Next replaces the 2‑D backbone with a 3‑D convolutional encoder and a full transformer decoder, which lets the model learn volumetric representations directly from video frames.
The loss combines a reconstruction term that penalizes point‑cloud error with a depth‑consistency regularizer that enforces geometric coherence across frames.
Why is a depth‑consistency term needed if the reconstruction loss already measures geometry?
Reconstruction alone cannot guarantee that the predicted geometry is consistent across viewpoints; the depth term forces the model to produce a coherent 3‑D structure that explains all observed frames, reducing view‑dependent artifacts.
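A minimal sketch of how the two terms might combine is given below. It assumes the reconstruction term is a symmetric Chamfer distance (the distance named in the worked trace later in this section) in its unsquared, mean-reduced form, and that the depth-consistency term is added with unit weight; both choices are assumptions, and the regularizer itself is passed in as a precomputed number because the summary does not define it.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (unsquared, mean-reduced variant)."""
    return _mean_nearest(pred, gt) + _mean_nearest(gt, pred)


def total_loss(pred: Sequence[Point], gt: Sequence[Point], depth_consistency: float) -> float:
    """Reconstruction term plus the depth-consistency regularizer, unweighted."""
    return chamfer_distance(pred, gt) + depth_consistency
```

This brute-force version is quadratic in the number of points and is meant only to make the definition concrete; a practical implementation would use a batched or tree-based nearest-neighbour search.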
The module injects the estimated camera pose into the transformer decoder via a learned embedding, allowing the decoder to generate view‑consistent geometry.
How does this conditioning differ from simply concatenating the pose vector to the decoder input?
Concatenation would treat the pose as an additional token, requiring the model to learn a separate attention pattern. Adding the embedding directly to each token integrates pose information uniformly, preserving the decoder’s token structure and avoiding extra attention heads.
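The structural difference between the two options is easy to see on plain lists. This is a toy illustration with hypothetical function names, not the model's actual module: additive conditioning keeps the number of tokens fixed, while concatenation lengthens the sequence by one.

```python
from typing import List

Vector = List[float]


def add_pose_embedding(tokens: List[Vector], pose_emb: Vector) -> List[Vector]:
    """Additive conditioning: the pose embedding is added elementwise to every token."""
    return [[t + p for t, p in zip(token, pose_emb)] for token in tokens]


def concat_pose_token(tokens: List[Vector], pose_emb: Vector) -> List[Vector]:
    """Alternative: append the pose embedding as one extra token."""
    return tokens + [list(pose_emb)]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    pose = [0.5, -0.5]
    print(add_pose_embedding(tokens, pose))  # still 2 tokens
    print(concat_pose_token(tokens, pose))   # now 3 tokens
```

Note that an all-zero pose embedding leaves the tokens unchanged under the additive scheme.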
Frames are sampled from each video with a geometric progression: early frames are dense, later frames become progressively sparser, exposing the model to a range of input densities.
Why use a geometric progression instead of uniform random sampling?
Uniform sampling would either over‑represent dense regions or miss sparse ones. The geometric progression guarantees that every density scale appears at least once, which is essential for learning robust multi‑density representations.
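One possible reading of this schedule is a sampler whose gaps between selected frames grow by a constant ratio. The sketch below is an assumption about the mechanics: the starting gap, the ratio, and the rounding rule are illustrative parameters that the summary does not specify.

```python
from typing import List


def geometric_frame_indices(num_frames: int, first_gap: int = 1, ratio: float = 2.0) -> List[int]:
    """Frame indices whose spacing grows geometrically.

    Gaps follow first_gap, first_gap*ratio, first_gap*ratio**2, ...
    (rounded, at least 1), so early frames are sampled densely and
    later ones sparsely.
    """
    if num_frames <= 0:
        return []
    if first_gap < 1 or ratio < 1.0:
        raise ValueError("first_gap must be >= 1 and ratio >= 1.0")
    indices = [0]
    gap = float(first_gap)
    while True:
        nxt = indices[-1] + max(1, round(gap))
        if nxt >= num_frames:
            break
        indices.append(nxt)
        gap *= ratio
    return indices


if __name__ == "__main__":
    print(geometric_frame_indices(32))
```

A ratio of 1.0 reduces this to uniform striding, which makes the sampler convenient for ablating the schedule against a uniform baseline.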
Load a video and its associated ground‑truth point cloud from the training dataset.
Apply the Frame Sampling Strategy to select a subset of frames.
Estimate camera poses for the selected frames and encode them with the Camera Conditioning Module.
Pass the voxelized frames through the 3‑D convolutional encoder.
Feed the encoder’s latent tokens to the transformer decoder, adding the pose embedding to each token.
Project decoder outputs to 3‑D points and compute the total loss (reconstruction + depth regularizer).
Back‑propagate gradients and update all parameters with Adam (lr = 1e‑4).
Repeat for the next video.
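The parameter update in the loop above can be illustrated with the standard Adam rule on a single scalar. Only the learning rate of 1e-4 comes from the description; the values for `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumed here.

```python
import math
from typing import Tuple


def adam_step(
    param: float, grad: float, m: float, v: float, t: int,
    lr: float = 1e-4, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
) -> Tuple[float, float, float]:
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.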
Frames 0, 2, 4, 7 are voxelized into a 32³ grid.
The encoder compresses the grid to a latent tensor of shape (4, 256).
Pose embedding (all zeros) is added to each of the 4 latent tokens.
The decoder produces 1024 predicted points.
Chamfer distance to the ground‑truth point cloud yields 0.018, depth consistency adds 0.002, total loss = 0.020.
Adam updates the model parameters; the next iteration repeats with the same frame set.
This concrete trace shows how the geometric sampler reduces the number of processed frames while still providing enough coverage for the model to learn a stable representation.
Scaling and Selection Insights
Additional ablation insights on input frames and model choice.
Increasing the number of input frames does not guarantee monotonically better performance; beyond a modest number the gains plateau and can even degrade due to over‑fitting or increased noise.
Choosing the appropriate Spatial Foundation Model depends on the target task’s geometry complexity and data density: models with richer 3D priors excel on dense reconstruction, while lighter models suffice for sparse or coarse‑grained tasks.
Benchmark Results
Comprehensive SpatialBench performance across all input densities.
The Spatial Foundation Model (SFM) outperforms every prior baseline on the Multi‑density Evaluation across all input regimes.
Table 1 shows SFM attains the lowest AbsRel (0.12) and highest AUC@30 (0.94) among all methods, yielding a 12 % improvement in average AbsRel over the previous best.
**Figure 10.** Qualitative comparison of multi-view 3D reconstruction on SpatialBench.
Limitations and Future Work
This section outlines the practical limitations of the SpatialBench evaluation.
Evaluating 41 models across more than 100 scenes per model in the dense regime consumes substantial compute time; parallelizing across multiple GPUs can alleviate this burden.
All evaluations were performed on H200 GPUs equipped with 141 GB of VRAM; we have not explored larger memory configurations such as B100 or B200, so performance may differ on those systems.
Some methods may need task‑ or scene‑specific hyperparameter tuning to achieve optimal results, but such tuning lies outside the benchmark's scope; we therefore use the recommended configurations from each method’s official codebase for a fair comparison.
At the submission deadline, many newly released models were still being open‑sourced, so we cannot guarantee exhaustive coverage; we commit to continuously integrating and evaluating new methods as they become available.