Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park

Vision-language models often conflate vertical image position with depth, relying on perspective shortcuts rather than true 3D reasoning.

Do vision-language models (VLMs) actually understand 3D spatial depth, or are they just using the 2D vertical position of objects in an image as a proxy for distance?

Vision-language models (VLMs) achieve high scores on spatial benchmarks, but it is unclear if this reflects genuine 3D understanding or a reliance on statistical shortcuts found in natural photographs. The authors introduce a representation-level analysis framework that measures how spatial axes are organized within model embeddings, alongside a synthetic benchmark, SpatialTunnel, that decouples vertical image position from depth. They find that models consistently conflate vertical position with distance—a "vertical-distance entanglement"—and that models with more structured, separable spatial representations are significantly more robust to distribution shifts.

Paper Primer

The paper identifies "vertical-distance entanglement": because objects farther away on a ground plane naturally appear higher in a 2D image, models learn to use vertical position as a proxy for depth. This shortcut allows models to pass standard benchmarks while failing when the vertical-depth correlation is broken.

The authors use contrastive probing to measure the geometric relationship between spatial axes in hidden states. They define "axis coherence" to track how stable a spatial direction is, and a "VD-Entanglement Index" to quantify the degree to which vertical and distance representations are coupled.

Benchmark accuracy overestimates spatial reasoning capabilities due to perspective bias.

Models perform substantially better on "consistent" examples (where vertical position aligns with depth) than on "counter-heuristic" examples (where it does not). On EmbSpatial-Bench, a top-performing model showed a 36.9-percentage-point accuracy gap between consistent and counter-heuristic subsets.

Internal representation structure predicts robustness.

Models with higher "distance coherence" and lower "VD-Entanglement Index" maintain higher accuracy when evaluated on the bias-controlled SpatialTunnel benchmark. Models with well-separated spatial axes exhibit greater robustness across diverse benchmarks, including BLINK and CV-Bench.

Why does this problem matter for current VLM development?

As VLMs are increasingly deployed in robotics and embodied agents, relying on 2D statistical shortcuts rather than 3D structure leads to brittle performance when the model encounters environments that deviate from standard photographic statistics.

What is the primary advantage of the SpatialTunnel benchmark over existing datasets?

SpatialTunnel uses a tunnel-shaped synthetic environment to decouple vertical image position from depth, allowing researchers to test spatial reasoning in configurations where the "higher equals farther" heuristic is explicitly removed.

Researchers should prioritize representation-level diagnostics over aggregate benchmark scores, as high accuracy can mask a reliance on brittle, perspective-driven shortcuts that fail in novel 3D environments.

Introduction: The Spatial Reasoning Gap

We expose a hidden vertical‑distance shortcut in VLMs and propose a synthetic benchmark to reveal it.

Vision‑Language Models (VLMs) process images and text jointly to perform reasoning tasks, yet they often rely on a vertical‑distance shortcut—assuming that higher image positions imply greater depth. This hidden bias can cause systematic failures when vertical position and 3D distance are decoupled.

In many VLMs the embedding direction that grows when an object moves upward in the image also grows when the object is farther away, so vertical position and depth share a single latent axis.

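Consider two balloons, A and B, at the same true distance from the camera, with B appearing higher in the image. The model normalizes their vertical positions: $\tilde y_A = 0.2$, $\tilde y_B = 0.8$.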

It maps the normalized vertical coordinate onto the learned depth direction, producing depth estimates $\hat d_A \propto \tilde y_A$ and $\hat d_B \propto \tilde y_B$.

Thus $\hat d_B > \hat d_A$, so the model predicts the higher balloon to be farther, despite both objects being at the same true distance.

This concrete walk‑through shows how the shared vertical‑depth direction forces the model to misinterpret “higher but close” as “farther away”.

**Fig. 1:** Many VLMs answer spatial questions via a perspective-driven shortcut, e.g., objects located higher in the image are further away in 3D. By confusing 2D vertical position with 3D distance, models fail systematically on counter examples. Our SpatialTunnel benchmark and contrastive probing expose this vertical-distance entanglement. In contrast, strong spatial VLMs show disentangled axes and consistent correctness across both real and synthetic settings.

The paper investigates whether VLM spatial reasoning is grounded in 3D understanding or 2D statistical shortcuts.

Behavioral Evidence of Shortcut Reliance

VLMs show a large accuracy drop on counter examples that defy the vertical‑distance shortcut.

VLMs often infer depth from the vertical position of objects, a bias we call vertical‑distance entanglement. This section quantifies how that bias harms spatial reasoning.

A consistent example follows the natural elevation cue: the farther object appears higher in the image, matching the expected vertical‑distance relationship.

A counter example violates the elevation cue: the farther object appears lower in the image, contradicting the vertical‑distance heuristic.

**Fig. 2:** *Consistent* vs. *counter* examples. *Consistent*: Farther object appears higher in the image; *Counter*: Farther object appears lower.

Across five VLM families, accuracy on counter examples is substantially lower than on consistent examples.

The average gap is about 30 percentage points. For example, Qwen2.5-VL fine-tuned on 2M samples scores 60.9% on consistent examples versus 24.0% on counter examples.

Models consistently perform worse on counter examples where vertical position contradicts depth.

Decoupling Depth with SpatialTunnel

We probe VLM depth reasoning using the synthetic SpatialTunnel benchmark.

SpatialTunnel is a synthetic tunnel scene where an object’s depth $z$ and its vertical image‑plane position are decoupled, allowing angular position $\theta$ to be varied without changing $z$.

How does SpatialTunnel differ from a standard 2‑D image dataset that mixes depth cues?

In SpatialTunnel the depth $z$ of each object is fixed while its angular coordinate $\theta$ is varied, so the vertical image location changes without any change in true distance. Conventional datasets conflate vertical position, size, and occlusion, making it impossible to isolate the vertical‑distance shortcut.

Generate SpatialTunnel scenes by sweeping angular positions.

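Take two objects, A at $\theta_1=0^\circ$ and B at $\theta_2=90^\circ$, placed at the same depth, and compute their screen coordinates: $\theta_1=0^\circ$ maps to the image centre, $\theta_2=90^\circ$ maps to the right edge.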

Render the scene: both objects appear at the same size because depth is identical, but A is centered vertically while B is shifted right.

Query “Is A closer than B?” – the ground‑truth answer is “No” because depths are equal; the model’s prediction will depend on its reliance on vertical cues.

Even though depth is identical, the change in image-plane position alone can flip the model's answer if it has learned a layout shortcut such as the vertical-distance heuristic.

**Fig. 3:** SpatialTunnel holds the two objects at fixed depths while sweeping their angular positions around the tunnel cross-section, so that 2D image-plane layout varies independently of depth ordering.

**Fig. 4:** Mean accuracy heatmaps on SpatialTunnel for Molmo-7B. Each cell indexes a joint angular configuration ($\theta_1, \theta_2$) of the two objects (red = higher accuracy; blue = lower). Gray indicates configurations outside the subset. From base $\rightarrow$ 400k $\rightarrow$ 2M training samples, accuracy on (a) perspective-consistent cells improves steadily. In contrast, (b) counter cells remain substantially harder, with the largest drop at 400k and a partial recovery at 2M.

Given a rendered image with two objects, the VLM receives a binary depth-comparison question (“Is obj₁ closer than obj₂?”). We compute $p$, the probability the model assigns to answering “Yes”, then define the correctness score $v$ as $p$ for a “Yes” ground truth or $1-p$ for a “No”. Aggregating $v$ yields four metrics: mean accuracy $v$, consistent accuracy $v_{\text{cons}}$, counter accuracy $v_{\text{ctr}}$, and the accuracy gap $\Delta = v_{\text{cons}} - v_{\text{ctr}}$, which quantifies vertical-distance entanglement.
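The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names, the `"cons"`/`"ctr"` subset labels, and the demo probabilities are all hypothetical, and it averages over individual records rather than over angular configurations.

```python
from statistics import mean

def correctness(p_yes, truth):
    """Correctness score v: p for a "Yes" ground truth, 1 - p for "No"."""
    if truth not in ("Yes", "No"):
        raise ValueError(f"unexpected ground truth: {truth!r}")
    return p_yes if truth == "Yes" else 1.0 - p_yes

def summarize(records):
    """records: iterable of (p_yes, truth, subset); subset is "cons" or "ctr"."""
    scores = {"cons": [], "ctr": []}
    for p_yes, truth, subset in records:
        scores[subset].append(correctness(p_yes, truth))
    v_cons, v_ctr = mean(scores["cons"]), mean(scores["ctr"])
    v_all = mean(scores["cons"] + scores["ctr"])
    return {"v": v_all, "v_cons": v_cons, "v_ctr": v_ctr, "gap": v_cons - v_ctr}

# Hypothetical probabilities for illustration only, not results from the paper.
demo = [(0.9, "Yes", "cons"), (0.2, "No", "cons"),
        (0.7, "No", "ctr"), (0.4, "Yes", "ctr")]
print(summarize(demo))
```

A positive `gap` means the model scores higher when the image layout agrees with the "higher equals farther" heuristic than when it contradicts it.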

All evaluated VLMs show a positive accuracy gap $\Delta$, confirming pervasive vertical‑distance entanglement.

Table 3 reports $\Delta$ values from +0.021 (Molmo‑7B + 400k) up to +0.416 (Molmo‑7B + 800k), with every model exhibiting $\Delta>0$.

Probing Internal Spatial Representations

We probe internal model representations to diagnose how spatial axes are encoded and why some models reason robustly.

Previous sections established that Vision-Language Models (VLMs) often rely on a vertical-distance shortcut, assuming higher objects are farther away. We now move beyond behavioral accuracy to analyze the internal representations that drive this behavior, identifying the structural signatures of robust spatial reasoning.

To isolate how a model encodes spatial relationships, we compare its internal state when asked about a relationship versus its inverse — the difference between these states reveals the model's "spatial direction" vector.

**Fig. 5: Contrastive probing for representation-level spatial analysis.** Given a spatial-relation VQA, we construct a minimal question pair by swapping the object order, which flips the ground-truth relation. We extract the final-token hidden state at an intermediate layer for each question and compute a delta vector as their difference, isolating the relational displacement in embedding space. Aggregated across samples, these vectors summarize the model's internal spatial representations and enable diagnosing systematic confounds among spatial cues.

We quantify representation quality using two metrics: Axis Coherence measures whether a model encodes a spatial direction consistently, while the VD-Entanglement Index (VD-EI) measures whether the vertical and distance axes are incorrectly coupled.
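To make the idea of a consistent spatial direction concrete, the sketch below builds delta vectors from paired hidden states and scores their agreement. The exact Axis Coherence formula is not given in this summary, so the mean pairwise cosine similarity used here is an assumed stand-in, and the hidden states are synthetic rather than extracted from a model.

```python
import numpy as np

def delta_vectors(h_question, h_swapped):
    """Per-sample delta: hidden state of a question minus that of its swapped twin."""
    return np.asarray(h_question, dtype=float) - np.asarray(h_swapped, dtype=float)

def mean_pairwise_cosine(deltas):
    """Assumed coherence proxy: mean cosine similarity over distinct pairs of deltas."""
    d = np.asarray(deltas, dtype=float)
    unit = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = unit.shape[0]
    # Exclude the diagonal (each vector compared with itself).
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

# Synthetic hidden states: every swap moves the state along one shared direction.
rng = np.random.default_rng(0)
dim, n = 64, 50
direction = np.zeros(dim)
direction[0] = 1.0
h_swapped = rng.normal(size=(n, dim))
h_question = h_swapped + direction + 0.01 * rng.normal(size=(n, dim))

coherent = mean_pairwise_cosine(delta_vectors(h_question, h_swapped))
incoherent = mean_pairwise_cosine(rng.normal(size=(n, dim)))
print(f"shared direction: {coherent:.3f}, random deltas: {incoherent:.3f}")
```

Under this proxy, deltas that share a direction score near 1, while unrelated deltas score near 0.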

Distance Coherence ($\text{Coh}_D$) emerges as a critical diagnostic. Across models, $\text{Coh}_D$ is consistently the weakest axis, and its stagnation during training often correlates with a failure to resolve the vertical-distance shortcut.

**Fig. 6:** Internal probing analysis of spatial representations. (a) Positive correlation between behavioral accuracy on counter examples and internal distance coherence ($\text{Coh}_D$). (b) Comparing distance coherence ($\text{Coh}_D$) against geometric entanglement (VD-EI) within the NVILA family; RoboRefer occupies a unique region of high coherence and low entanglement. Unlabeled points denote base models, and numeric labels (e.g., 80k) indicate data-mix fine-tuned variants.

**Fig. 7: PCA of delta vectors across models.** Each point is a delta vector colored by axis (orange: horizontal, green: vertical, purple: distance), with darker/lighter shades distinguishing opposing categories within each axis (e.g., left vs. right). Molmo (2M), NVILA (2M), and Qwen (2M) show separation along the horizontal and vertical axes, but distance delta vectors remain poorly distinguished. RoboRefer and Qwen3 exhibit three clearly separated clusters, each aligned with a distinct principal component.

Our analysis suggests that robust spatial reasoning requires both high distance coherence and low VD-EI. While scaling data can improve coherence, it does not guarantee the disentanglement of vertical and distance axes, which appears to require richer training regimes.

Conclusion and Summary

Appendix A derives the geometric link between depth and vertical image position on a ground plane.

We introduced a representation‑level diagnostic framework that exposes vertical-distance entanglement as a pervasive bias across VLM families. Models with high distance coherence and low VD‑Entanglement Index achieve stronger counter‑heuristic robustness and higher accuracy on spatial reasoning benchmarks.

To isolate this bias from dataset confounds we created the synthetic benchmark SpatialTunnel, which removes perspective‑driven correlations present in natural images.

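We derive how a standard pinhole camera projects points on a flat ground plane, yielding a monotonic relationship between depth and vertical image coordinate. Under zero tilt, a ground point at depth $Z$ projects to $v_{\text{img}} = f H_c / Z$, measured downward from the principal point (the horizon line), where $f$ is the focal length and $H_c$ the camera height. As $Z$ grows, $v_{\text{img}}$ shrinks toward the horizon, so greater depth corresponds to a higher vertical position in the image.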
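The ground-plane relation can be checked numerically. The sketch below takes $v_{\text{img}}$ as the offset below the principal point, in pixels, for a zero-tilt pinhole camera; the focal length, camera height, and depths are illustrative assumptions, not values from the paper.

```python
def ground_offset_below_horizon(focal_px, cam_height_m, depth_m):
    """Pixel offset below the principal point of a ground-plane point at depth_m."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * cam_height_m / depth_m

# Assumed camera: 1000 px focal length, mounted 1.5 m above the ground.
depths = [2.0, 5.0, 10.0, 50.0]
offsets = [ground_offset_below_horizon(1000.0, 1.5, z) for z in depths]
for z, v in zip(depths, offsets):
    print(f"depth {z:5.1f} m -> {v:6.1f} px below the horizon")
```

The offset falls as depth increases, so farther ground points sit closer to the horizon line and therefore higher in the frame. This is the correlation that the shortcut exploits.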

Experimental Setup Details

Full experimental details: models, data sources, mixes, and benchmarks.

Appendix B records the concrete experimental setup that underlies the results in Section 3, covering the vision‑language models, the spatial datasets that feed the training mixes, the composition of those mixes, and the evaluation benchmarks.

The five spatial datasets span synthetic, simulated, and real‑world domains. SAT provides procedurally generated 3‑D indoor scenes with fully automatic QA pairs. RoboSpatial supplies real indoor scans paired with egocentric RGB images and 3‑D annotations. SPAR‑7M aggregates multi‑view and single‑view QA from large indoor 3‑D corpora. RefSpatial combines web images, embodied videos, and simulated scenes to yield 2.5 M RGB‑D samples and 20 M QA pairs. PRISM offers synthetic tabletop grasping tasks with language‑conditioned instructions.

**Table 6.** Per-dataset sample counts for each data mix scale. The 80k–800k mixes use uniform allocation across datasets. The 2M mix uses all available samples from smaller datasets and subsamples larger ones (RefSpatial at ~3.3%, SAT and PRISM in full).

Benchmarks assess spatial reasoning from two complementary angles. EmbSpatial‑Bench evaluates embodied, egocentric tasks by rendering RGB‑D views from 3‑D environments and probing six agent‑centric relations (above, below, left, right, close, far). CV‑Bench repurposes standard vision datasets (ADE20K, COCO, Omni3D) into multiple‑choice VQA items that test 2‑D perception and 3‑D depth ordering.

**Table.** Comparison of Molmo-7B-O-0924, NVILA-Lite-2B, RoboRefer-SFT-2B, Qwen2.5-VL-3B, and Qwen3-VL-235B on two metrics, "Rel. Depth" and "Spat. Rel."

For the small BLINK subsets we report Wilson 95 % confidence intervals, providing a calibrated view of performance variability on the Rel. Depth (n = 124) and Spat. Rel. (n = 143) splits.
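For readers unfamiliar with the Wilson score interval, the sketch below implements the standard formula using only the Python standard library. The success count in the demo is hypothetical and chosen only to match the Rel. Depth subset size of 124; it is not a result from the paper.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical: 93 correct answers out of the 124 Rel. Depth items.
low, high = wilson_interval(93, 124)
print(f"accuracy {93 / 124:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small samples, which suits subsets of this size.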

SpatialTunnel Implementation Details

Appendix C details the SpatialTunnel setup, VQA protocol, proprietary model results, and object-size analysis.

All SpatialTunnel scenes are rendered in Blender with a 2 m × 2 m square tunnel. Each scene contains two objects—obj$_1$ (farther) and obj$_2$ (nearer)—positioned at independent angular coordinates while keeping depth fixed, yielding image pairs that differ only in 2‑D layout.

For each scene we randomize shape (sphere or cube), color (seven choices), size (base $s_1\!=\!0.2$, $s_2\!=\!0.1$ scaled by $[1.0,1.5]$), and lighting (Nishita sky with sun rotation $[1.25\pi,1.75\pi]$). This yields $16\times16\times12 = 3{,}072$ rendered images.

Depth‑comparison questions are posed with four templates that flip object order and polarity (e.g., “Is the obj$_1$ closer …?”). Ground truth for the four templates is No, Yes, No, Yes respectively. Across all angular configurations we generate 12,288 question‑image pairs and evaluate using the probability‑based protocol of Section 4.2, averaging template‑level correctness per configuration.

We evaluate three proprietary configurations: GPT‑5.2 (default), GPT‑5.2 with reasoning, and Gemini‑2.5‑Pro. Using exact‑match accuracy on the four templates, GPT‑5.2 (default) attains $0.613$ mean accuracy with a gap $\Delta\!=\!0.120$, reasoning improves it to $0.953$ and reduces the gap to $\Delta\!=\!0.058$, while Gemini‑2.5‑Pro reaches $0.919$ mean accuracy with a slight negative gap $\Delta\!=\!-0.028$ (Table 8).

To probe reliance on object size, we construct a size-controlled variant where $s_1 + s_2 = 0.4$ and sweep $s_1$ over 11 values, moving from size-consistent (farther object smaller) to size-conflicting (farther object larger) regimes. The size-bias gap $\Delta_s = v_{s_1=0.1} - v_{s_1=0.3}$ quantifies dependence on apparent size; a larger positive $\Delta_s$ indicates a stronger size-based shortcut.

Results (Table 9) mirror the vertical‑distance findings: models improve when size agrees with depth and degrade when it conflicts. Qwen remains near chance with negligible $\Delta_s$, while fine‑tuned Molmo‑7B (2M) achieves high mean accuracy $v\!=\!0.801$ but a large positive gap $\Delta_s\!=\!0.335$, indicating amplified size bias. RoboRefer‑2B‑SFT attains $v\!=\!0.804$ with a modest gap $\Delta_s\!=\!0.061$, suggesting greater robustness to both vertical‑position and size shortcuts.

**Table 1.** Performance comparison of various models across different training stages and baselines, measuring accuracy metrics ($Acc_{all}$, $Acc_{con}$, $Acc_{ctr}$) and the delta ($\Delta$).

**Fig. 8: Object-size variation in SpatialTunnel.** A representative scene rendered under six $(s_1, s_2)$ configurations with $s_1 + s_2 = 0.4$, where $s_1$ and $s_2$ denote the sizes of $obj_1$ and $obj_2$, respectively. $obj_1$ is always farther from the camera than $obj_2$. As $s_1$ increases from left to right, the farther object grows while the nearer object shrinks, moving from a size-consistent to a size-conflicting configuration.

**Fig. 9: Correctness as a function of object size.** Mean logit-based correctness, averaged over all question templates, as a function of $obj_1$ size (bottom axis) and $obj_2$ size (top axis), with $s_1 + s_2 = 0.4$. Each curve corresponds to one training-data variant. Molmo and NVILA show clear degradation as the farther object becomes larger than the nearer one, whereas Qwen remains near chance throughout, indicating weak depth reasoning in this setting.

**Table.** Performance comparison of different models across training-data scales (base, +80k, +400k, +800k, +2M) using metrics $v$, $v_{s_1=0.1}$, $v_{s_1=0.3}$, and the difference $\Delta_s$.

Contrastive Probing Methodology

Additional methodological details and full visualizations for contrastive probing.

Swap pair construction creates minimal contrastive examples by swapping the two queried objects. For horizontal and vertical relations the question “Is A left/right of B?” becomes “Is B left/right of A?”, flipping the ground‑truth label while preserving all visual context.

Distance pairs require a different recipe because EmbSpatial‑Bench poses depth queries as four‑choice questions. We locate the target object from the correct answer, then sample a reference uniformly from the remaining distractors; swapping their roles yields the same label‑flip structure as the horizontal/vertical cases.

VD‑EI (Vertical‑Distance Entanglement Index) is positive when perspective‑aligned pairs (above↔far, below↔close) are more similar than perspective‑opposing pairs, and near zero when the two groups cancel out. It reaches its maximum when aligned cosines are high and opposing cosines are negative.
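The qualitative behaviour described above can be reproduced with a simple index. The exact VD-EI formula is not given in this summary, so the sketch assumes one plausible form: the mean cosine over perspective-aligned pairs minus the mean cosine over perspective-opposing pairs, computed on per-category mean delta vectors. The vectors below are toy values.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def vd_entanglement(mean_delta):
    """Assumed index: mean aligned cosine minus mean opposing cosine."""
    aligned = [("above", "far"), ("below", "close")]
    opposing = [("above", "close"), ("below", "far")]
    def avg(pairs):
        return np.mean([cosine(mean_delta[a], mean_delta[b]) for a, b in pairs])
    return float(avg(aligned) - avg(opposing))

# Toy mean delta vectors: "far" leans toward "above" in the entangled case.
entangled = {"above": [0, 1, 0], "below": [0, -1, 0],
             "far": [0, 0.8, 0.6], "close": [0, -0.8, -0.6]}
separated = {"above": [0, 1, 0], "below": [0, -1, 0],
             "far": [0, 0, 1], "close": [0, 0, -1]}
print(vd_entanglement(entangled), vd_entanglement(separated))
```

With this form the index is positive when aligned pairs are more similar than opposing pairs, near zero when the vertical and distance axes are orthogonal, and largest when aligned cosines are high and opposing cosines are negative.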

Layer selection follows a three‑step priority: (1) pick a layer where coherence across horizontal, vertical, and distance axes plateaus; (2) ensure VD‑EI is stable at that depth; (3) avoid the final few layers that are tuned for next‑token prediction. When criteria conflict we keep the layer that satisfies the first two while remaining in the intermediate region.

Robustness checks sample 1 000 random layers within the candidate range and recompute the cross‑model Distance‑Coherence ranking; the resulting Spearman $\rho$ of 0.928 confirms that the reported ordering is insensitive to the exact layer choice.
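A layer-robustness check of this kind might look like the sketch below. It assumes a single layer index is drawn per trial and shared across models, and that the reported figure is an average of per-trial Spearman correlations against a reference layer; neither detail is confirmed by the text. The coherence matrix is synthetic.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for sequences without ties."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    n = len(rx)
    d2 = np.sum((rx - ry) ** 2)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def layer_robustness(coh, ref_layer, candidates, trials=1000, seed=0):
    """Mean rho between the reference-layer ranking and random-layer rankings."""
    rng = np.random.default_rng(seed)
    layers = rng.choice(candidates, size=trials)
    rhos = [spearman_rho(coh[:, ref_layer], coh[:, layer]) for layer in layers]
    return float(np.mean(rhos))

# Synthetic coherence table: 6 hypothetical models x 12 layers.
rng = np.random.default_rng(1)
base = np.linspace(0.2, 0.7, 6)
coh = base[:, None] + 0.02 * rng.normal(size=(6, 12))
mean_rho = layer_robustness(coh, ref_layer=6, candidates=np.arange(4, 10))
print(f"mean Spearman rho: {mean_rho:.3f}")
```

A mean correlation close to 1 indicates that the model ordering barely changes when a different intermediate layer is probed.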

Figure 10 compares Distance‑Coherence on the synthetic SpatialTunnel benchmark versus the real EmbSpatial‑Bench. Although absolute values differ, the relative ordering of models within each family is largely preserved, especially for the NVILA family where the ranking is identical across domains.

**Fig. 10:** Distance Coherence measured on synthetic (SpatialTunnel) vs. real (EmbSpatial-Bench) datasets. Gray bars denote $Coh_D$ computed on SpatialTunnel; red dots denote $Coh_D$ on EmbSpatial-Bench. Although the absolute magnitudes differ across domains, the relative ordering within each model family is largely preserved. The red box highlights the NVILA family, where the ranking is identical across both datasets.

Heatmaps (Figures 11–13) reveal that opposite directions on the same axis (e.g., left vs. right) have cosine similarity near –1, confirming antiparallel encoding. Horizontal categories are largely orthogonal to vertical and distance axes, while vertical–distance pairs (above↔far, below↔close) exhibit positive similarity, evidencing the entanglement described earlier.

**Fig. 11:** Cross-category similarity heatmaps for the Molmo family. Each cell shows the cosine similarity between mean delta vectors of two categories. Variants range from vanilla (base Molmo-7B) to 2M (SFT with 2M-sample data mix).

**Fig. 12:** Cross-category similarity heatmaps for the NVILA family. Variants include NVILA-Lite-2B from vanilla (base) through 2M (SFT), plus RoboRefer (RoboRefer-2B-SFT).

**Fig. 13:** Cross-category similarity heatmaps for the Qwen family. Variants include Qwen2.5-VL-3B-Instruct (vanilla through 2M) and Qwen3-VL-235B-A22B-Instruct.

2‑D PCA visualizations (Figures 14–16) illustrate that directional categories (left/right, above/below) separate cleanly along principal components, while distance categories (far/close) cluster near the origin, confirming weaker separation.
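The projection behind such plots can be sketched with a plain SVD-based PCA. The delta vectors here are synthetic and constructed so that the distance axis is deliberately weak; the scales, dimensions, and helper names are assumptions for illustration, not the paper's data.

```python
import numpy as np

def pca_project(x, k=2):
    """Project rows of x onto the top-k principal components."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s[:k] ** 2 / np.sum(s ** 2)
    return centered @ vt[:k].T, explained

def synthetic_deltas(axis, scale, n, dim, rng):
    """Hypothetical deltas: +/- scale along one coordinate plus small noise."""
    signs = np.repeat([1.0, -1.0], n // 2)
    d = 0.1 * rng.normal(size=(n, dim))
    d[:, axis] += signs * scale
    return d

rng = np.random.default_rng(0)
horizontal = synthetic_deltas(0, 3.0, 80, 32, rng)  # left vs. right
vertical = synthetic_deltas(1, 2.0, 80, 32, rng)    # above vs. below
distance = synthetic_deltas(2, 0.2, 80, 32, rng)    # far vs. close (weak)
points, explained = pca_project(np.vstack([horizontal, vertical, distance]))

radius = np.linalg.norm(points, axis=1)
print("mean radius (horizontal, vertical, distance):",
      radius[:80].mean(), radius[80:160].mean(), radius[160:].mean())
```

Because the top two components are dominated by the strong horizontal and vertical directions, the weak distance deltas project close to the origin, which is the pattern the 2D plots describe.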

**Fig. 14:** 2D PCA of delta vectors for the Molmo family. Each point represents a per-sample delta vector, colored by spatial category. Opposing categories (e.g., left vs. right) separate along shared principal components, while far/close overlap with above/below, reflecting vertical-distance entanglement.

**Fig. 15:** 2D PCA of delta vectors for the NVILA family. RoboRefer shows notably tighter distance-axis clusters (far/close) separated from vertical categories, consistent with its higher $\text{Coh}_D$ and lower VD-EI.

**Fig. 16:** 2D PCA of delta vectors for the Qwen family. Variants include Qwen2.5-VL-3B-Instruct and Qwen3-VL-235B-A22B-Instruct. Qwen3-VL-235B exhibits markedly cleaner cluster separation across all three axes.

3‑D PCA visualizations (Figures 17–19) extend these observations, showing that larger models develop a distinct distance axis in addition to horizontal and vertical axes, especially for Qwen3‑235B.

**Fig. 17:** 3D PCA of delta vectors for the Molmo family. A distinct distance axis does not clearly emerge, although delta vectors along the horizontal and vertical axes become better clustered with data scaling.

**Fig. 18:** 3D PCA of delta vectors for the NVILA family. RoboRefer’s distance clusters (far/close) occupy a distinct subspace from vertical categories, unlike the fine-tuned variants.

**Fig. 19:** 3D PCA of delta vectors for the Qwen family. Variants include Qwen2.5-VL-3B-Instruct and Qwen3-VL-235B-A22B-Instruct. Qwen3-VL-235B shows clear three-way separation among horizontal, vertical, and distance axes in 3D space.
