Latent Spatial Memory for Video World Models

Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang, Yefei He, Zicheng Duan, Donny Y. Chen, Yuqing Yang, Bohan Zhuang

Mirage maintains 3D-consistent video generation by caching latent features in a 3D spatial memory.

How can we maintain 3D spatial consistency in video world models without the high computational cost of explicit RGB point cloud rendering?

Video world models struggle to maintain 3D consistency over long trajectories, often accumulating geometric drift because they lack a persistent scene representation. Existing solutions use RGB point clouds, but these require expensive, lossy pixel-space rendering and re-encoding at every step. Mirage solves this by storing the diffusion model’s own latent features in a 3D cache, allowing the model to query scene information directly in its native latent space. This approach eliminates the pixel-space bottleneck, achieving up to 10.57× faster generation and a 55× reduction in memory footprint compared to RGB-based baselines.

Paper Primer

Mirage operates through an initialize-readout-update cycle: it lifts initial latent tokens into a 3D cache via depth-guided back-projection, queries this cache by projecting latent features onto target camera grids, and iteratively updates the memory with new frames while filtering out dynamic objects.

Mirage achieves state-of-the-art performance on the WorldScore benchmark for world-consistent video generation.

Mirage outperforms both memory-augmented baselines and foundation video generators on 3D and photometric consistency metrics, reaching a 70.36 average score on WorldScore versus 64.36 for the strongest baseline.

Why is storing memory in latent space superior to the standard RGB point cloud approach?

RGB-based memories force a costly round-trip through pixel space (rendering and re-encoding) at every step, which introduces artifacts and discards the model's native latent features. Mirage's latent cache avoids these operations, keeping the conditioning signal perfectly aligned with the diffusion backbone's input space.

What is the scope of Mirage's spatial memory?

Mirage is designed for rigid scene geometry; it explicitly filters out dynamic objects and sky regions during cache updates to prevent unreliable or transient content from contaminating the persistent 3D scaffold.

By moving spatial memory into the latent manifold, Mirage makes long-horizon, geometrically consistent world generation computationally practical, effectively decoupling memory growth from pixel-space rendering costs.

Introduction and Motivation

We expose the inefficiency of RGB‑based point‑cloud memories and introduce latent spatial memory.

Current video world models preserve 3D consistency by storing an explicit RGB point cloud, rendering it each step, and re‑encoding the result. This round‑trip is both computationally expensive and lossy because the VAE reconstruction discards rich latent features.

Instead of caching colored points, the model stores full‑channel latent features at their world‑space coordinates, enabling direct latent‑resolution queries.

**Figure 2.** Latent spatial memory vs. RGB point cloud based memory for world model. Top: prior systems store memory as colored points and pay a rasterise-and-encode round trip at every conditioning step. Bottom: latent spatial memory stores latent features at world-space locations and reads them back through a single latent-resolution projection, eliminating the per-step pixel-space detour and shrinking the cache footprint by the squared VAE compression factor.

By caching latent features directly, Mirage can query the memory with a single latent‑resolution projection, achieving up to 10.57× faster generation and a 55× reduction in GPU memory compared with RGB‑based baselines.

**Figure 1.** Geometrically consistent videos generated by Mirage with latent spatial memory. Given a single input image and a user-specified camera trajectory (left), Mirage preserves spatial consistency by caching 3D information directly in the latent space, rather than as an RGB-colored point cloud. This design enables memory queries through a single latent-resolution projection, avoiding the costly rasterize-and-encode round trip required by prior RGB point-cloud memories. Consequently, Mirage can faithfully return to previously observed regions even after large camera detours, while achieving up to 10.57× faster end-to-end video generation and 55× lower GPU memory usage than RGB-cache baselines.

The key shift is moving from an RGB‑based point‑cloud cache to a latent‑space cache, which removes costly pixel‑space round‑trips and preserves the model’s native features.

The Mirage Method

Mirage builds a persistent latent 3D cache and iterates a readout‑update loop over video chunks.

Standard video world models repeatedly render RGB point clouds and re‑encode them, wasting compute; Mirage instead caches 3D information directly in latent space and reuses it across chunks.

Each latent cell is lifted into world coordinates by unprojecting its pixel location with the depth map and camera pose, turning a 2‑D latent grid into a 3‑D point cloud.

How does this back‑projection differ from the rasterize‑and‑re‑encode step used by RGB‑point‑cloud memories?

RGB pipelines first render a full‑resolution image, then run a separate encoder on the pixels; depth‑guided back‑projection directly lifts latent tokens into 3‑D without any pixel‑space rendering, eliminating the costly render‑encode loop.

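As a toy example, take a $2\times2$ latent grid whose four cells hold the latent tokens $a$, $b$, $c$, and $d$.

Cell $(0,0)$ back‑projects to $p_{00}=(0,0,1)$ and stores $(p_{00},a)$.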

Cell $(0,1)$ back‑projects to $p_{01}=(1,0,2)$ and stores $(p_{01},b)$.

Cell $(1,0)$ back‑projects to $p_{10}=(0,1,2)$ and stores $(p_{10},c)$.

Cell $(1,1)$ back‑projects to $p_{11}=(1,1,1)$ and stores $(p_{11},d)$.

The resulting memory $M$ contains four points, each paired with its latent token.

This tiny example shows how a regular 2‑D latent map becomes a 3‑D point cloud, preserving the exact latent features needed for later diffusion steps.
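The lifting step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Intrinsics`, `back_project`, and `lift_latent_grid` are hypothetical, the extrinsics are assumed to map world to camera as `x_cam = rot @ x_world + trans`, and the intrinsics and depths are made-up values, so the resulting coordinates differ from the toy numbers above.

```python
import math
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole intrinsics at latent resolution (hypothetical container)."""
    fx: float
    fy: float
    cx: float
    cy: float


def back_project(u, v, depth, k, rot, trans):
    """Lift latent cell (u, v) with the given depth to a world-space point.

    rot (3x3 nested lists) and trans (3-vector) are ASSUMED to map
    world -> camera: x_cam = rot @ x_world + trans.
    """
    # Unproject the cell centre into camera space and scale by depth.
    x_cam = ((u + 0.5 - k.cx) / k.fx * depth,
             (v + 0.5 - k.cy) / k.fy * depth,
             depth)
    # Invert the rigid transform: x_world = rot^T @ (x_cam - trans).
    diff = [x_cam[i] - trans[i] for i in range(3)]
    return tuple(sum(rot[r][c] * diff[r] for r in range(3)) for c in range(3))


def lift_latent_grid(latents, depths, k, rot, trans):
    """Return (world_point, latent_token) pairs, skipping invalid depths."""
    memory = []
    for v, row in enumerate(latents):
        for u, token in enumerate(row):
            d = depths[v][u]
            if math.isfinite(d) and d > 0:
                memory.append((back_project(u, v, d, k, rot, trans), token))
    return memory


# Illustrative inputs only: identity pose, unit focal length, 2x2 grid.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
k_latent = Intrinsics(fx=1.0, fy=1.0, cx=1.0, cy=1.0)
latents = [["a", "b"], ["c", "d"]]
depths = [[1.0, 2.0], [2.0, 1.0]]
memory = lift_latent_grid(latents, depths, k_latent, IDENTITY, [0.0, 0.0, 0.0])
```

Each entry of `memory` pairs a 3-D point with the untouched latent token, which is the property the cache relies on: no pixel-space rendering or re-encoding happens while lifting.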

For each target view the cache is projected onto the camera grid (readout), the diffusion backbone denoises the chunk using those latents, then the newly generated frames are encoded and written back into the cache (update).

Project $M$ onto the target view $(E_t,K_t)$, obtain frontmost indices $i_t(u,v)$ and latent tensor $\hat{z}_t$ (Eq. 5).

Form visibility mask $m_t$ indicating which cells received a projection.

Feed $(\hat{z}_t, m_t)$ into the ControlNet side branch; the diffusion backbone denoises the chunk entirely in latent space.

Decode the denoised latents into RGB frames $I_t$.

Estimate depth and pose for $I_t$, encode $I_t$ with the VAE to obtain $\tilde{z}_t$, back‑project $\tilde{z}_t$ into world space, and merge the new points into $M$ (filtering dynamic/sky regions).
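The readout in the first two steps can be illustrated with a small z-buffer. This is a hedged sketch rather than the paper's Eq. 5: the function name `readout`, the tuple layout of `k`, and the world-to-camera convention `x_cam = rot @ x_world + trans` are assumptions made for the example.

```python
import math


def readout(memory, k, rot, trans, h, w):
    """Project (world_point, token) pairs onto an h x w latent grid.

    Returns (z_hat, mask). z_hat[v][u] holds the token of the nearest
    point that lands in cell (u, v), or None if the cell is empty.
    mask[v][u] is True where at least one point landed.
    k = (fx, fy, cx, cy) at latent resolution (assumed layout).
    """
    z_hat = [[None] * w for _ in range(h)]
    best_depth = [[math.inf] * w for _ in range(h)]
    for point, token in memory:
        # World -> camera (assumed convention).
        x, y, z = (sum(rot[r][c] * point[c] for c in range(3)) + trans[r]
                   for r in range(3))
        if z <= 0:
            continue  # point is behind the camera
        u = math.floor(k[0] * x / z + k[2])
        v = math.floor(k[1] * y / z + k[3])
        if 0 <= u < w and 0 <= v < h and z < best_depth[v][u]:
            best_depth[v][u] = z  # keep only the frontmost point per cell
            z_hat[v][u] = token
    mask = [[d < math.inf for d in row] for row in best_depth]
    return z_hat, mask


# Illustrative inputs: three points on the optical axis of an identity camera.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [((0.0, 0.0, 2.0), "far"),
          ((0.0, 0.0, 1.0), "near"),
          ((0.0, 0.0, -1.0), "behind")]
z_hat, mask = readout(points, (1.0, 1.0, 1.0, 1.0), IDENTITY, [0.0, 0.0, 0.0], 2, 2)
```

Here the nearer of the two visible points wins its cell and the point behind the camera is dropped, so the side branch would receive one filled cell plus a visibility mask marking the other three as empty.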

Why does Mirage query the cache directly in latent space instead of converting it back to pixels first?

Latent‑space queries avoid the expensive rasterization and re‑encoding of RGB point clouds; the diffusion model operates natively on latents, so the cache can be accessed with a simple projection and a single tensor lookup.

**Figure 3. Overview of Mirage.** Mirage initializes a 3D latent cache from $I_0$ by encoding it into VAE latents and lifting them with depth-guided back-projection. At each target view, the cache is read through a latent-resolution projection, and generation proceeds chunk by chunk: each decoded output chunk is re-estimated for depth, re-encoded into latents, and back-projected to extend the cache.

Mirage’s latent spatial memory lets a video diffusion model reuse 3‑D geometry across chunks without ever leaving latent space.

Related Work

We position our approach relative to prior video diffusion and spatial memory methods.

Recent video synthesis work can be grouped into three strands: diffusion‑based generators, camera‑controlled conditioning, and explicit spatial‑memory augmentations.

Early approaches extended image diffusion by adding temporal attention or 3‑D convolutions; later systems scale to multi‑second, high‑resolution video via large‑scale pre‑training.

Camera‑controlled methods inject camera pose or epipolar cues into diffusion pipelines to steer viewpoint changes during a single generation pass.

Spatial‑memory approaches accumulate a 3‑D cache (often RGB point clouds) across generation steps and render conditioning images from it.

A straightforward baseline stores rendered RGB point‑cloud frames in a persistent cache and re‑encodes them at every generation step, incurring high compute and memory overhead.

Our method reports up to 10.57× end‑to‑end speedup and a 55× reduction in memory usage compared to these baselines.

Preliminaries

We show why pixel‑space caching is costly and introduce a latent‑space cache to avoid it.

Because the autoregressive denoiser only receives a few recent latent frames, earlier observations fade and the model drifts geometrically when the camera returns to a known area. A naïve fix is to keep a full‑resolution RGB point cloud $M_{rgb}$, but rasterising and re‑encoding that cache at every step is like printing a complete photograph for each frame instead of keeping a compact thumbnail that can be expanded on demand.

The model stores a compact latent representation of the scene’s 3D geometry, so future frames can read the geometry directly without a costly pixel‑space round‑trip.

Depth‑guided back‑projection projects $p_1$ and $p_2$ into a $2\times2$ pixel grid and the encoder $E$ maps the resulting RGB patch to a $4$‑dimensional latent vector $z_{\text{cache}}^{(0)}$.

Frame 1 is generated; its latent output $z^{(1)}$ is added to the cache: $z_{\text{cache}}^{(1)} = z_{\text{cache}}^{(0)} + z^{(1)}$.

Frame 2 is generated; the denoiser receives $z_{\text{cache}}^{(1)}$ directly, so no rasterisation occurs.

After frame 2 is added, the cache holds the summed latent geometry $z_{\text{cache}}^{(2)}$, which encodes both points and their accumulated appearance.

The latent cache stores geometry at latent resolution, so its memory footprint and per‑step read cost grow far more slowly than those of an RGB cache as more observations are added.

How does the latent cache differ from the traditional RGB point‑cloud cache?

The RGB cache stores raw colors in world coordinates and must be rasterised back into pixel space before the VAE can consume it, incurring a full‑resolution render and a second encoding step. The latent cache, by contrast, keeps geometry already encoded in the VAE’s latent tensor, so future frames read it directly without any pixel‑space conversion.

WorldScore Evaluation

Mirage dominates the WorldScore benchmark, setting new performance records.

The WorldScore benchmark measures video generation quality across controllability, consistency, and motion dimensions. Mirage’s latent‑space cache lets it excel without the costly RGB point‑cloud re‑encoding of prior models.

Mirage attains the highest WorldScore Average Score, beating the runner‑up by 6.0 points.

Table 1 shows the next‑best average (64.36) belongs to VideoCrafter2.

Mirage’s latent‑space memory yields a clear performance edge on WorldScore, especially in 3D, photo, and style consistency.

RealEstate10K Evaluation

Mirage sets new closed‑loop PSNR records on RealEstate10K, confirming superior 3‑D consistency.

Mirage achieves the highest closed‑loop PSNR on RealEstate10K, beating the strongest baseline by 0.67 dB.

Table 2 shows Mirage’s closed‑loop PSNR (PSNR$_C$) of 20.05 dB versus Spatia’s 19.38 dB.

Across all seven baselines, Mirage also improves SSIM (0.779 vs 0.646) and reduces LPIPS (0.250 vs 0.254) in novel‑view synthesis, while delivering the best closed‑loop SSIM (0.825) and competitive LPIPS (0.228).
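For readers unfamiliar with the metric, the snippet below sketches how a closed-loop PSNR could be computed. It assumes the score is the standard PSNR between the input frame and the frame generated when the camera returns to its starting pose; the exact evaluation protocol is not spelled out here, and the function name `psnr` and the tiny 2×2 "images" are illustrative only.

```python
import math


def psnr(reference, generated, peak=1.0):
    """Standard PSNR in dB between two equally sized grayscale images.

    Images are nested lists of floats in [0, peak].
    """
    diffs = [(r - g) ** 2
             for ref_row, gen_row in zip(reference, generated)
             for r, g in zip(ref_row, gen_row)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Illustrative data: the revisit frame is off by 0.1 at every pixel.
input_frame = [[0.0, 0.5], [0.9, 0.25]]
revisit_frame = [[0.1, 0.6], [1.0, 0.35]]
closed_loop_psnr = psnr(input_frame, revisit_frame)
```

A uniform error of 0.1 on a unit intensity range gives a mean squared error of 0.01, which corresponds to roughly 20 dB, the same order as the closed-loop scores quoted above.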

RealEstate10K is a video benchmark of indoor real‑estate scenes, offering thousands of clips with varied room layouts and camera motions to test 3‑D consistency of video models.

**Figure 4.** Open-domain video comparison. Generations on out-of-domain prompts spanning outdoor and natural scenes that lie far from the RealEstate10K training distribution. Mirage generalizes beyond indoor real-estate footage, producing temporally smooth and 3D-consistent rollouts under aggressive camera motion, whereas RGB point cloud baselines show stretched textures on unseen layouts and foundation video generators drift in geometry.

Efficiency and Scaling Analysis

Mirage delivers order‑of‑magnitude speedups and memory savings while setting new quality records.

Mirage generates video 10.57× faster than RGB‑cache pipelines while using 55× less GPU memory.

Efficiency analysis shows latent spatial memory removes the per‑step pixel‑space round‑trip, yielding a 10.57× speedup and a 55× memory reduction at matched rollout length.

**Figure 5.** Efficiency scaling with rollout progress. Per-frame cache-read time (left) and peak cache footprint (right) measured on a single NVIDIA H100 across five autoregressive rollout chunks. Numbers above each bar report raw measurements (in s/frame and MiB respectively); the y-axes use a linear scale so the gap between methods is shown faithfully. After the first chunk amortises a one-off setup pass, Mirage settles at a per-frame cost of 0.25 s and a cache footprint that grows by less than 0.5 MiB per chunk. RGB-cache baselines (Spatia, Gen3C) require an order of magnitude more memory and one-to-two orders of magnitude more time per frame, since every conditioning step re-rasterises the accumulated point cloud and re-encodes the result through the VAE. The view-memory baseline VMem keeps memory bounded but still scales linearly because its retrieval cost grows with the number of stored views. Latent spatial memory removes the pixel-space round trip from the per-step critical path, leaving the conditioning loop with a single latent-resolution projection.

**Figure 6.** Video comparison on RealEstate10K. Each block shows one RealEstate10K trajectory, with rows corresponding to Voyager, Spatia, VMem, and Mirage (Ours), and columns showing uniformly sampled frames over time. Across indoor and outdoor scenes, Mirage preserves sharper structure and more stable appearance under camera motion, while baselines exhibit geometry drift, texture distortion, or accumulated artifacts.

**Figure 7.** Closed-loop revisit comparison on RealEstate10K. In the closed-loop test, the camera trajectory gradually returns to its starting point. The comparison between the last frame and the input frame shows that Mirage maintains strong consistency under the revisit setting.

Ablation Studies

We quantify how each Mirage component affects performance on the WorldScore benchmark.

We evaluate each design choice by disabling or swapping a single component while keeping the rest of Mirage unchanged. The results are reported on a held‑out split of the WorldScore prompts.

Replacing the latent cache with an RGB point cloud drops the average score by 2.65 points and harms both 3‑D and photometric consistency, confirming that the latent cache preserves information the backbone cannot recover from raw RGB. Upsampling features to full‑resolution before readout further degrades all metrics, showing that geometry‑aligned downsampling is both cheaper and more faithful. Removing the dynamic‑object mask harms long‑horizon stability, especially 3‑D and photo consistency, because stale moving content re‑enters the memory. Finally, collapsing the two‑stage schedule into a single joint training phase reduces quality across the board, indicating that early‑stage backbone freezing stabilises optimisation.

Swapping the default DepthAnything 3 reconstructor for the noisier MapAnything or UniDepth predictors only modestly reduces scores (≤0.8 Avg), because the ControlNet‑style side branch treats the projected cache as a soft geometric hint and the dynamic filter removes outliers before they enter memory. This demonstrates that Mirage’s latent spatial memory does not rely on a single, high‑precision depth estimator.

Geometric Foundations

Key geometric conventions linking world, camera, and latent spaces.

Extrinsics $E_t\in SE(3)$ describe the rigid transform that brings a world‑space point into the coordinate frame of camera $t$, while $E_t^{-1}$ performs the opposite mapping.

The camera intrinsics are captured by matrix $K$, whose diagonal entries are the focal lengths $(f_x,f_y)$ and whose last column encodes the principal point $(c_x,c_y)$ in the standard pinhole model.

The VAE processes images with a spatial stride $s=16$, so a pixel image of size $H\times W$ is downsampled to a latent map of size $h\times w=(H/s)\times(W/s)$.

Each latent cell is addressed by integer coordinates $(u,v)$ and corresponds to the pixel‑center homogeneous point $[u+\tfrac12,\;v+\tfrac12,\;1]^\top$.

To keep projection consistent after downsampling, we define a latent‑resolution intrinsic matrix $K_\ell$ by scaling the original $K$ along each axis by the ratios $w/W$ and $h/H$.
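The rescaling of $K$ is simple enough to check numerically. The sketch below assumes the pixel-centre convention stated above, under which both the focal lengths and the principal point scale by the same per-axis ratio; the function name `latent_intrinsics` and the example focal length are illustrative choices, not values from the paper.

```python
def latent_intrinsics(fx, fy, cx, cy, H, W, h, w):
    """Scale pixel-space pinhole intrinsics to an h x w latent grid.

    The x-axis terms (fx, cx) scale by w / W and the y-axis terms
    (fy, cy) by h / H.
    """
    sx, sy = w / W, h / H
    return fx * sx, fy * sy, cx * sx, cy * sy


# Example: a 704 x 1280 image and a 44 x 80 latent grid (stride 16).
k_latent = latent_intrinsics(fx=1000.0, fy=1000.0, cx=640.0, cy=352.0,
                             H=704, W=1280, h=44, w=80)
```

With a stride of 16 every entry shrinks by a factor of 16, so a principal point at the image centre maps to the centre of the latent grid.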

The back‑projection operator $\pi^{-1}$ takes a latent cell $(u,v)$, a depth value $d$, and maps it to a 3‑D world point by first unprojecting into camera space with $K_\ell^{-1}$, scaling by $d$, and finally applying the inverse extrinsic $E^{-1}$.

Conversely, $\pi_\ell$ projects a 3‑D point $q_i$ into the latent grid by applying $K_\ell$, normalising by the depth $[q_i]_z$, and flooring the resulting image coordinates to integer cell indices.

All memory points whose projected cell equals $(u,v)$ and lie in front of the camera form the candidate set $\Omega_t(u,v)$ used during readout.

The admissible cell set $\Lambda_t$ restricts attention to cells that have a finite, positive depth and are not covered by dynamic‑object or sky masks.
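A minimal sketch of the admissibility test follows. It assumes the depth map and both masks have already been brought to latent resolution and are given as nested lists; the function name `admissible_cells` is hypothetical.

```python
import math


def admissible_cells(depth, dynamic_mask, sky_mask):
    """Return the set of (u, v) cells allowed to enter the memory.

    A cell is admissible when its depth is finite and positive and it is
    covered by neither the dynamic-object mask nor the sky mask.
    """
    cells = set()
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            valid_depth = math.isfinite(d) and d > 0
            if valid_depth and not dynamic_mask[v][u] and not sky_mask[v][u]:
                cells.add((u, v))
    return cells


# Illustrative 2x2 example: one sky cell, one zero-depth cell, one dynamic cell.
depth = [[1.0, math.inf], [0.0, 2.0]]
dynamic = [[False, False], [False, True]]
sky = [[False, True], [False, False]]
allowed = admissible_cells(depth, dynamic, sky)
```

Only the top-left cell survives in this example: the others fail on infinite depth, zero depth, or a dynamic-object mask.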

Additional Experimental Analysis

Additional experiments explain the efficiency and behavior of latent spatial memory.

This appendix expands the analysis from Section 5, focusing on why latent spatial memory is computationally cheap, how its cost scales with longer rollouts, and what information the cache actually stores.

Reading an RGB cache costs $\Theta\big(N\log N + HW\big) + \Theta\big(\Phi_E(H,W)\big)$ per conditioning step, where $N$ is the number of cached points and $\Phi_E$ is the VAE encoder FLOP count; a latent cache costs only $\Theta\big(N\log N + hw\big)$ with $hw = HW/s^{2}$, because the encoder term disappears and the rasterisation term shrinks by the VAE stride squared.

The latent cache is smaller than an RGB cache by a factor of $s^{2}\!\cdot\!\big(3/C\big)$, i.e. it occupies a fraction $C/(3s^{2})$ of the memory. As the rollout horizon grows, the $N\log N$ sort term comes to dominate the latent cache much later, while the RGB baseline runs out of GPU memory on trajectories that Mirage completes comfortably.
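The factor $s^{2}\cdot(3/C)$ can be checked by counting stored feature values per frame. The sketch below does only that: it ignores point coordinates, data types, and allocator overhead, so it is not a reproduction of the measured 55× figure. The function name `feature_value_ratio` is hypothetical; the stride and channel count are the ones reported in the implementation details.

```python
def feature_value_ratio(H, W, s, C):
    """Ratio of RGB values to latent values stored for one frame.

    An RGB cache keeps 3 values per pixel; a latent cache keeps C values
    per latent cell, with (H / s) * (W / s) cells per frame.
    """
    rgb_values = 3 * H * W
    latent_values = C * (H // s) * (W // s)
    return rgb_values / latent_values


ratio = feature_value_ratio(H=704, W=1280, s=16, C=48)
```

With $s=16$ and $C=48$ the count comes out at 16 RGB values per latent value, matching $s^{2}\cdot 3/C = 256\cdot 3/48$.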

On a $257$‑frame rollout, rasterisation and the VAE encoder together account for over $98\%$ of the per‑step cost of the RGB pipeline; both stages are absent from Mirage’s conditioning loop because a single latent‑resolution projection replaces them.

Mirage calls the decoder only once per chunk to materialise output frames and to feed depth and segmentation modules, so the decoder never appears on the per‑step critical path.

Depth must be down‑sampled to match the latent grid resolution; we compare bilinear, nearest‑neighbour, area pooling, and median pooling, each affecting edge fidelity and hole formation differently.

Bilinear interpolation yields the lowest hole rate, so we adopt it as the default down‑sampling method.
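To make the difference between the four reducers concrete, the sketch below applies simplified versions of each to a tiny depth map with a sharp edge. These are illustrative stand-ins, not the implementations used in the experiments: "bilinear" here samples the centre of each block without antialiasing and assumes an even stride, and "nearest" takes the pixel just past the block centre.

```python
from statistics import mean, median


def downsample_depth(depth, s, mode):
    """Reduce an H x W depth map (nested lists) by an even stride s."""
    h, w = len(depth) // s, len(depth[0]) // s
    c = s // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [depth[i * s + a][j * s + b]
                     for a in range(s) for b in range(s)]
            if mode == "nearest":
                row.append(depth[i * s + c][j * s + c])
            elif mode == "area":
                row.append(mean(block))
            elif mode == "median":
                row.append(median(block))
            elif mode == "bilinear":
                # The block centre lies between the four central pixels.
                centre = [depth[i * s + a][j * s + b]
                          for a in (c - 1, c) for b in (c - 1, c)]
                row.append(mean(centre))
            else:
                raise ValueError(f"unknown mode: {mode}")
        out.append(row)
    return out


# A 2 x 4 depth map: near surface (1.0) on the left, far surface (9.0) on the right.
depth = [[1.0, 1.0, 1.0, 9.0],
         [1.0, 1.0, 9.0, 9.0]]
results = {m: downsample_depth(depth, 2, m)
           for m in ("nearest", "area", "median", "bilinear")}
```

In the block that straddles the edge, the averaging reducers return an intermediate depth of 7.0 while nearest and median snap to the far surface at 9.0, which is the kind of edge behaviour the comparison above is probing.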

**Figure 8.** Additional video comparison on a challenging indoor trajectory. Mirage maintains coherent layout over the full trajectory, whereas baselines suffer from view-dependent deformation, blur, and inconsistent scene reconstruction.

Projecting each latent token’s feature vector onto its top three principal components and coloring by the resulting RGB reveals coherent semantic clusters—walls, floors, windows, and furniture—that cannot be recovered from an RGB point cloud built on the same frames.

Implementation Details

Implementation specifics: backbone, conditioning branch, training pipeline, and rollout algorithm.

The backbone is the Wan2.2‑TI2V‑5B model; its VAE operates with a spatial stride $s=16$, temporal stride $4$, and $C=48$ channels. Each generation chunk occupies a $9\times44\times80$ latent tensor, which corresponds to $33$ RGB frames at $704\times1280$ resolution. The VAE remains frozen throughout inference.
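The chunk shape follows from the strides by simple arithmetic. The sketch below assumes the common causal video-VAE convention in which the first frame is encoded on its own and every further group of four frames adds one latent frame; that convention is an assumption here, chosen because it reproduces the stated $9\times44\times80$ shape. The function name `latent_chunk_shape` is hypothetical.

```python
def latent_chunk_shape(frames, height, width, s_spatial=16, s_temporal=4):
    """Return (latent_frames, latent_height, latent_width) for one chunk.

    ASSUMES the first frame maps to one latent frame and each further
    group of s_temporal frames adds one more.
    """
    latent_frames = 1 + (frames - 1) // s_temporal
    return latent_frames, height // s_spatial, width // s_spatial


chunk_shape = latent_chunk_shape(frames=33, height=704, width=1280)
```

For 33 frames at 704×1280 this gives 9 latent frames on a 44×80 grid, each cell carrying $C=48$ channels.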

The transformer stack uses a hidden size of $3072$, a feed‑forward dimension of $14336$, $24$ attention heads, and $30$ blocks, processing up to $512$ text tokens. RMS normalization is applied to queries/keys, and an additional cross‑attention normalization aligns visual and textual modalities.

A ControlNet‑style side branch injects the latent readout $\hat{z}_t$; it mirrors the VACE blocks of the backbone and reuses the same patch embedding, attaching at layers $\{0,4,8,12,16,20,24,28\}$ with a $48$‑channel input that matches the VAE latent, eliminating the need for a separate bridging encoder.

Segment‑aware rotary positional encoding labels each frame as a noisy target, a clean preceding frame, or a clean reference, all within a single forward pass; at inference the denoised latents of the previous chunk become the preceding frames for the next chunk.

Training proceeds in two flow‑matching stages. Stage 1 updates only the side branch at learning rate $10^{-5}$ while keeping the backbone and VAE frozen. Stage 2 unlocks rank‑$64$ LoRA adapters on the $q$, $k$, $v$, and $o$ projections of every self‑attention layer (dropout $0.05$) and jointly optimises them with the side branch at learning rate $10^{-4}$.

Optimization uses AdamW ($\beta=(0.0,0.999)$, weight decay $10^{-3}$) with a cosine learning‑rate schedule, bfloat16 mixed precision under FSDP sharding, gradient checkpointing, and a text‑dropout probability of $0.2$.

At inference the UniPC flow scheduler runs $40$ steps with classifier‑free guidance disabled; all measurements are reported on a single NVIDIA H100 GPU.

Training clips are drawn from RealEstate10K. Each clip is processed by a feed‑forward reconstructor for metric depth, intrinsics, and per‑frame extrinsics, then by Qwen3‑VL‑2B for entity extraction followed by SAM3 to obtain foreground‑dynamic and sky masks. Masked cells are excluded from $\Lambda_t$ (Eq. 6) so only geometry consistent with the rigid‑scene assumption enters the persistent memory.

All frames, latents, depth maps, and camera parameters are stored in an LMDB‑backed dataset, eliminating any re‑encoding during training. Rollouts are generated in chunks of nine latent frames with a one‑frame overlap to preserve temporal continuity; after each chunk the cache is expanded by re‑encoding the decoded frames and lifting them via Eq. 4.

**Algorithm 1** One rollout of Mirage.

**Require:** initial frame $I_0$; camera trajectory $\{(E_t, K_t)\}_{t=0}^T$ with $E_0$ fixed to the world frame

1: $z_0 \leftarrow \mathcal{E}(I_0)$
2: $D_0 \leftarrow \text{DEPTHANYTHING3}(I_0)$
3: $M_0 \leftarrow \text{SAM3}(\text{QWEN3-VL}(I_0)) \cup \text{SKY}(I_0)$
4: $\mathcal{M} \leftarrow \{(p_{uv}, f_{uv}) : (u, v) \in \Lambda_0\}$ via Eq. 4 on $(z_0, D_0, K_0, E_0)$
5: $\tau \leftarrow 0$; $\mathcal{O} \leftarrow \{I_0\}$ $\triangleright$ collected output frames
6: **while** $\tau < T$ **do**
7: $\quad$ sample latent chunk $W = \{\tau + 1, \dots, \tau + |W|\}$
8: $\quad$ **for** $t \in W$ **do** $\triangleright$ read at latent resolution
9: $\quad \quad \hat{z}_t, m_t \leftarrow$ readout of $\mathcal{M}$ at $(E_t, K_t)$ via Eq. 5
10: $\quad$ **end for**
11: $\quad \{z_t\}_{t \in W} \leftarrow \text{backbone}(\{\hat{z}_t, m_t\}_{t \in W}, \text{preceding}, \text{reference})$
12: $\quad$ **for** $t \in W$ **do** $\triangleright$ decode-and-re-encode update, once per chunk
13: $\quad \quad I_t \leftarrow \mathcal{D}(z_t)$; append $I_t$ to $\mathcal{O}$
14: $\quad \quad D_t \leftarrow \text{DEPTHANYTHING3}(I_t)$
15: $\quad \quad M_t \leftarrow \text{SAM3}(\text{QWEN3-VL}(I_t)) \cup \text{SKY}(I_t)$
16: $\quad \quad \tilde{z}_t \leftarrow \mathcal{E}(I_t)$
17: $\quad \quad \mathcal{M} \leftarrow \mathcal{M} \cup \{(p_{uv}, f_{uv}) : (u, v) \in \Lambda_t\}$ via Eq. 4 on $(\tilde{z}_t, D_t, K_t, E_t)$
18: $\quad$ **end for**
19: $\quad \tau \leftarrow \tau + |W|$
20: **end while**
21: **return** $\mathcal{O}$
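The control flow of Algorithm 1 can be sketched as a plain loop with the heavy components injected as callables. Everything here is illustrative: `rollout`, `readout`, `generate`, and `update` are hypothetical names standing in for the cache projection, the diffusion backbone plus decoder, and the depth/mask/re-encode/merge step. Clipping the final chunk to the end of the trajectory is an assumption, and the initial frame $I_0$ is left out of the returned list for brevity.

```python
def rollout(num_targets, chunk_len, readout, generate, update):
    """Run the readout -> generate -> update loop over target views 1..num_targets.

    readout(t)             -> conditioning for view t, read from the cache
    generate(chunk, conds) -> one output frame per view in the chunk
    update(t, frame)       -> write the new frame back into the cache
    """
    outputs = []
    tau = 0
    while tau < num_targets:
        # Views of the next chunk; the last chunk is clipped (assumption).
        chunk = list(range(tau + 1, min(tau + chunk_len, num_targets) + 1))
        # All readouts for the chunk happen before any update.
        conditions = [readout(t) for t in chunk]
        frames = generate(chunk, conditions)
        # The cache is extended once per chunk, after generation.
        for t, frame in zip(chunk, frames):
            outputs.append(frame)
            update(t, frame)
        tau += len(chunk)
    return outputs


# Stub components: the "cache" is a set of view indices already written.
cache = {0}
reads = []


def stub_readout(t):
    reads.append((t, len(cache)))  # record the cache size seen by view t
    return len(cache)


frames = rollout(
    num_targets=5,
    chunk_len=2,
    readout=stub_readout,
    generate=lambda chunk, conds: [f"frame{t}" for t in chunk],
    update=lambda t, frame: cache.add(t),
)
```

The recorded `reads` show that both views of a chunk see the same cache state and that the cache only grows between chunks, mirroring the once-per-chunk update in lines 12 to 18 of the algorithm.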
