Echo-Memory: A Controlled Study of Memory in Action World Models

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

A controlled study of memory mechanisms in action-conditioned world models, isolating storage, compression, and read-out.

Do current memory mechanisms in action-conditioned world models actually preserve semantic scene information, or do they merely optimize for local video smoothness?

Action world models often fail to maintain scene consistency: when a camera leaves a region and returns, objects silently change or vanish because the model lacks a reliable memory of the world. Echo-Memory fixes the backbone, training, and evaluation pipeline, varying only the memory representation to isolate how history is stored, compressed, and read by the generator. The study reveals that replay fidelity is a poor proxy for memory; block-wise state-space recurrence emerges as the strongest mechanism for semantic revisit, while raw context remains a stubborn capacity baseline.

Paper Primer

The paper treats memory as a common interface problem rather than a new architecture. It compares four families—raw context, compression, spatial summaries, and state-space recurrence—under a unified video diffusion-transformer backbone to separate the effects of capacity, compression, read-out, and recurrence.

Replay metrics (PSNR/SSIM) are insufficient for measuring world-model memory.

Models with high replay fidelity often fail to preserve object identity during open-domain return probes, while models with lower replay scores (like block-wise state-space) perform better at semantic revisit. Block-wise state-space recurrence achieves an open-domain VLM score of 69.00, significantly outperforming the I2V baseline (12.25) and spatial memory (6.00).

Raw context is a strong capacity baseline for semantic return.

Increasing context length from K=1 to K=20 improves open-domain VLM return scores from 12.25 to 58.63, with most gains realized by K=5. The gain in semantic revisit consistency from raw context is substantially larger than the corresponding improvement in low-level pixel replay metrics.

Why does the paper use a three-branch evaluation protocol instead of standard video metrics?

Standard metrics like PSNR only measure local trajectory following (replay). The paper introduces in-domain and open-domain return probes to specifically test if the model can preserve object identity and scene geometry after the camera leaves and returns to a view.

What is the primary limitation of spatial memory summaries in this study?

Spatial summaries are efficient but often under-specify object identity. The study finds that even when storage is compact, the generator often lacks an explicit read-out path to retrieve the specific object evidence needed for revisit.

The Memory Gap in World Models

We expose the memory gap in action‑conditioned video models and outline our controlled study.

Action world models must generate video chunks conditioned on a first frame, a text prompt, and a sequence of camera actions, while preserving geometry, object identity, and camera obedience across revisits. In practice, the camera may return to the starting pose yet the scene silently changes, salient objects are replaced, or chunk boundaries erase accumulated context—these are memory failures, not generic synthesis errors. Existing memory designs (Context, Compression, Spatial, State‑space) are hard to compare because reported gains are entangled with backbone, training, and evaluation differences, motivating a controlled study that isolates the memory mechanism.

An action world model predicts future video frames given an initial observation, a textual description, and a planned camera‑action trajectory, treating the video generation as a sequential rollout conditioned on explicit actions.

**Figure 1.** Abstract teaser and workflow of Echo-Memory. Given a text description, historical observations, and the camera/action state, an action world model must generate chunk-wise video while carrying memory across revisits. The figure positions Context, Compression, Spatial, and State-Space families as representative designs for preserving a revisitable world.

The disconnect between local video synthesis and long‑term scene retention is the core challenge this work isolates.

Memory Families and Architectures

Defines the four memory families and shows how they plug into a fixed video diffusion backbone.

The method section formalizes a clean factorization of memory for action world models, separating the choice of memory family from the underlying diffusion backbone.

The four memory families (Context, Compression, Spatial, and State‑Space) are distinct ways to store past observations; each defines what information is kept and how it is accessed during generation.

How do “memory families” differ from the generic “memory mechanisms” discussed in related work?

Memory families are a taxonomy of *where* and *how* past information is stored (raw tokens, compressed tokens, geometry, or a recurrent state). Generic mechanisms in related work describe *what* cue is used to select memory (e.g., temporal proximity or geometry) but do not prescribe the storage format. The families therefore fix the representation, making it possible to isolate the effect of storage on replay fidelity versus semantic consistency.

Encode each target frame $x_t$ with a VAE to obtain a latent token group $z_t$.

Sample $K$ historical observations $\{(x_{f_k},p_{f_k})\}$ using the retriever $R$; with probability $0.1$ drop all retrieved frames and keep only the anchor $x_s$.

Form the context set $C_{s:e}=\{(x_{f_k},p_{f_k})\}_{k=0}^{K-1}$ and convert it to latent tokens $c_{\text{ctx}}$ according to the chosen memory family.

Feed $c_{\text{ctx}}$, the text embedding $c_{\text{text}}$, and the action sequence $c_{\text{act}}$ into the diffusion transformer $v_\theta$ to predict the velocity field $v_\theta(z_t;\dots)$.

Integrate the predicted velocity to obtain the next latent $z_{t+1}$, then decode with the VAE decoder to produce the output frame.
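The integration step above can be sketched with a fixed-step Euler solver. This is a minimal illustration, not the paper's sampler: the solver choice, the step count, and the toy constant velocity standing in for $v_\theta$ are all assumptions, and the conditioning inputs are omitted.

```python
import numpy as np

def euler_integrate(velocity_fn, z_init, num_steps=10):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 to t=1 with fixed Euler steps."""
    z = np.asarray(z_init, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)
    return z

# Hypothetical stand-in for the learned velocity field: a constant velocity
# that points from the starting latent (noise) to a known target latent.
noise = np.zeros(4)
target = np.array([1.0, -2.0, 0.5, 3.0])

def toy_velocity(z, t):
    return target - noise

z_final = euler_integrate(toy_velocity, noise, num_steps=8)
```

With a constant velocity the Euler rollout lands on the target latent; a learned velocity field would generally need more steps or a higher-order solver.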

**Figure 2.** Overview of four representative approaches to memory in action world models. Under a shared video diffusion backbone and a shared camera-action interface, the approaches differ only in how historical information is stored and read. We compare four families throughout: Context, Compression, Spatial, and State-Space.

Implementation Variants

Describes the experimental variants, shared defaults, and the unified memory interface.

The section enumerates the concrete variant families, the ablation dimensions, and the shared hyper‑parameters that make a controlled comparison possible.

Pick a memory profile: choose a storage type (e.g., Context), set context length $K$, select a compression rule (if any), decide the read‑out path, and optionally enable a recurrence structure.

Instantiate the Video DiT backbone with the shared hyper‑parameters (resolution $352\times640$, segment length 81 frames, optimizer AdamW, learning rate $5\times10^{-5}$, 8 A100‑80G GPUs, total 5 k steps).

Apply the default conditioning (relative‑RT camera encoding, flow‑matching timestep shift) and enable target‑frame‑only supervision.

Run the training loop, which uses the unified sampler and evaluation pipeline that query the memory via the same interface for every variant.

After training, evaluate each variant on Replay Fidelity (PSNR, SSIM) and Semantic Revisit Consistency, recording the numbers in Table 1.
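One way to picture the steps above is as a small configuration object that changes per variant, paired with defaults that never change. The sketch below is hypothetical: the class names, field names, and string values are illustrative and not taken from the paper's code; only the numeric defaults come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# The four storage families named in the study.
FAMILIES = ("context", "compression", "spatial", "state_space")

@dataclass(frozen=True)
class SharedDefaults:
    """Settings held fixed across every variant (values from the text)."""
    resolution: tuple = (352, 640)
    segment_frames: int = 81
    optimizer: str = "AdamW"
    learning_rate: float = 5e-5
    num_gpus: int = 8
    total_steps: int = 5000

@dataclass(frozen=True)
class MemoryProfile:
    """The only part that varies between runs (field names are hypothetical)."""
    storage: str
    context_length: int
    compression: Optional[str] = None   # e.g. a pooling rule, if any
    readout: str = "self_attention"     # hypothetical read-out label
    recurrence: Optional[str] = None    # e.g. "blockwise", if enabled

    def __post_init__(self):
        if self.storage not in FAMILIES:
            raise ValueError(f"unknown storage family: {self.storage}")
        if self.context_length < 1:
            raise ValueError("context_length must be at least 1")

defaults = SharedDefaults()
raw_context = MemoryProfile(storage="context", context_length=20)
blockwise = MemoryProfile(storage="state_space", context_length=5, recurrence="blockwise")
```

Keeping the shared defaults frozen and separate makes it harder to change a backbone or optimizer setting by accident while sweeping memory profiles.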

Step 1: The sampler draws an 81‑frame video segment and stores the first 5 frames in the compression buffer.

Step 2: Compression reduces the 5‑frame buffer to a single latent vector via average pooling.

Step 3: During generation, the model queries the compressed vector at each target frame, producing a reconstruction.

Step 4: PSNR is computed on the reconstructed frames (e.g., 28.7 dB) and SSIM on structural similarity (0.91).

Step 5: The same trial is repeated with $K=20$; the larger context yields higher Replay Fidelity (30.2 dB) but lower Semantic Revisit Consistency because the compression discards fine‑grained object identity.

This concrete run shows how increasing context length improves pixel‑level fidelity while the compression rule can hurt semantic consistency, illustrating the trade‑off the ablations explore.
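The pooling in Step 2 can be sketched in a few lines. This is an illustration under assumptions: each frame is reduced to a flat latent vector, and the helper name `average_pool_buffer` is hypothetical.

```python
import numpy as np

def average_pool_buffer(frame_latents):
    """Collapse a (num_frames, dim) buffer into a single (dim,) vector by averaging."""
    frame_latents = np.asarray(frame_latents, dtype=np.float64)
    if frame_latents.ndim != 2:
        raise ValueError("expected an array of shape (num_frames, dim)")
    return frame_latents.mean(axis=0)

# Toy buffer: 5 frames, each with a 3-dimensional latent.
buffer = np.arange(15).reshape(5, 3)
pooled = average_pool_buffer(buffer)  # -> [6., 7., 8.]

# Shuffling the frame order gives the same pooled vector, so the summary
# cannot tell which frame contributed which evidence.
shuffled = buffer[[4, 2, 0, 3, 1]]
pooled_shuffled = average_pool_buffer(shuffled)
```

The order invariance is one simple way to see how a pooled summary can lose the per-frame detail that object identity depends on.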

The table compares the performance of different judges (Qwen3-VL-30B-A3B, Claude Opus 4.6, GPT-5.5, and Human) based on their mean scores, the delta ($\Delta$) compared to Qwen, and the correlation ($\rho$) compared to Qwen.
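The three columns described above (mean, $\Delta$, and $\rho$ against a reference judge) could be computed as in the sketch below. The judge names and scores are hypothetical placeholders, and treating $\rho$ as a Spearman rank correlation is an assumption; the text does not say which correlation is used.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def compare_to_reference(scores, reference):
    """Return mean, delta-to-reference, and rho-to-reference for each judge."""
    ref = np.asarray(scores[reference], dtype=np.float64)
    summary = {}
    for name, vals in scores.items():
        vals = np.asarray(vals, dtype=np.float64)
        summary[name] = {
            "mean": float(vals.mean()),
            "delta": float(vals.mean() - ref.mean()),
            "rho": spearman_rho(vals, ref),
        }
    return summary

# Hypothetical per-video scores from two judges on the same four videos.
scores = {
    "reference_judge": [10.0, 40.0, 70.0, 90.0],
    "other_judge": [20.0, 35.0, 80.0, 85.0],
}
summary = compare_to_reference(scores, "reference_judge")
```

In this toy case the second judge scores higher on average but orders the videos identically, so the delta is nonzero while the rank correlation is 1.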

Evaluation Protocol

Standardized training protocol and a three‑branch evaluation suite quantify replay fidelity and semantic revisit consistency.

The experiments follow a single, tightly controlled training pipeline; only the memory profile differs across variants. Evaluation splits this pipeline into three complementary branches that separately probe pixel‑level replay quality and higher‑level semantic consistency.

Replay Fidelity quantifies how faithfully a model reproduces the exact pixel content of a long‑horizon video when following the ground‑truth camera trajectory.

Is Replay Fidelity just a PSNR measurement?

No. While PSNR captures average pixel error, Replay Fidelity also incorporates SSIM (structural similarity) and LPIPS (perceptual similarity) and is evaluated on multi‑chunk sequences, exposing drift that PSNR alone would miss.
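To make the multi-chunk point concrete, the sketch below computes PSNR per chunk rather than once over the whole video. PSNR follows its standard definition; the chunking helper, the pixel range of [0, 1], and the toy data are assumptions for illustration.

```python
import numpy as np

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB for arrays scaled to [0, max_value]."""
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def per_chunk_psnr(reference_video, generated_video, chunk_len):
    """PSNR for each consecutive chunk of frames along the first axis."""
    reference_video = np.asarray(reference_video, dtype=np.float64)
    generated_video = np.asarray(generated_video, dtype=np.float64)
    return [
        psnr(reference_video[s:s + chunk_len], generated_video[s:s + chunk_len])
        for s in range(0, len(reference_video), chunk_len)
    ]

# Toy video: 4 frames of 2x2 pixels; error grows in the second chunk.
reference_video = np.zeros((4, 2, 2))
generated_video = np.zeros((4, 2, 2))
generated_video[:2] += 0.1   # first chunk: small uniform error
generated_video[2:] += 0.2   # second chunk: larger uniform error

chunk_scores = per_chunk_psnr(reference_video, generated_video, chunk_len=2)
```

Reporting the per-chunk list instead of a single average is what lets a late-chunk drop show up as drift.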

Semantic Revisit Consistency measures whether a model preserves object identity and overall scene layout after the camera leaves a location and later returns.

How does Semantic Revisit Consistency differ from standard pixel metrics?

Pixel metrics compare raw colors and can miss high‑level changes; Semantic Revisit Consistency uses a vision‑language model to judge whether the same object, viewpoint, and scene semantics survive the leave‑and‑return trajectory, capturing identity and layout preservation.

Initialize the backbone, optimizer, learning‑rate schedule, data sampler, and action representation; fix the evaluation path.

Load a batch from the Context‑as‑Memory dataset: an $81$‑frame segment at $352\times640$ resolution.

Encode each frame’s camera motion as a $12$‑dimensional relative $RT$ vector referenced to the first frame.

Retrieve context frames using the fixed field‑of‑view policy.

Construct actions from the $RT$ vectors via the same $RT$‑relative constructor used at test time.

Apply the chosen memory profile (e.g., compressed hash, recurrent slot, or raw context).

Compute the scalar denoising loss on the generated frames.

Sample a fixed replay case at a predetermined interval; run the full generation path (retriever → action constructor → memory) and record visual diagnostics (boundary continuity, identity drift, action‑visual alignment).

Back‑propagate the combined loss (denoising + diagnostic) and update model parameters.
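The relative camera encoding used in the steps above can be sketched as follows. The layout is an assumption: poses are taken to be 4x4 camera-to-world matrices, and the 12 dimensions are read as the flattened 3x4 block (rotation plus translation) of the pose relative to the first frame. The function name is hypothetical.

```python
import numpy as np

def relative_rt(pose_first, pose_t):
    """Express pose_t in the first frame's coordinates and flatten [R | t] to 12 values."""
    pose_first = np.asarray(pose_first, dtype=np.float64)
    pose_t = np.asarray(pose_t, dtype=np.float64)
    relative = np.linalg.inv(pose_first) @ pose_t
    return relative[:3, :].reshape(-1)

def translation_pose(x, y, z):
    """Helper: a 4x4 pose with identity rotation and the given translation."""
    pose = np.eye(4)
    pose[:3, 3] = [x, y, z]
    return pose

first = translation_pose(1.0, 0.0, 0.0)
later = translation_pose(3.0, 0.0, 0.0)

anchor_vec = relative_rt(first, first)  # identity rotation, zero translation
later_vec = relative_rt(first, later)   # translation of 2 along x
```

Referencing every pose to the first frame means the anchor frame always encodes as the identity, regardless of where the segment sits in world coordinates.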

Context retrieval selects the nearest $5$ frames by field‑of‑view.

Each $RT$ vector is fed to the action constructor, producing $5$ action embeddings.

The compressed hash maps each embedding to a bucket; collisions are resolved by linear probing.

The model generates the next frame and computes PSNR = 28.3, SSIM = 0.92, LPIPS = 0.14 for the first chunk.

The replay diagnostic samples chunk 3 and finds a drop to PSNR = 24.7, indicating identity drift.

This concrete trace shows how the replay diagnostic catches a fidelity drop early, even though the overall denoising loss remains low.
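The retrieval step at the start of this trace can be approximated with a simple overlap score. This is a simplified sketch, not the study's retriever: it assumes each frame is described only by a yaw angle in degrees and a shared horizontal field of view, and both function names are hypothetical.

```python
def fov_overlap(yaw_a, yaw_b, fov_deg=90.0):
    """Fraction of the horizontal field of view shared by two yaw angles (degrees)."""
    # Wrap the angular difference into [0, 180].
    diff = abs((yaw_a - yaw_b + 180.0) % 360.0 - 180.0)
    return max(0.0, fov_deg - diff) / fov_deg

def retrieve_by_fov(history_yaws, target_yaw, k=5, fov_deg=90.0):
    """Indices of up to k history frames with the largest non-zero overlap."""
    scored = [
        (fov_overlap(yaw, target_yaw, fov_deg), idx)
        for idx, yaw in enumerate(history_yaws)
    ]
    # Highest overlap first; ties broken by the earlier frame index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for overlap, idx in scored[:k] if overlap > 0.0]

history_yaws = [0.0, 10.0, 45.0, 120.0, 350.0, 200.0, 5.0]
selected = retrieve_by_fov(history_yaws, target_yaw=0.0, k=5)
```

Frames looking away from the target view (yaw 120 and 200 here) get zero overlap and are never selected, even if fewer than k candidates remain.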

**Figure 4.** Evaluation taxonomy used in the study. Replay measures long-horizon image quality under GT camera motion. In-domain and open-domain return probes measure whether visual evidence survives a leave-and-return trajectory.

**Figure 5.** Qwen Edit construction of open-domain first frames. Each held-out environment prompt is edited into identity-anchored first frames, which seed the open-domain return probe.

**Figure 6.** Open-domain revisit source panel. The 2x4 grid shows representative first-frame sources from the Qwen-edit open-domain pool. The full pool is generated by editing distinctive toy-like objects into held-out game-style environments; a practical construction uses 20 scene prompts and 8 edited variants per prompt. These objects serve as deliberately simple identity anchors for the VLM judge: after the camera leaves and returns, the judge scores whether the same object appearance, subject presence, camera viewpoint, and background scene are preserved.
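The pool construction and the four judged criteria in this caption can be sketched as below. The pool size follows from the stated 20 prompts and 8 variants; the unweighted average over criteria and the 0 to 100 scale are assumptions, since the text does not specify how the judge's criteria are combined.

```python
from itertools import product

# The four aspects the caption says the VLM judge scores.
CRITERIA = (
    "object_appearance",
    "subject_presence",
    "camera_viewpoint",
    "background_scene",
)

def build_pool(num_prompts=20, variants_per_prompt=8):
    """Enumerate (prompt_id, variant_id) pairs for the open-domain first frames."""
    return list(product(range(num_prompts), range(variants_per_prompt)))

def aggregate_return_score(judgements):
    """Average the criteria within each case, then average across cases."""
    per_case = [
        sum(case[criterion] for criterion in CRITERIA) / len(CRITERIA)
        for case in judgements
    ]
    return sum(per_case) / len(per_case)

pool = build_pool()  # 20 prompts x 8 variants = 160 first frames

# Hypothetical judge outputs: one case fully preserved, one fully lost.
judgements = [
    {criterion: 100.0 for criterion in CRITERIA},
    {criterion: 0.0 for criterion in CRITERIA},
]
score = aggregate_return_score(judgements)
```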

The Replay-Memory Divergence

Key results comparing memory families on replay fidelity and semantic consistency.

Recall that action world models can produce locally realistic video yet lose object identity when revisiting a location. This section quantifies that failure across memory mechanisms.

Raw Context with K=20 attains the highest semantic revisit consistency (R‑S) among all memory families.

Table 3 shows R‑S = 0.449 for K=20, exceeding the next best 0.411 (Spatial Memory).

**Figure 3.** Replay progression on a fixed GT camera trajectory. The diagnostic samples compare generated multi-segment replay against a dataset trajectory at matched time indices. The panel reveals where background structure, object layout, and boundary continuity begin to drift before the final revisit probes are run.

High replay fidelity is a health signal, not a semantic memory score.

Comparing Memory Families

Open-domain return sharply separates memory families, exposing replay’s limits.

Open‑domain return separates memory families far more than replay fidelity.

Open‑domain VLM scores span a wide range, from 6.00 (Spatial Memory) and 12.25 (I2V baseline) up to 58.63 (raw Context, K = 20) and 69.00 (block‑wise State‑Space), while Replay PSNR varies only modestly across families.

Spatial Read-out and Compression

Spatial summaries are fast but still fail to preserve object identity without explicit read‑out.

Spatial summaries are efficient but not yet reliable as semantic memory.

Dedicated cross‑attention reaches 17.12 on open‑domain VLM, yet remains far below raw Context and block‑wise State‑Space; the inject‑none row achieves the highest Replay score despite withholding stored tokens, showing that replay gains can stem from regularization rather than usable memory.

Training settings shared across variants (only context length and memory module vary):

| Setting | Value |
| --- | --- |
| Backbone | Video DiT |
| Resolution | 352 x 640 |
| Segment length | 81 frames |
| Context length | varies |
| Memory module | varies |
| Optimizer | AdamW |
| Learning rate | 5x10^-5 |
| Batch / grad accumulation | 1 / 1 |
| GPUs | 8 A100-80G |
| Total steps | 5k |
| Timestep shift | 15 |
| Spike rejection threshold | 15.0 |
| Target-frame-only supervision | enabled |
| Flow noise shift | 1.0 |
| Relative-RT action encoding | enabled |
| Context placement | suffix |
| 10% overlap-drop policy | enabled |

State-Space Memory Performance

Ablation compares five state-space mechanisms on seven metrics, highlighting trade‑offs.

This table evaluates five state‑space variants on seven quality and consistency metrics, revealing distinct strengths.

Raw Context and Implicit Memory

Experimental variations of memory mechanisms and their impact on replay fidelity and open‑domain performance.

We evaluate a suite of memory designs by swapping the core state module while keeping all other training settings fixed. Each variant is assessed on two axes: pixel‑level replay fidelity (PSNR, SSIM) and the open‑domain return score assigned by the VLM judge.

Raw Context is the uncompressed, full‑length token stream that the model ingests unchanged, preserving every frame’s visual evidence for downstream generation.

Select a memory family (block‑wise recurrence, hybrid row, weight‑only compression, length compression with ratio $r$, or raw context).

Fix the context length $K$ (evaluate $K\!=\!5$ and $K\!=\!20$ for raw context; other methods inherit the same $K$).

Train the diffusion transformer on the same video dataset with identical optimizer, learning‑rate schedule, and batch size.

After convergence, run two evaluation suites: (a) replay on in‑domain trajectories, measuring PSNR and SSIM; (b) open‑domain return probes, scored by the VLM judge.

Record the replay cost (GPU‑hours) and the open‑domain score for each configuration.

Raw context stores frames $f_1\ldots f_5$ verbatim; the generator can attend to any $f_i$ directly.

Block‑wise recurrence first compresses $f_1\ldots f_5$ into a hidden state $h$, updates $h$ sequentially, and then conditions generation on $h$.

During replay, raw context yields PSNR = 31.2 dB, SSIM = 0.92; block‑wise recurrence yields PSNR = 29.8 dB, SSIM = 0.88 but improves open‑domain VLM score by +4.3 %.

This illustrates the trade‑off: preserving every pixel boosts local fidelity, while a structured recurrent state sacrifices some detail but supplies the semantic scaffolding needed for open‑domain tasks.

Scaling and Recurrence

Scaling experiments expose distinct trade‑offs between replay fidelity, semantic return, and compute cost.

Scaling experiments reveal that increasing raw context length boosts open‑domain semantic return far more than replay fidelity, while compact mechanisms trade replay efficiency for semantic consistency.

In block‑wise recurrence, the model updates only a fixed‑size block of its recurrent state each step, keeping long‑range evidence without expanding the full hidden vector.
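A toy version of a block-wise state update is sketched below. Only the idea of touching one fixed-size block per step comes from the text; the round-robin block choice, the exponential-moving-average update rule, and the function name are assumptions made for illustration.

```python
import numpy as np

def blockwise_update(state, observation, step, num_blocks, decay=0.5):
    """Update one block of the state (chosen round-robin) and leave the rest untouched."""
    state = np.asarray(state, dtype=np.float64).copy()
    observation = np.asarray(observation, dtype=np.float64)
    if state.size % num_blocks != 0:
        raise ValueError("state size must be divisible by num_blocks")
    block_size = state.size // num_blocks
    if observation.shape != (block_size,):
        raise ValueError("observation must match the block size")
    block = step % num_blocks
    span = slice(block * block_size, (block + 1) * block_size)
    # Assumed update rule: blend the old block with the new observation.
    state[span] = decay * state[span] + (1.0 - decay) * observation
    return state

state = np.zeros(6)  # three blocks of two values each
state_after_step0 = blockwise_update(state, [1.0, 1.0], step=0, num_blocks=3)
state_after_step1 = blockwise_update(state_after_step0, [2.0, 2.0], step=1, num_blocks=3)
```

Because step 1 writes only to the second block, the evidence written at step 0 survives unchanged, which is the property the paragraph above describes.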

Block‑wise recurrence delivers the largest open‑domain gain among compact memories, surpassing raw Context K=20 despite a modest replay‑quality penalty.

**Figure 9. Context scaling and training efficiency.** Left: increasing raw context improves open-domain semantic return much more than the replay image bundle. Right: normalized replay PSNR versus normalized GPU-hour per step illustrates that replay efficiency and semantic memory are different optimization targets.

Design Implications

Discussion of design trade‑offs and future directions for memory in action world models.

The controlled matrix acts as a concrete checklist for future memory modules. It insists that any new mechanism be benchmarked against raw context across several K values. This prevents the weak claim of merely beating an anchor‑only I2V baseline.

The central question is whether a method can keep the semantic return benefit of raw context while using fewer tokens, less compute, or a more stable internal state. Answering this requires explicit measurement of both replay fidelity and semantic revisit consistency. Only then can efficiency gains be trusted.

Write capacity and read‑out capacity should be reported as separate metrics. Spatial Memory demonstrates that a compact scene summary can be stored, yet the right evidence may be unavailable to the generator at return time.

Compression quality must be judged by the evidence it preserves, not merely by how many tokens it removes. A good compressor for world models is therefore a selective memory that retains views and objects likely to become load‑bearing later.

The current single‑stage objective supervises target‑frame denoising, while revisit consistency is measured only after generation. This mismatch explains why replay can improve without semantic return improving. Adding return‑aware auxiliary supervision (object‑level retention losses, contrastive alignment, or VLM‑guided selection) is a candidate way to close the gap.

Training should also expose recurrent states to explicit object or view prediction, pressuring the state to encode evidence needed after the camera leaves visible support. Even as video backbones become stronger, the evaluation must remain multi‑branch to keep the distinction between following and remembering.

Stronger generators can hallucinate plausible replacements for missing objects, making qualitative failures harder to spot without a return‑specific semantic check. The three‑branch protocol therefore serves as a guardrail for future systems that may look visually convincing yet still lose world consistency.

Echo‑Memory isolates memory and context effects by fixing backbone, optimizer, sampler, and evaluation protocol. The resulting matrix cleanly separates a model’s ability to follow camera actions from its ability to preserve the underlying world.

Three lessons emerge. First, raw context is a strong baseline: increasing context improves open‑domain revisit more reliably than it improves replay metrics. Second, compactness alone is insufficient; Spatial Memory and hybrid compression are efficient but lose identity evidence without object‑aware retention and read‑out. Third, implicit memory depends on structure; block‑wise recurrence preserves open‑domain return far better than lightweight recurrent smoothing.

Replay quality and semantic revisit quality are not monotonically aligned. Selecting models by replay alone would favor families that fail the return‑specific semantic check.

The study is bounded: it trains on a single dataset, so rankings may shift with noisier poses, different camera statistics, larger compute, or multi‑stage curricula. Open‑domain scores rely on a VLM judge; broader human calibration would strengthen the comparison. Finally, revisit quality is not yet a cheap training‑time signal, so model selection still leans on imperfect proxies.

Future work should make memory design explicitly revisit‑aware: compression must preserve high‑value object evidence, spatial summaries should expose object‑level read‑out to the generator, and structured recurrence should be evaluated as a first‑class design axis.

Design memory modules that keep semantic evidence for future returns, report write/read capacities separately, and evaluate with a three‑branch protocol.

Technical Appendix

Supplementary details, tables, and methodological notes for the study.

This appendix collects the auxiliary tables, full training protocol, and additional methodological notes that support the main analysis.

The study builds on a diffusion‑Transformer video backbone (DiT) that encodes each frame as a latent token and processes them with self‑attention, allowing per‑frame action signals to be injected at configurable points.

Camera actions are represented as 12‑dimensional relative‑RT poses; the injection site (post‑attention, pre‑norm, etc.) is treated as a design variable because small plumbing choices can dominate overall performance.

Memory mechanisms fall into four families—Context, Compression, Spatial, and State‑Space—each differing in how they store, compress, or recurrently propagate visual evidence for long‑horizon generation.

Evaluation combines low‑level image metrics (PSNR, SSIM, LPIPS) with semantic revisit metrics (in‑domain loop closure and open‑domain VLM scores) to separate replay fidelity from memory retention.

Echo‑Memory does not introduce a new storage module; instead it provides a common interface, a matched ablation protocol, and a three‑branch evaluation stack to make interactions among storage, compression, read‑out, and recurrence observable.

All variants share a fixed training budget: 5 k steps on 8 × A100‑80G GPUs, identical optimizer settings, and the same video DiT backbone; only the memory module changes.

The Context‑as‑Memory dataset supplies 81‑frame segments at 352 × 640 resolution, each paired with a 12‑dim camera trajectory and a buffer of K − 1 historical frames retrieved by field‑of‑view overlap.

Training proceeds in a single stage; multi‑stage curricula were abandoned after pilot experiments showed no consistent benefit.

Replay sampling is run periodically during training using a fixed pool of held‑out first frames; the same generation pipeline as evaluation is used, making the sampled videos a direct preview of final performance.

Protocol invariants enforce identical context handling, noise levels, K value, target‑frame‑only supervision, and camera‑action parameters between training and inference.

The empirical analysis is organized around three axes—replay fidelity, in‑domain revisit, and open‑domain VLM—so that disagreements across columns reveal which memory design solves which sub‑problem.

Memory‑family results show that Spatial Memory excels at replay PSNR, raw Context excels at replay SSIM/LPIPS, and block‑wise State‑Space dominates open‑domain VLM, highlighting the non‑monotonic relationship between low‑level fidelity and semantic memory.

Spatial read‑out ablations demonstrate that withholding stored tokens can improve replay metrics, indicating that the spatial summary acts more as a regulariser than a true memory.

Compression ablations reveal that pure length reduction (r = 4) outperforms weight‑only compression on open‑domain VLM, while hybrid combinations degrade performance because pooling discards information before weighting can act.

State‑Space comparisons show that a legacy hybrid recurrence yields strong replay and in‑domain scores, whereas block‑wise recurrence provides the best open‑domain VLM, confirming that recurrence design matters for semantic return.

Increasing raw context length K improves both replay and open‑domain VLM, but the VLM gain is far larger (≈ 46 points from K = 1 to K = 20), underscoring that revisit consistency scales differently from pixel‑level reconstruction.
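The size of this gap can be checked directly from the open-domain VLM scores reported in the primer. The dictionary keys below are illustrative labels; the three values are the ones stated in the text.

```python
# Open-domain VLM scores quoted in the text.
open_domain_vlm = {
    "context_k1": 12.25,             # I2V baseline (anchor only)
    "context_k20": 58.63,            # raw Context, K = 20
    "blockwise_state_space": 69.00,  # block-wise State-Space
}

# Gain from lengthening raw context from K = 1 to K = 20.
context_gain = round(
    open_domain_vlm["context_k20"] - open_domain_vlm["context_k1"], 2
)  # 46.38

# Margin of block-wise State-Space over the longest raw context.
blockwise_margin = round(
    open_domain_vlm["blockwise_state_space"] - open_domain_vlm["context_k20"], 2
)  # 10.37
```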

The quality–time trade‑off panel plots replay PSNR against per‑step GPU‑hour cost, showing that Spatial Memory is efficient for low‑level reconstruction, while block‑wise State‑Space trades efficiency for superior semantic return.

Cross‑cutting observations synthesize three takeaways: (A) replay and revisit quality often diverge; (B) storage and read‑out are independent design knobs; (C) compact mechanisms (Spatial, hybrid compression) sacrifice open‑domain VLM, whereas raw context and block‑wise recurrence retain it.
