minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu

An open-source framework to convert bidirectional video diffusion models into real-time, camera-controllable world models.

How can we transform high-latency, bidirectional video diffusion models into low-latency, camera-controllable autoregressive world models?

Video foundation models generate high-quality clips, but they are too slow and lack the causal control required for real-time interactive applications. minWM provides a modular pipeline that fine-tunes these models for camera control and then distills them into few-step autoregressive generators. This approach reduces first-frame latency by over 200× compared to the original bidirectional models, enabling interactive performance; the reported measurements use A800 GPUs.

Paper Primer

The framework operates in two phases: first, it injects camera-control signals into the model's self-attention layers using projective transformations; second, it uses a multi-stage distillation process to transform the model into a low-latency autoregressive generator.

minWM reduces first-frame latency by more than 200× while maintaining camera controllability.

A comparison of first-frame generation time between the multi-step bidirectional baselines and the distilled few-step autoregressive models on A800 GPUs shows a 223.75× speedup for HY1.5 and a 236.64× speedup for Wan2.1.

The distillation process relies on three stages: teacher-forcing autoregressive training, causal consistency distillation to initialize the few-step model, and asymmetric distribution matching to align the student with the high-quality bidirectional teacher.

Why is a full-stack framework necessary instead of just using existing video foundation models?

Existing foundation models are bidirectional and offline, making them unsuitable for real-time interaction. minWM automates the complex, multi-stage pipeline (data construction, controllable fine-tuning, and distillation) required to make these models causal and low-latency.

What is the primary constraint on the quality of the camera control?

The authors observe that ground-truth camera trajectories are essential; models trained on perception-estimated poses from existing datasets failed to achieve reliable control, necessitating the use of reconstructed scenes or synthetic trajectories.

Researchers can now use minWM as a standardized recipe to adapt diverse video backbones into interactive world models, with clear guidance on the batch sizes and training steps required for stable camera control.

Motivation and Problem Framing

Video diffusion models are high‑quality but far too slow for real‑time interactive use.

Current video diffusion foundations excel at generating high‑fidelity clips, yet each denoising step must materialize the entire video tensor, making rollout prohibitively slow. Interactive world models demand causal, camera‑controlled generation with sub‑second latency, a requirement that existing bidirectional pipelines cannot meet. minWM is introduced as a full‑stack solution that bridges this gap by distilling bidirectional diffusion into few‑step autoregressive generators.

Bidirectional diffusion denoises all frames of a video jointly: every frame attends to both earlier and later frames, and every denoising step touches every frame. This yields high quality but forces the model to keep the full video tensor in memory throughout the process, and no frame is finished until the whole clip is.

HY1.5‑TI2V‑8B and Wan2.1‑T2V‑1.3B are publicly released video diffusion backbones; the former accepts both text and an image as conditioning, while the latter relies on cross‑attention to inject textual prompts.

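As a small worked example, a clip of $N=8$ frames at $32\times32$ with 4 channels holds $8\times32\times32\times4 = 32{,}768$ floats, or about 131 KB at 4 bytes per float. If the tensor is retained for each of $S=10$ steps, the total is about 1.3 MB.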

If we increase resolution to $256\times256$ while keeping $N=8$, each step requires $8\times256\times256\times4 = 2{,}097{,}152$ floats (≈ 8 MB), so $10$ steps need ≈ 80 MB.

Real‑world models often use $N=64$ frames at $256\times256$, leading to $64\times256\times256\times4 = 16{,}777{,}216$ floats per step (≈ 64 MB); $10$ steps would exceed 600 MB for this tensor alone. Additional overhead such as model weights, activations, and attention buffers pushes the requirement far higher, reportedly past 200 GB at full precision.

The size of the video tensor grows linearly with the number of frames and quadratically with the spatial side length, and full self‑attention cost grows quadratically with the resulting token count. This explains why naïve bidirectional diffusion cannot meet real‑time latency constraints.
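The arithmetic in the worked example above can be reproduced with a few lines of Python. This is only a sketch: the 4-channel layout and the 4 bytes per float (float32) are assumptions read off the figures in the text, and the function names are made up for illustration.

```python
# Assumption: float32 storage, i.e. 4 bytes per value.
BYTES_PER_FLOAT32 = 4


def tensor_floats(num_frames, height, width, channels=4):
    """Number of scalar values in one video tensor of shape (N, H, W, C).

    channels=4 is an assumption that matches the trailing "x4" in the text.
    """
    return num_frames * height * width * channels


def tensor_bytes(num_frames, height, width, channels=4):
    """Size in bytes of that tensor when stored as float32."""
    return tensor_floats(num_frames, height, width, channels) * BYTES_PER_FLOAT32


if __name__ == "__main__":
    # The three settings discussed in the text: (frames, side length).
    for n, side in [(8, 32), (8, 256), (64, 256)]:
        floats = tensor_floats(n, side, side)
        mib = tensor_bytes(n, side, side) / 2**20
        print(f"N={n}, {side}x{side}: {floats:,} floats, {mib:.3f} MiB per step")
```

Doubling the side length multiplies the count by four, while doubling the number of frames only doubles it, which is the linear versus quadratic growth noted above.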

The latency gap between high‑quality diffusion foundations and real‑time interactive video remains the primary obstacle.

The minWM Pipeline

We detail the core pipeline that turns a bidirectional diffusion model into a low‑latency, camera‑controllable autoregressive video generator.

minWM stitches together data construction, camera‑aware fine‑tuning, AR training, and distillation into a single end‑to‑end workflow that yields a real‑time controllable video model.

How does minWM differ from simply fine‑tuning a diffusion model on video data?

minWM adds explicit camera conditioning, distills the bidirectional diffusion into an autoregressive model, and optimizes the inference pipeline for low latency; a vanilla fine‑tune would keep the original multi‑step bidirectional generation and lack real‑time view control.

Instead of denoising each frame over many diffusion steps, the AR model predicts the next frame in only a few steps, enabling interactive speeds.

Why not use a single‑step diffusion model for real‑time video?

A single step would discard the gradual refinement that diffusion provides, leading to severe visual artifacts; few‑step AR retains enough refinement to keep quality while still achieving interactive latency.
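The few-step autoregressive generation described above can be sketched as a simple loop. This is a toy illustration, not the framework's code: `denoise` is a hypothetical callable standing in for the distilled model, each frame is a single float rather than a latent tensor, the timestep schedule is invented, and the re-noising rule `(1 - t) * x0 + t * noise` is an assumed flow-matching-style interpolation.

```python
import random


def few_step_ar_rollout(denoise, num_frames, timesteps, seed=0):
    """Generate frames one at a time, using len(timesteps) denoiser calls each.

    denoise(x_t, t, context) -> estimate of the clean frame (hypothetical API).
    context is the tuple of frames generated so far (the causal history).
    """
    rng = random.Random(seed)
    frames, calls = [], 0
    for _ in range(num_frames):
        x = rng.gauss(0.0, 1.0)  # start the new frame from pure noise
        for i, t in enumerate(timesteps):
            x0 = denoise(x, t, tuple(frames))  # predict the clean frame
            calls += 1
            # Move to the next (lower) noise level; the last step ends at t = 0.
            next_t = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
            x = (1.0 - next_t) * x0 + next_t * rng.gauss(0.0, 1.0)
        frames.append(x)  # the frame is final and can be shown immediately
    return frames, calls


def toy_denoise(x_t, t, context):
    """Toy stand-in: each clean frame is the previous frame plus one."""
    return (context[-1] if context else 0.0) + 1.0


frames, calls = few_step_ar_rollout(
    toy_denoise, num_frames=3, timesteps=(1.0, 0.75, 0.5, 0.25)
)
```

The first frame is available after only `len(timesteps)` calls on a single frame, whereas a bidirectional sampler must finish every step over the whole clip before any frame can be displayed.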

Camera controllability means the model receives explicit camera intrinsics and pose for each frame, allowing the user to steer viewpoint during generation.

Does providing camera parameters force the model to learn geometry?

The parameters are injected into the attention computation, so the model learns to align visual features with the supplied camera transform rather than inferring geometry implicitly.

Distillation compresses a high‑quality but slow bidirectional diffusion teacher into a fast autoregressive student while preserving visual fidelity.

Why is an extra asymmetric DMD step needed after AR distillation?

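After the first distillation stages, the few‑step student can be no better than the AR teacher it learned from, and so inherits that teacher's limited generation quality. Asymmetric DMD uses the original bidirectional teacher to correct this bias, boosting fidelity without sacrificing speed.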

PRoPE augments each token’s query/key/value with a block‑diagonal matrix that encodes the frame’s camera intrinsics and pose, making attention explicitly aware of viewpoint.

As a worked example, take identity intrinsics and place frame 1 at the world origin. Then $\tilde{P}_1 = \begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix} I = I_{4}$ and $\tilde{P}_2 = \begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix} T_{cw_2}$, where $T_{cw_2}$ adds a translation of $1$ along the $x$ axis.

For a token $t$ in frame 2, form $D^{\text{PRoPE}}_t$ by taking the Kronecker product of $\mathrm{Id}_{d/8}$ with $\tilde{P}_2$ and appending rotary embeddings for the token's spatial position, here $x_t=2$, $y_t=3$.

Compute the relative transform $\tilde{P}_2 \tilde{P}_1^{-1}=\tilde{P}_2$; this matrix encodes the $1$‑unit camera shift.

When attention evaluates the interaction between a token from frame 1 and token $t$ from frame 2, the query‑key product passes through this relative transform, so the score depends explicitly on the camera motion between the two frames.

PRoPE makes the attention score a direct function of the relative camera pose, so viewpoint changes are reflected instantly in the attention map rather than being learned implicitly.
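The projective part of the worked example can be checked numerically. The sketch below uses plain Python lists, identity intrinsics, and an assumed feature width of $d=16$ (so $d/8=2$). The helper names are hypothetical, and the rotary-embedding block is left out.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(m)
    aug = [list(row) + id_row for row, id_row in zip(m, identity(n))]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kron(a, b):
    """Kronecker product of two dense matrices."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def lift_intrinsics(K):
    """Embed 3x3 intrinsics K into the 4x4 block matrix [[K, 0], [0, 1]]."""
    M = identity(4)
    for i in range(3):
        for j in range(3):
            M[i][j] = K[i][j]
    return M


def translation(tx, ty, tz):
    T = identity(4)
    T[0][3], T[1][3], T[2][3] = tx, ty, tz
    return T


K = identity(3)                                            # assumed identity intrinsics
P1 = matmul(lift_intrinsics(K), identity(4))               # frame 1 at the origin
P2 = matmul(lift_intrinsics(K), translation(1.0, 0.0, 0.0))  # frame 2 shifted in x
relative = matmul(P2, inverse(P1))                         # equals P2 since P1 = I
D_proj = kron(identity(2), P2)                             # Id_{d/8} (x) P2 with d = 16
```

Because `P1` is the identity, `relative` equals `P2`, and its top-right entry carries the one-unit shift along $x$. `D_proj` is block-diagonal with one copy of `P2` per 4-dimensional feature group.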

The pipeline converts the camera‑aware bidirectional diffusion model into a few‑step AR model through three stages: AR training, causal initialization, and asymmetric DMD.

Stage 1 – AR diffusion training: fine‑tune the camera‑controllable bidirectional model with a causal attention mask and teacher‑forcing, yielding an autoregressive diffusion model.
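The causal attention mask used in Stage 1 can be illustrated as follows. This sketch assumes a frame-level (block-causal) mask, in which a token may attend to every token in its own frame and in earlier frames. The text does not state the exact granularity used by minWM, and the function name is hypothetical.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumption: causality is enforced at frame granularity, so tokens within
    the same frame see each other, but no token sees a later frame.
    """
    total = num_frames * tokens_per_frame
    return [
        [(k // tokens_per_frame) <= (q // tokens_per_frame) for k in range(total)]
        for q in range(total)
    ]


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under teacher forcing, the earlier frames that each token attends to are taken from the ground-truth video rather than from the model's own outputs, so every frame can be trained in one parallel pass.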

Stage 2a – causal ODE initialization: sample a timestep $t$ from a predefined set $S$, regress the few‑step model $G_\theta$ from noisy frame $x_i^{t}$ to clean frame $x_i^{0}$ using trajectories generated by the AR teacher.

Stage 2b – causal CD initialization: replace ODE data with a single ODE step; train $G_\theta$ to match the teacher’s one‑step prediction, weighted by $w(t)$.

Stage 3 – asymmetric DMD: initialize the student from the Stage 2 model, self‑roll out full videos $\tilde{x}$, and minimize the KL divergence between the student’s marginal $p_{\theta,t}(\tilde{x}_t)$ and the bidirectional teacher’s distribution, using the score estimates $s_{\text{real}}$ (from the bidirectional teacher) and $s_{\text{fake}}$ (from a model that tracks the student’s own outputs).
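The three objectives can be written out in a hedged form. The expressions below follow the standard ODE-regression, consistency-distillation, and distribution-matching formulations and reuse the symbols from the stages above. Several things are assumptions rather than details from the source: the squared-error distance, the stop-gradient $\operatorname{sg}[\cdot]$, and the one-step teacher estimate $\hat{x}_i^{t'}$ at a lower noise level $t' < t$. Conditioning on earlier frames and per-timestep weights in the last line are omitted.

```latex
\begin{align*}
\mathcal{L}_{\text{ODE}}(\theta)
  &= \mathbb{E}_{t \in S,\; i}
     \left[ \left\| G_\theta(x_i^{t}, t) - x_i^{0} \right\|_2^2 \right], \\
\mathcal{L}_{\text{CD}}(\theta)
  &= \mathbb{E}_{t,\; i}
     \left[ w(t) \left\| G_\theta(x_i^{t}, t)
       - \operatorname{sg}\!\left[ G_\theta(\hat{x}_i^{t'}, t') \right] \right\|_2^2 \right], \\
\nabla_\theta \mathcal{L}_{\text{DMD}}
  &\approx \mathbb{E}_{t,\; \tilde{x}}
     \left[ \left( s_{\text{fake}}(\tilde{x}_t, t) - s_{\text{real}}(\tilde{x}_t, t) \right)
       \frac{\partial \tilde{x}}{\partial \theta} \right].
\end{align*}
```

The last line is the usual gradient of $\mathrm{KL}\left(p_{\theta,t} \,\|\, p_{\text{teacher},t}\right)$ expressed through the two score functions. The method is asymmetric in the sense that the student generates causally while $s_{\text{real}}$ comes from the bidirectional teacher.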

Distillation loop for the AR pipeline (high‑level pseudocode).
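A minimal structural sketch of that loop, in Python, is given here. It only shows the order of the stages and how a checkpoint is handed from one to the next. The stage functions are placeholders that tag a dictionary instead of training a model, and every name in the block (`Stage`, `run_pipeline`, the dictionary keys, the step count of 4) is hypothetical rather than taken from the minWM code base.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Stage:
    name: str
    supervision: str                 # what the stage learns from
    run: Callable[[Dict], Dict]      # placeholder for a real training routine


def ar_training(ckpt: Dict) -> Dict:
    ckpt["causal"] = True            # causal mask + teacher forcing
    return ckpt


def causal_init(ckpt: Dict) -> Dict:
    ckpt["sampling_steps"] = 4       # assumed few-step budget, for illustration
    return ckpt


def asymmetric_dmd(ckpt: Dict) -> Dict:
    ckpt["aligned_to"] = "bidirectional teacher"
    return ckpt


def run_pipeline(ckpt: Dict, stages: List[Stage]) -> Tuple[Dict, List[str]]:
    """Run the stages in order, passing the checkpoint from one to the next."""
    history = []
    for stage in stages:
        ckpt = stage.run(dict(ckpt))  # copy so earlier checkpoints stay intact
        history.append(stage.name)
    return ckpt, history


STAGES = [
    Stage("stage1_ar_training", "ground-truth video (teacher forcing)", ar_training),
    Stage("stage2_causal_init", "AR teacher from Stage 1", causal_init),
    Stage("stage3_asymmetric_dmd", "bidirectional teacher", asymmetric_dmd),
]

start = {"base": "camera-controllable bidirectional model", "causal": False}
final, history = run_pipeline(start, STAGES)
```

Each stage starts from the checkpoint produced by the previous one, which is why the Stage 2 initialization matters for the stability of Stage 3.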

**Figure 1.** Overview of minWM. minWM is a full-stack pipeline that converts T2V/TI2V foundation models into camera-controllable few-step autoregressive world models, covering data construction, controllable fine-tuning, AR training, distillation, and low-latency inference.

Empirical Evaluation and Ablations

Few-step AR models cut first‑frame latency by >200× while keeping camera control.

Few-step AR models achieve speedups of 223.75× to 236.64× over multi‑step bidirectional baselines while preserving camera‑controllable generation.

Table 1 shows a 223.75× speedup for HY1.5 and a 236.64× speedup for Wan2.1 when switching to the few‑step AR configuration.

**Figure 2.** Camera-controllable generation with the distilled few-step AR model. The model supports generation under different camera actions, showing that the distillation algorithm effectively preserves the camera controllability of the base model.

**Figure a.** Training with estimated camera poses. In our experiments, models trained directly on SpatialVid [34] data did not achieve reliable camera-controllable generation under our current setup, even after additional data filtering. We hypothesize that this may be related to the use of perception-estimated camera poses, which motivates our exploration of datasets with effectively ground-truth trajectories.

**Figure b.** Training with reconstructed scenes and rendered trajectories. By reconstructing scenes from DL3DV [35] and rendering videos along prescribed camera trajectories, the model successfully learns camera-controllable generation, indicating the importance of accurate camera trajectories.

**Figure c.** Training with WorldPlay-generated trajectories. For the open-source setting, we construct videos from OpenVid [36] and other image sources using WorldPlay [8] with specified camera trajectories, which likewise enables the model to learn camera controllability.

Latency reduction comes at the cost of careful training choices to retain camera controllability.

Conclusion and Future Directions

We recap minWM’s current abilities, experimental insights, and planned extensions.

minWM is a full‑stack open‑source framework that fine‑tunes bidirectional T2V or TI2V models for camera‑controllable generation and distills them into real‑time AR models. It currently supports HY1.5 and Wan2.1 base models. The authors plan to add further control conditions such as pose and to broaden model support.

Training directly on SpatialVid data with perception‑estimated camera poses failed to yield reliable camera‑controllable generation. The authors attribute this to inaccuracies in the estimated poses, prompting a shift toward datasets that provide ground‑truth trajectories.

When scenes are reconstructed from DL3DV and videos are rendered along prescribed camera trajectories, the model learns camera‑controllable generation successfully. Similarly, using WorldPlay to generate videos with explicit camera paths enables controllability in the open‑source setting.

Training progress shows a clear timeline: after 1k to 2k steps the HY1.5‑based bidirectional model lacks effective controllability; around 5k steps it begins to respond but remains unstable; after 8k steps it achieves strong, reliable camera‑controllable generation.

Batch size also impacts learning: batches smaller than 4 cause the Wan2.1‑based model to often fail, batch size 8 improves controllability substantially yet remains somewhat unstable, and batch size 16 completes the pipeline with high controllability.
