World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Zefu Lin, Rongxu Cui, Junjia Xu, Xiaojuan Jin, Wenling Li, Lue Fan, Zhaoxiang Zhang

World Pilot steers VLA policies by injecting video-pretrained dynamics latents and action trajectories.

How can we improve VLA robot manipulation by steering the model with future-scene and action-trajectory priors from a World-Action Model?

Vision-Language-Action (VLA) models rely on static image-text pretraining, leaving them blind to how scenes evolve under physical action and fragile to shifts in viewpoint or geometry. World Pilot augments these policies by routing two priors from a frozen World-Action Model (WAM) into the VLA: a scene-evolution latent injected via cross-attention at the perception layer, and a trajectory-level motion token injected at the action generator. This dual-steering approach achieves a state-of-the-art 84.7% success rate on the LIBERO-Plus out-of-distribution benchmark, with the largest gains appearing under viewpoint and pose shifts.

Paper Primer

The core challenge is that VLA models lack an internal account of dynamics. While image-text pretraining provides semantic grounding, it cannot predict how a scene changes when the robot moves, causing the policy to fail when test-time conditions—like camera angle or object state—drift from the training distribution.

World Pilot solves this by treating a video-pretrained WAM as a frozen "dynamics advisor." It uses Latent Steering to inject spatiotemporal cues into the VLM's hidden states, and Action Steering to bias the action generator with a coarse motion trajectory. The system is like a navigator providing a driver with both a map of the road ahead and a suggested path, allowing the driver to remain in control of the steering wheel.

World Pilot significantly improves robustness to out-of-distribution (OOD) visual and physical shifts.

On the LIBERO-Plus benchmark, which tests 10,030 perturbed tasks including camera, light, and background shifts, World Pilot reaches an 84.7% total success rate. It outperforms the next best baseline by 2.6 points, with a 13.2-point gain specifically on camera-viewpoint perturbations.

Why inject the WAM's output as a latent rather than just generating a future image?

Pixel-space outputs contain action-irrelevant details like lighting, texture, and background artifacts that dilute the dynamics signal. The latent representation encodes only the structure of scene evolution, which is more directly useful for control.

Does this approach require co-training the world model and the policy?

No. The WAM remains frozen throughout fine-tuning, and its outputs can be precomputed. This prevents the VLA's training gradients from corrupting the WAM's pretrained world priors.

For researchers building VLA policies, this paper demonstrates that video-pretrained models can be effectively used as frozen priors to bridge the gap between static semantic grounding and dynamic physical control.

Motivation and Problem Framing

We expose the VLA grounding gap and introduce World Pilot to inject dynamics priors from a World‑Action Model.

Vision‑Language‑Action (VLA) models inherit semantic grounding from large‑scale image‑text pretraining, which is inherently static. Manipulation, however, is a continuous, contact‑rich process whose dynamics cannot be captured by static pairs. Consequently, VLAs become fragile when viewpoint, geometry, or contact conditions drift from the training distribution.

World Pilot augments a VLA policy with two priors extracted from a frozen World‑Action Model (WAM). Latent Steering injects a scene‑evolution latent into the perception layer, while Action Steering supplies an anticipated trajectory as a single prefix token to the action generator. Together these priors give the VLA an anticipatory view of the scene and a motion hint alongside its semantic conditioning.

VLAs lack a model of how the environment will change under their own actions because their pretraining only sees static image‑text pairs.

**Figure 1.** World Pilot steers a VLA with priors from a World-Action Model. VLA methods generate actions from a VLM’s encoding of the scene. World Pilot adds two priors from a WAM into the decision chain, with *Latent Steering* routing a scene-evolution latent into VLM hidden states and *Action Steering* feeding a trajectory-level motion prior to the action generator. This gives the VLA an anticipated view of the scene and a motion hint alongside its semantic conditioning. World Pilot reaches state-of-the-art performance on LIBERO-Plus and real-robot tasks.

World Pilot attains a state‑of‑the‑art total success rate of 84.7% on the LIBERO‑Plus zero‑shot OOD benchmark and the highest success on every real‑robot setting across four manipulation tasks. Margins are largest under shifts in viewpoint, geometry, deformable state, and pose, confirming that the injected dynamics priors improve robustness. Ablations demonstrate that each pathway—Latent Steering and Action Steering—contributes independently to these gains.

The disconnect between static pretraining and dynamic manipulation limits VLA robustness.

The World Pilot Architecture

World Pilot augments a VLA policy with dynamics priors from a video‑trained world model.

Standard Vision‑Language‑Action (VLA) policies inherit semantic grounding from image‑text pretraining but miss how scenes evolve under actions, leading to poor physical fidelity.

The World-Action Model (WAM) is a video-trained model that, given the same observation and instruction as the policy, predicts how the scene will change and what coarse action trajectory will occur.

How does the WAM differ from the Vision‑Language Model used for semantic encoding?

The WAM is trained on video sequences and learns to predict future visual states and motion, whereas the VLM is trained on static image‑text pairs and only provides a semantic embedding of the current frame.

Cross‑attention injects the predicted future‑scene latent into each VLM token, letting the perception layer anticipate dynamics without breaking the token order.

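Compute the weighted sums over the latent rows, each of which holds the value $0.5$ in every dimension. With attention weights $(0.6, 0.4)$, token 1 receives $0.6\cdot0.5+0.4\cdot0.5 = 0.5$; with weights $(0.3, 0.7)$, token 2 receives $0.3\cdot0.5+0.7\cdot0.5 = 0.5$; and so on for the remaining tokens, yielding a cross-attention output of $\begin{bmatrix}0.5&0.5\\0.5&0.5\\0.5&0.5\\0.5&0.5\end{bmatrix}$.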

Add the output to $H_t$: $\bar{H}_t = H_t + \text{CrossAttn}(H_t, D_w^t) = \begin{bmatrix}1.5&0.5\\0.5&1.5\\1.5&1.5\\0.5&0.5\end{bmatrix}$.

The residual preserves the original token order while each token now carries a dynamics cue (the added $0.5$ in every dimension).

The residual form lets the model keep its original semantic token layout, so downstream modules that expect a fixed sequence still operate unchanged.
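The arithmetic above can be checked with a small sketch. It is a toy illustration, not the paper's implementation. The values of `H` are inferred by subtracting $0.5$ from the stated result, the attention weights for tokens 3 and 4 are assumed, and the weights are written by hand rather than computed from a softmax over query-key scores. A second latent, `D_varied`, is a hypothetical addition that shows what happens when the latent rows differ.

```python
# Toy residual cross-attention update: H_bar = H + A @ D.
# Attention weights are written by hand; a real layer would compute them
# with a softmax over query-key scores.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def residual_update(h, attn, latent):
    """Return h + attn @ latent, keeping the token order of h."""
    out = matmul(attn, latent)
    return [[x + y for x, y in zip(h_row, o_row)]
            for h_row, o_row in zip(h, out)]

# H_t: 4 tokens x 2 dims, inferred from the worked result (assumption).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Attention weights: rows 1-2 follow the text, rows 3-4 are assumed.
A = [[0.6, 0.4], [0.3, 0.7], [0.5, 0.5], [0.9, 0.1]]

# Case 1: both latent rows equal [0.5, 0.5], as in the worked example.
D_uniform = [[0.5, 0.5], [0.5, 0.5]]
H_bar_uniform = residual_update(H, A, D_uniform)

# Case 2 (hypothetical): latent rows differ, so each token gets its own cue.
D_varied = [[1.0, 0.0], [0.0, 1.0]]
H_bar_varied = residual_update(H, A, D_varied)

for row in H_bar_uniform:
    print([round(v, 3) for v in row])
```

Because every row of `D_uniform` is identical and each row of attention weights sums to one, every token receives the same $0.5$ offset regardless of the weights. With `D_varied`, token 1 and token 2 receive different offsets, which is the case where attention weights actually matter.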

Why not simply add a learned bias vector to $H_t$ instead of using cross‑attention?

Cross‑attention lets each token attend to the most relevant parts of the future‑scene latent, which can differ across spatial regions. A uniform bias would apply the same dynamics signal to every token, discarding spatial specificity.

A single prior token summarises the anticipated trajectory, giving the action generator a high‑level motion cue while leaving per‑step generation free.

Why encode the trajectory as one token instead of feeding the full sequence?

Encoding a single token preserves flexibility: the generator can still shape the final action chunk based on both the dynamics‑enhanced hidden states and the high‑level motion cue, whereas feeding the full sequence would lock the generator to the WAM’s exact step‑by‑step predictions.
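As a rough illustration of compressing a trajectory into one token, the sketch below resamples a coarse trajectory to a fixed horizon and projects it to a single vector. The linear interpolation, the flatten-and-project encoder, the function names `resample` and `encode_token`, and the weight values are all assumptions made for this example; the source does not specify how the alignment or the encoder is implemented.

```python
def resample(traj, k):
    """Linearly interpolate a trajectory (list of action vectors) to k steps."""
    n = len(traj)
    if n == 1 or k == 1:
        return [list(traj[0]) for _ in range(k)]
    out = []
    for i in range(k):
        pos = i * (n - 1) / (k - 1)   # fractional index into traj
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(traj[lo], traj[hi])])
    return out

def encode_token(traj, weights):
    """Project a flattened trajectory to one token (stand-in linear encoder)."""
    flat = [x for step in traj for x in step]
    return [sum(w * x for w, x in zip(row, flat)) for row in weights]

# Hypothetical WAM trajectory: 3 steps of 2-D actions.
A_w = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
K = 5                                  # policy action horizon (assumed value)
aligned = resample(A_w, K)             # now 5 steps of 2-D actions

# Hand-picked 2 x 10 projection: row 0 averages everything,
# row 1 copies the last coordinate of the final step.
W = [[0.1] * 10, [0.0] * 9 + [1.0]]
s_w = encode_token(aligned, W)         # a single 2-D prior token

print(aligned)
print(s_w)
```

Whatever the input length, the generator sees exactly one extra token, so the size of its input sequence does not depend on how many steps the WAM predicts.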

Collect visual observation $O_t$, language instruction $\ell$, and optional proprioceptive state $q_t$.

Encode $O_t$ and $\ell$ with the Vision‑Language Model to obtain hidden states $H_t$.

Run the frozen World‑Action Model $W_\phi$ to produce $Z_w^t$ and $A_w^t$.

Apply Latent Steering: compute $D_w^t = f_{\text{dyn}}(Z_w^t)+\rho_{\text{fut}}$, then update $\bar{H}_t = H_t + \text{CrossAttn}(H_t, D_w^t)$.

Apply Action Steering: align $A_w^t$ to horizon $K$, encode to token $s_w^t$, and prepend it to the generator input.

Generate a noisy trajectory $X_{\tau,t} = \tau A_t^\star + (1-\tau)\epsilon$ and feed $[u_t; s_w^t; Q_t; X_{\tau,t}]$ together with $\bar{H}_t$ to the flow‑matching action generator $g_\theta$.

Compute the weighted L2 loss $\mathcal{L}_{\text{World Pilot}}$ (Equation 4) and back‑propagate to update $\theta$.
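The last two steps can be sketched as follows. The interpolation matches the formula $X_{\tau,t} = \tau A_t^\star + (1-\tau)\epsilon$ given above, so $\tau = 0$ is pure noise and $\tau = 1$ is the clean action chunk. The regression target $A_t^\star - \epsilon$ is the derivative of that path with respect to $\tau$, which is the usual flow-matching target; treating it as the target here is an assumption, since Equation 4 is not reproduced in this summary. The scalar `weight` is a placeholder for whatever weighting the paper applies, and the generator $g_\theta$ itself is not modeled.

```python
import random

def interpolate(a_star, eps, tau):
    """X_tau = tau * A_star + (1 - tau) * eps, applied elementwise."""
    return [[tau * a + (1 - tau) * e for a, e in zip(a_row, e_row)]
            for a_row, e_row in zip(a_star, eps)]

def target_velocity(a_star, eps):
    """d X_tau / d tau for the linear path above: A_star - eps."""
    return [[a - e for a, e in zip(a_row, e_row)]
            for a_row, e_row in zip(a_star, eps)]

def weighted_l2(pred, target, weight=1.0):
    """Mean squared error scaled by a placeholder weight (assumption)."""
    sq = [(p - t) ** 2
          for p_row, t_row in zip(pred, target)
          for p, t in zip(p_row, t_row)]
    return weight * sum(sq) / len(sq)

rng = random.Random(0)
K, DIM = 4, 2                                    # horizon and action size (toy values)
a_star = [[0.1 * (k + d) for d in range(DIM)] for k in range(K)]   # ground-truth chunk
eps = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(K)]
tau = rng.random()

x_tau = interpolate(a_star, eps, tau)            # noisy input for the generator
v_target = target_velocity(a_star, eps)          # what the generator should predict

# A perfect prediction gives zero loss; predicting all zeros does not.
loss_perfect = weighted_l2(v_target, v_target)
loss_zero = weighted_l2([[0.0] * DIM for _ in range(K)], v_target)
print(loss_perfect, loss_zero)
```

In a training loop, `x_tau` would be placed in the generator input alongside the prior token and the steered hidden states, and only the generator parameters would receive gradients because the WAM stays frozen.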

Empirical Results and Ablations

World Pilot sets the new OOD benchmark on LIBERO‑Plus.

World Pilot achieves the highest total success rate on LIBERO‑Plus.

The total success rate averages $84.7\%$ over three random seeds, beating the next best method by $2.6$ points (see Table 1).

**Table 1.** Simulation results on LIBERO, LIBERO-Plus, and RoboCasa. All LIBERO-Plus numbers come from training on LIBERO only and evaluating zero-shot on its OOD perturbations. The LIBERO-Plus numbers for Cosmos Policy [23] and DreamVLA [59] are our own runs, as is the RoboCasa number for ABot-M0 [6], and the remaining LIBERO-Plus baselines are taken from ABot-M0 and Being-H0.7 [66]. We rerun ABot-M0 on RoboCasa because the ABot-M0 paper reports RoboCasa on the GR1 split rather than the original benchmark used here.
