AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Yu Li, Menghan Xia, Gongye Liu, Xintao Wang, Conglang Zhang, Lei Ke, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Kun Gai, Yujiu Yang

AnchorWorld enables embodied egocentric simulation by combining full-body motion control with pose-associated anchor views for localized world customization.

How can we synthesize controllable, egocentric video environments that allow for both human-action-driven movement and customizable, evolvable scene layouts?

Egocentric simulators struggle to maintain consistent environments because first-person videos lack full-body visibility, making it difficult to ground human actions or specify what exists outside the current field of view. AnchorWorld addresses this by using third-person videos to supervise full-body motion and introducing "anchor views"—localized RGB images paired with 3D poses and evolution prompts—to explicitly define and evolve world states at specific coordinates. This hybrid approach allows the model to synthesize egocentric videos that remain spatially consistent even when the agent moves to regions previously unseen in the initial frame.

Paper Primer

The core mechanism hinges on a projection-based training strategy: the model learns to map 3D human motion into 2D visual observations by training on third-person data where the full body is visible, then adapting to first-person head-mounted perspectives. To customize the world, the model treats anchor views as in-context priors, injecting their spatial poses and evolution prompts into the generation process to ensure local scene states remain consistent across changing viewpoints.

AnchorWorld significantly improves scene consistency and action accuracy compared to existing egocentric simulators.

Quantitative evaluation across static and dynamic test sets, including Unreal Engine and real-world captures, shows superior performance in camera trajectory accuracy (RRE) and semantic consistency (CLIP-V) metrics. The method achieves the best results in scene consistency and camera accuracy across all tested scenarios, with notable generalization to out-of-distribution scenes where initial ego-views have limited overlap with anchor views.

Why is third-person video data necessary for an egocentric simulator?

In first-person videos, most of the human body is outside the camera's field of view, making supervision of full-body motion sparse and weakly aligned. Third-person videos provide the complete interaction context and full-body visibility required to learn robust spatial grounding between human motion and scene responses.

How does the model maintain scene consistency when the agent moves to a new location?

The model uses pose-associated anchor views that are grounded in a unified 3D coordinate system. By injecting these anchor views as in-context priors and coupling them with spatial pose embeddings, the model can retrieve and preserve local scene states even when those regions were not visible in the initial frame.

Introduction

We expose the controllability gap in interactive world models and introduce AnchorWorld to fill it.

Current interactive world models struggle to offer the fine‑grained, evolvable control needed in real‑world applications. From a first‑person perspective they receive only sparse visual cues, while the underlying environment is defined implicitly, leaving no way to prescribe or evolve local scene elements.

An interactive world model is a generative system that predicts future visual observations while allowing external agents to intervene and modify the scene.

AnchorWorld addresses these gaps by introducing two complementary controls: (1) hybrid‑view action conditioning that leverages third‑person video to supervise full‑body motion, and (2) pose‑associated anchor views that supply localized visual priors, 3‑D poses, and evolution prompts for scene customization.

**Figure 1.** Showcasing AnchorWorld. (a) AnchorWorld synthesizes egocentric videos conditioned on human action and initial ego-view frame. (b) It further enables world customization with conditional anchor views, which provide local appearance, 3D pose, and evolution prompts for scene evolution.

The core gap is that existing world models cannot be both interactive and evolvable; AnchorWorld bridges this by coupling egocentric action control with localized, pose‑anchored scene customization.

Hybrid-View Action Control

Combining third‑person and egocentric views enables robust action control across viewpoints.

In egocentric videos most of the body lies outside the camera view, so supervision of full‑body action is sparse. To remedy this, we augment training with Third‑Person View (TPV) videos that capture the whole body and its interaction with the scene.

We fuse the full‑body motion sequence with the camera trajectory, letting the model project 3D human motion into any viewpoint; this lets a single network learn from both TPV and First‑Person View (FPV) data.

As a toy example with two frames, the motion encoder maps $M$ to $z_m$ of shape $2 \times 2 \times d$ (here $d=4$) by a linear projection.

Camera encoder maps $C$ to $z_c$ of shape $2 \times 1 \times d$ (same $d$).

Video latent $z_v(t)$ (e.g., from a pretrained VAE) is $2 \times 4 \times d$.

We concatenate to obtain $T$ of shape $2 \times (4+2+1) \times d = 2 \times 7 \times d$.

Spatial self‑attention mixes information across the 7 tokens per frame; the Truncate operator removes the last three pose tokens, leaving $v$ of shape $2 \times 4 \times d$.

This toy example shows that the same motion can be projected onto different camera viewpoints while keeping the video token count unchanged, which is the core of hybrid‑view control.
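The toy example above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper name `hybrid_view_block` is hypothetical, random arrays stand in for the encoder outputs, and the attention uses plain dot products without learned query/key/value projections.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hybrid_view_block(z_v, z_m, z_c):
    """Concatenate per-frame tokens, self-attend, then truncate.

    z_v: (F, 4, d) video tokens, z_m: (F, 2, d) motion tokens,
    z_c: (F, 1, d) camera tokens. Hypothetical helper for illustration.
    """
    tokens = np.concatenate([z_v, z_m, z_c], axis=1)          # (F, 7, d)
    d = tokens.shape[-1]
    # Assumption: dot-product attention with no learned projections.
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)  # (F, 7, 7)
    mixed = softmax(scores, axis=-1) @ tokens                 # (F, 7, d)
    # Truncate: keep the video tokens, drop the 3 trailing pose tokens.
    return mixed[:, : z_v.shape[1], :]


rng = np.random.default_rng(0)
F, d = 2, 4
z_v = rng.normal(size=(F, 4, d))
z_m = rng.normal(size=(F, 2, d))
z_c_a = rng.normal(size=(F, 1, d))  # stand-in for camera viewpoint A
z_c_b = rng.normal(size=(F, 1, d))  # stand-in for camera viewpoint B

v_a = hybrid_view_block(z_v, z_m, z_c_a)
v_b = hybrid_view_block(z_v, z_m, z_c_b)
print(v_a.shape, v_b.shape)  # both (2, 4, 4)
```

The same motion tokens paired with a different camera token change the mixed video tokens, while the output shape stays equal to the video latent shape.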

Why can’t we train directly on FPV videos without TPV data?

FPV videos hide most limbs, so the model never sees the full body‑scene interaction. TPV videos supply those missing cues, letting the network learn a view‑invariant mapping from 3D motion to visual observations before it is asked to synthesize egocentric views.
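For intuition about what projecting 3D motion into a viewpoint means geometrically, the sketch below applies a standard pinhole projection to the same 3D joints under two cameras. It illustrates the geometry only; the model described here learns the mapping from data rather than computing it in closed form. The function `project_joints`, the intrinsics, the joint coordinates, and both camera placements are illustrative assumptions.

```python
import numpy as np


def project_joints(joints_w, R, t, f=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of world-space joints (J, 3) to pixels (J, 2).

    R (3, 3) and t (3,) map world to camera coordinates: x_c = R @ x_w + t.
    Also returns a boolean mask of joints in front of the camera (z > 0).
    """
    joints_c = joints_w @ R.T + t
    z = joints_c[:, 2]
    in_front = z > 0
    safe_z = np.where(in_front, z, np.nan)  # joints behind the camera become NaN
    u = f * joints_c[:, 0] / safe_z + cx
    v = f * joints_c[:, 1] / safe_z + cy
    return np.stack([u, v], axis=1), in_front


# Hypothetical joints (metres, y pointing down): head, hand, foot.
joints = np.array([[0.0, 0.0, 0.0],
                   [0.3, 0.2, 0.4],
                   [0.0, 1.2, 0.0]])
identity = np.eye(3)

# Third-person camera placed 3 m behind the subject, looking along +z.
uv_tpv, in_front_tpv = project_joints(joints, identity, np.array([0.0, 0.0, 3.0]))
# First-person camera located at the head, looking along +z.
uv_fpv, in_front_fpv = project_joints(joints, identity, np.zeros(3))

print(in_front_tpv)  # all joints are in front of the third-person camera
print(in_front_fpv)  # only the outstretched hand is in front of the head camera
```

In this toy setup the third-person camera observes every joint, while the head-mounted camera observes only the hand, which mirrors why first-person supervision of full-body motion is sparse.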

We introduce training signals gradually: first the model learns pure action from third‑person videos, then adapts to egocentric action, then we add static anchor‑view cues, and finally we enable dynamic evolution prompts.

Stage I – Train the action backbone on large‑scale TPV videos (full‑body visible).

Stage II – Fine‑tune the same backbone on FPV videos, aligning the camera pose with the head orientation.

Stage III – Add static anchor‑view inputs (RGB image $I_i$ and pose $c_i$) and train the 3‑D‑attention module.

Stage IV – Incorporate evolution prompts $t_i$ via cross‑attention, enabling dynamic scene evolution.

**Figure 3.** Progressive multi-stage training strategy. Stage I: TPV action training; Stage II: FPV action training; Stage III: static anchor-view customization; Stage IV: dynamic anchor-view evolution.
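The four stages can be written down as a small configuration table. This is a hypothetical encoding for illustration: the field names and condition labels are assumptions, and it assumes that motion and camera conditioning are both active from Stage I onward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    conditions: tuple  # conditioning signals active in this stage


# Hypothetical summary of the progressive schedule; labels are illustrative.
CURRICULUM = (
    Stage("I", "TPV videos", ("motion", "camera")),
    Stage("II", "FPV videos", ("motion", "camera")),
    Stage("III", "static scenes",
          ("motion", "camera", "anchor_image", "anchor_pose")),
    Stage("IV", "dynamic scenes",
          ("motion", "camera", "anchor_image", "anchor_pose", "evolution_prompt")),
)


def newly_added(curriculum):
    """Return, per stage, the conditions not active in any earlier stage."""
    seen, out = set(), {}
    for stage in curriculum:
        out[stage.name] = [c for c in stage.conditions if c not in seen]
        seen.update(stage.conditions)
    return out


for name, added in newly_added(CURRICULUM).items():
    print(name, added)
```

Listing only what each stage adds makes the incremental structure explicit: Stage II changes the data but not the conditions, and each later stage introduces one new group of signals.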

Why is the curriculum ordered from TPV to FPV to anchor views rather than the reverse?

Learning a robust projection from 3‑D motion to visual tokens requires full‑body observations, which TPV provides. Once the projection is stable, aligning it to the head‑centric FPV is straightforward. Adding anchor‑view cues before the action backbone is reliable would confuse the pose encoder with noisy world priors; the staged order ensures each new signal builds on a solid foundation.

Anchor-View Customization

Anchor views inject localized image, pose, and text priors into the video latent stream for targeted world customization.

Customizing a world model requires localized control over visual appearance, spatial pose, and temporal evolution. Anchor views supply these priors without altering the base video architecture.

Anchor views act as plug‑in priors that provide a local image, a 3‑D pose, and a text prompt, which are merged with the video latent stream so the model can ground synthesis to specific scene elements.

Concatenate along the frame axis: $T_{\text{total}} = [z_v(t); z_s] = \begin{bmatrix}v_{11}\\ v_{21}\\ v_{12}\\ v_{22}\\ a_{1}\\ a_{2}\end{bmatrix}$ (3 frames × 2 positions).

Pose embedding $z_{\text{pose}} = \begin{bmatrix}p_{1}\\ p_{2}\\ p_{3}\end{bmatrix}$ is broadcast to each spatial position and added: each token becomes $v_{ij}+p_{i}$ or $a_{k}+p_{3}$.

For a text prompt $t_1$, the mask $M$ allows attention only to rows 1‑4 (video) and rows 5‑6 (its anchor), blocking any other hypothetical anchors.

Self‑attention then aggregates information respecting the mask, yielding a representation that combines video content with the anchored image and pose.

The example shows that anchor‑view tokens are not merely extra frames; distinct positional embeddings and the mask keep their influence localized, enabling precise control over where and how the priors affect synthesis.
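The concatenation, pose broadcast, and prompt mask described above can be sketched as follows. The helper names `build_tokens` and `prompt_mask` are hypothetical, random arrays stand in for real latents, and the sketch uses two anchor views (rather than one) so that the blocking of other anchors is visible.

```python
import numpy as np


def build_tokens(z_v, z_s, z_pose):
    """Concatenate video and anchor latents along the frame axis and add poses.

    z_v: (F, P, d) video latents, z_s: (K, P, d) anchor-view latents,
    z_pose: (F + K, d), one pose embedding per frame or anchor view.
    Returns the flattened token matrix of shape ((F + K) * P, d).
    """
    total = np.concatenate([z_v, z_s], axis=0)  # (F + K, P, d)
    total = total + z_pose[:, None, :]          # broadcast over spatial positions
    return total.reshape(-1, total.shape[-1])


def prompt_mask(n_frames, n_anchors, n_pos):
    """mask[k, r] is True if token row r may attend to prompt k."""
    n_video_rows = n_frames * n_pos
    mask = np.zeros((n_anchors, n_video_rows + n_anchors * n_pos), dtype=bool)
    mask[:, :n_video_rows] = True               # video rows see every prompt
    for k in range(n_anchors):
        start = n_video_rows + k * n_pos
        mask[k, start:start + n_pos] = True     # anchor k sees only its own prompt
    return mask


rng = np.random.default_rng(0)
F, K, P, d = 2, 2, 2, 4
z_v = rng.normal(size=(F, P, d))
z_s = rng.normal(size=(K, P, d))
z_pose = rng.normal(size=(F + K, d))

tokens = build_tokens(z_v, z_s, z_pose)
mask = prompt_mask(F, K, P)
print(tokens.shape)      # (8, 4)
print(mask.astype(int))
```

With two anchors, the first prompt's mask row covers the four video rows plus rows 5 and 6, and the second prompt's row covers the video rows plus rows 7 and 8.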

How does concatenating anchor‑view tokens differ from simply appending extra video frames?

Concatenation adds new tokens, but RoPE assigns them unique frame‑axis positions and the subsequent mask restricts attention to the appropriate subset. Plain extra frames would share the same positional IDs and could interfere with the video sequence, whereas anchor views remain isolated and controllable.
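The positional argument can be illustrated with a minimal one-dimensional rotary embedding along the frame axis. The sketch assumes anchor views receive frame indices that continue after the video frames; the text only states that their positions are unique, so this indexing, the function name `rope_1d`, and the base frequency are assumptions.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """Apply rotary position embedding to x (N, d) at positions pos (N,)."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)                # (d / 2,)
    angles = np.asarray(pos, dtype=float)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x_even * cos - x_odd * sin                # rotate each 2D pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


n_video_frames, n_anchors = 2, 2
video_pos = np.arange(n_video_frames)                # frame indices 0, 1
anchor_pos = n_video_frames + np.arange(n_anchors)   # assumed indices 2, 3

token = np.ones((1, 4))
as_video = rope_1d(token, video_pos[:1])    # position 0 leaves the token unchanged
as_anchor = rope_1d(token, anchor_pos[:1])  # same content, rotated differently
print(np.allclose(as_video, as_anchor))     # False
```

Because the rotation depends on the frame index, identical content placed at an anchor position is distinguishable from the same content placed at a video-frame position, while its norm is unchanged.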

Stage I – Train on large‑scale third‑person (TPV) videos to learn generic action‑conditioned generation.

Stage II – Fine‑tune on first‑person (FPV) videos, aligning the camera trajectory with the character’s head pose.

Stage III – Train on static scenes with pose‑aware anchor views, teaching the model to respect spatial constraints while roaming egocentrically.

Stage IV – Incorporate dynamic data and evolution prompts, enabling text‑driven local state changes for each anchor view.

Experimental Results

Quantitative and qualitative evaluation of AnchorWorld against baselines in egocentric action control.

Our method improves camera accuracy by 0.021 over the next‑best baseline.

Table 1 shows 0.885 for our method versus 0.864 for PlayerOne‑Scene, the closest competitor.

Wan2.2 TI2V is a 5‑billion‑parameter image‑to‑video backbone that generates 77‑frame, 480p videos from a single image prompt.

**Figure 9.** Visualization results of egocentric action control. We show the results compared with baseline methods and our ablation settings.

**Figure 4.** Qualitative Comparison. The gray mask denotes the human action and its location in the anchor view. During inference, the gray-masked region in the anchor view is inpainted. Red wireframes visualize the 3D anchor-view poses. Our method achieves better egocentric action control, scene consistency under large viewpoint changes, and dynamic scene evolution.

**Figure 5.** Qualitative comparison on rendered UE scenes and real-world captured scenes.

AnchorWorld consistently outperforms prior baselines in egocentric action control, delivering the highest camera accuracy and superior scene consistency.

Ablation Studies

Ablations reveal each design choice’s impact and confirm the model’s ability to handle unseen scene dynamics.

We first assess how each design choice influences performance. Table 2 reports quantitative and visual effects of removing individual components.

Removing Stage‑I third‑person video training or the projection‑based control design degrades both quantitative metrics and visual scene consistency. Likewise, omitting anchor‑view pose or anchor‑view RoPE harms spatial perception, confirming their role in pose‑aware conditioning and view discrimination.

We also test the model’s ability to infer unseen actors. By varying when the egocentric viewpoint switches, the model reveals a hidden person earlier or later, matching the caption’s timing (e.g., a person standing up from a sofa).

Additional Analyses

We quantify how anchor‑view design choices affect scene consistency and control.

The paper’s core idea is to embed localized “anchor views” that supply visual and 3‑D priors, enabling both human‑action and scene‑state control.

Adding more anchor views markedly improves scene‑consistency scores.

Table 5 shows a rise from 4 074.94 with a single anchor view to 4 233.59 with three views.

An evolution prompt is a textual description of how the scene should change over time; the model conditions each generated frame on this prompt together with the static anchor view.

How does Evolution Prompt Control differ from a standard text prompt?

Standard prompts only describe the target appearance, whereas an evolution prompt encodes a temporal trajectory of state changes; the model uses the trajectory to steer frame‑by‑frame updates anchored to the anchor view.

**Figure 8.** Evolution prompt control. We demonstrate that, within the same anchor-view image, modifying the evolution prompt enables control over different state changes.
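One way to picture this setup is as a small record per anchor view, where only the prompt changes between runs. The class `AnchorView`, its field names, the file name, the pose parametrization, and the second prompt are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class AnchorView:
    image_path: str           # localized RGB image
    pose: Tuple[float, ...]   # 3D pose in the shared world frame
    evolution_prompt: str     # how the local state should change over time


base = AnchorView(
    image_path="anchor_sofa.png",          # hypothetical file name
    pose=(0.0, 0.0, 2.0, 0.0, 0.0, 0.0),   # hypothetical (x, y, z, roll, pitch, yaw)
    evolution_prompt="A person stands up from the sofa.",
)

# Same image and pose, different evolution prompt (illustrative wording).
variant = replace(base, evolution_prompt="A person remains seated on the sofa.")

print(base.image_path == variant.image_path)              # True
print(base.evolution_prompt == variant.evolution_prompt)  # False
```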

Qualitative comparisons on egocentric action control (Section C.2) reveal that our projection‑based method yields accurate body‑motion synthesis, while baselines either miss motion cues or lack full viewpoint control.

**Figure 6.** Out-of-Sight Scene Evolution. We show that our model can infer scene evolution beyond the observed view by varying the timing of the action-induced viewpoint transition. Even when dynamic scene elements are not visible, our model can still reason about their state changes.

**Figure 7.** Spatial Pose Awareness. We horizontally flip the human pose and anchor-view pose while keeping the anchor-view image fixed, creating overlapping and non-overlapping view settings.

Table 4 quantifies third‑person action control: our full 6‑D pose representation reduces WA‑MPJPE to 28.01, a substantial gain over joint‑position‑only baselines.

Table 6 reports fine‑grained VBench metrics; across all scene categories our method attains the highest scores on subject consistency, background consistency, and aesthetic quality.

Limitations and Failure Modes

Failure cases: inconsistent details, motion blur, and prompt limitations.

AnchorWorld relies on localized anchor views to let users edit scene elements; however, when a local region contains complex structures and rich texture, the VAE’s 16× spatial downsampling discards fine‑grained information, producing inconsistent results.

**Figure 1.** (a) Inconsistent fine-grained details

Rapid viewpoint changes in first‑person training videos introduce motion blur, which the model reproduces; combined with the base model’s limitations, generated hands often appear degraded in visual quality.

Table 7 presents the Evolution Prompt template that structures queries for external character description in egocentric video; it specifies the analyst role, objective, detection phase rules, description strategy, four description aspects, and strict constraints to avoid hallucinations.

Adopting higher‑capacity base models with lower downsampling ratios is expected to alleviate both the fine‑detail inconsistency and the hand‑quality degradation observed here.

Implementation and Supplemental Details

Appendix provides implementation specifics, training stages, and limitations.

Section A details the implementation pipeline, starting with the progressive training schedule summarized in Table 3.

**Table 3.** Training stages and settings.

All videos are resized to 480p while preserving aspect ratio, and each stage runs on 16 NVIDIA GPUs with a batch size of 16, a learning rate of $1\times10^{-4}$, and the AdamW optimizer.

During training we randomly drop pose conditions and anchor‑view information with a 5 % probability, and at inference we use 50 denoising steps with a classifier‑free guidance scale of 5.
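The two settings above correspond to the usual classifier-free guidance recipe, sketched here in plain Python. The function names are hypothetical, and the sketch assumes each condition is dropped independently; the text does not say whether pose and anchor-view conditions are dropped jointly or separately.

```python
import random


def drop_conditions(conditions, p_drop=0.05, rng=random):
    """Independently replace each condition with None with probability p_drop."""
    return {k: (None if rng.random() < p_drop else v)
            for k, v in conditions.items()}


def cfg_combine(uncond, cond, scale=5.0):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Training side: occasionally hide conditions so the model also learns
# an unconditional prediction.
rng = random.Random(0)
batch = {"pose": "pose_tensor", "anchor_view": "anchor_tensor"}  # placeholders
print(drop_conditions(batch, p_drop=0.05, rng=rng))

# Inference side: push the prediction away from the unconditional output.
print(cfg_combine([0.0, 1.0], [1.0, 1.0], scale=5.0))  # [5.0, 1.0]
```

At each of the 50 denoising steps, a combination of this form would be applied to the conditional and unconditional predictions; a scale of 1 reduces it to the conditional prediction alone.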

Anchor views are third‑person images that serve as visual priors. The datasets sometimes contain the player in these images, but we do not inpaint the player out, because inpainting introduces artifacts that would degrade image quality.

At inference time, providing clean anchor‑view images does not hurt performance, since the model learns to ignore the player from first‑person supervision and relies on pose cues to resolve spatial relationships.

Egocentric videos from Ego‑Exo4D are undistorted from fisheye format and brightened modestly to compensate for low illumination, while color mismatches between third‑person and egocentric captures are handled by the model’s shared pose conditioning.

Human motion and anchor‑view poses are estimated with GVHMR, yielding 22 body joints per subject; hand joints are omitted because egocentric hand visibility is unreliable.

In multi‑person scenes we manually verify subject assignments, correcting errors and discarding samples with poor motion estimates.

Evolution prompts are generated by Qwen3‑VL‑32B‑Instruct using curated templates (see Table 7).

Section B outlines current limitations: the system focuses on short clips and lacks mechanisms for long‑term world exploration, real‑time autoregressive interaction, and memory for extended horizons.

Training data is confined to a limited set of scenarios; expanding to open‑world collections is a future direction.

Dynamic scenario diversity is restricted because current egocentric datasets provide only a single evolution description per anchor view; future work will aim to model anchor‑specific dynamics.
