WorldOlympiad: Can Your World Model Survive a Triathlon?

Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu, Yinghao Yu, Jiasheng Tang, Fan Wang, Wei Wang, Bohan Zhuang

WorldOlympiad is a diagnostic benchmark for evaluating video world models on physical, geometric, and interactive fidelity.

How can we systematically evaluate whether video-generation models actually understand physical laws and spatial consistency, rather than just producing visually plausible but incoherent sequences?

Video generation models are increasingly used as world simulators, but existing benchmarks focus on visual aesthetics rather than whether the generated content obeys physical laws or maintains 3D consistency over time. WorldOlympiad evaluates these models by decomposing long-horizon video generation into three tracks: physical rule adherence, geometric stability via Gaussian splatting, and chunk-by-chunk interactive control. The benchmark reveals that while top models are improving at physical reasoning, they still struggle significantly with 3D geometric consistency and long-horizon state preservation.

Paper Primer

The benchmark addresses the "world-modeling gap" where models produce visually plausible but physically incoherent videos. It uses a three-stage pipeline—chunking, captioning, and refinement—to create a 1,000-video test set across robotics, gaming, and real-world domains, ensuring that evaluation is grounded in specific, controllable actions rather than passive observation.

WorldOlympiad evaluates models through a multi-dimensional judge system: it uses rule-based MLLM judges for physical mechanics and thermodynamics, Gaussian splatting to reconstruct 3D scenes for geometric consistency, and a hybrid CLIP-MLLM metric to score interaction fidelity across temporal boundaries. This approach allows the benchmark to pinpoint whether a model fails due to physical implausibility, spatial drift, or broken action continuity.

Large-scale model capacity correlates with higher overall world-modeling performance, but targeted physical training can compensate for smaller scale.

LingBot-World (14B parameters) achieved the highest overall score (0.683), while the smaller Cosmos-Predict-2.5 (2B parameters) reached a competitive 0.671 by prioritizing physical-world prediction. Top models achieved physical faithfulness scores exceeding 0.90, yet geometric consistency scores for all models remained below 0.43.

The WorldOlympiad automatic evaluation is highly aligned with human preference.

A controlled alignment study comparing automatic rankings with human annotators across eight models yielded a Spearman correlation coefficient of $\rho = 0.95$. The high correlation confirms that the automated diagnostic metrics capture the same quality distinctions perceived by human evaluators.

Why is 3D geometric consistency so difficult for current video world models to maintain?

The paper observes a trade-off: models that rely on camera control to maintain spatial layout often achieve better geometry scores but struggle with the complex, open-ended object manipulation and state transitions required for robust world simulation.

How does this benchmark differ from existing video evaluation suites like VBench?

Unlike VBench, which focuses on visual quality, aesthetics, and short-term temporal smoothness, WorldOlympiad specifically targets long-horizon world-modeling capabilities: physical law adherence, 3D geometric stability, and controllable interaction across multiple downstream domains.

Researchers should shift focus from purely visual metrics to diagnostic evaluation of physical and geometric consistency, as these are the primary bottlenecks preventing current video generators from functioning as reliable, interactive world simulators.

Introduction: The Triathlon of World Models

Introducing WorldOlympiad, a benchmark that tests video world models for physical, geometric, and interactive fidelity.

Current video generation benchmarks evaluate appearance, motion smoothness, and short‑term semantics, but they ignore whether generated videos obey physical laws, maintain coherent 3D structure, or follow user‑driven interactions over long horizons.

WorldOlympiad probes a video model’s ability to act as a true “world model” by testing three orthogonal capabilities: physical plausibility, 3‑D geometric consistency, and controllable interaction over extended sequences.

**Figure 1.** Overview of the WorldOlympiad pipeline for data collection, long-video generation, and multi-dimensional evaluation.

Experiments on eight state‑of‑the‑art long‑video generators expose systematic shortcomings in physical reasoning, 3‑D stability, and interaction control, confirming that visual quality alone is insufficient for world‑model evaluation.

The shift from aesthetic evaluation to physical faithfulness is essential for deploying video world models in gaming, robotics, and real‑world simulation.

Benchmark Design and Data

WorldOlympiad aggregates three video domains and a three‑stage pipeline to evaluate world models on physical and interaction consistency.

The benchmark combines three complementary video domains—robotics, gaming, and real‑world—to stress‑test video world models on manipulation, interactive control, and open‑domain dynamics. Each domain contributes a distinct set of physical and geometric challenges, ensuring a comprehensive evaluation.

How does WorldOlympiad differ from typical video generation benchmarks?

Typical benchmarks focus on visual fidelity or short‑term prediction; WorldOlympiad explicitly requires long‑range consistency, physical plausibility, and accurate action‑level captions, which forces models to maintain coherent dynamics across many seconds.

Stage I – Chunking: identify the main execution interval and split the video into up to six left‑closed, right‑open chunks with no temporal gaps.

Stage II – Caption: run Gemini‑3‑Pro‑Preview on each chunk to produce an action label (camera movement) and a descriptive English caption.

Stage III – Refine: feed the full video and the ordered chunk captions back to Gemini‑3‑Pro‑Preview to correct hallucinations, harmonize terminology, and ensure narrative continuity.

As an illustration on a gaming clip, chunking creates three intervals: [0‑60), [60‑120), [120‑180).

Captioning produces: (1) “Player moves forward, encounters enemy”; (2) “Player defeats enemy, picks up loot”; (3) “Player exits arena, camera pans to horizon”.

Refinement merges the three captions, fixes the duplicated “player” reference, and adds a consistent action label “W” for forward movement in the first chunk.

The refinement step eliminates cross‑chunk inconsistencies that would otherwise cause the MLLM judge to penalize the model for contradictory narratives.
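The chunking constraint in Stage I (at most six left-closed, right-open chunks with no temporal gaps) can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: `split_interval` and `has_no_gaps` are hypothetical helper names, and equal-length splitting is an assumption, since the source does not say how chunk boundaries are chosen.

```python
from typing import List, Tuple

MAX_CHUNKS = 6  # the benchmark allows up to six chunks per video


def split_interval(start: int, end: int, num_chunks: int) -> List[Tuple[int, int]]:
    """Split [start, end) into contiguous left-closed, right-open chunks.

    Equal-length splitting is an assumption made for illustration only.
    """
    if not 1 <= num_chunks <= MAX_CHUNKS:
        raise ValueError(f"num_chunks must be between 1 and {MAX_CHUNKS}")
    if end - start < num_chunks:
        raise ValueError("interval is too short for the requested number of chunks")
    length = end - start
    # Integer boundaries; the first is `start` and the last is always `end`.
    bounds = [start + (length * i) // num_chunks for i in range(num_chunks + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def has_no_gaps(chunks: List[Tuple[int, int]]) -> bool:
    """True if every chunk starts exactly where the previous one ends."""
    return all(prev[1] == cur[0] for prev, cur in zip(chunks, chunks[1:]))


chunks = split_interval(0, 180, 3)
print(chunks)               # [(0, 60), (60, 120), (120, 180)]
print(has_no_gaps(chunks))  # True
```

Because each chunk's right endpoint is the next chunk's left endpoint, every index in the execution interval belongs to exactly one chunk.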

**Figure 2.** Data collection overview across robotics, gaming, and real-world video sources.

**Figure 3.** Data standardization pipeline from raw videos to refined action-caption annotations.

**Figure 4.** Pipeline statistics for data processing, annotation coverage, and evaluation-ready samples.

Related Work

We position prior video‑generation and world‑model work relative to our benchmark.

Diffusion‑based video generation models have shown emergent physical consistency—object permanence, 3D coherence, and plausible motion—yet they are typically trained for short clips of 5–10 seconds, limiting their use as persistent world‑model simulators. Block diffusion addresses this by performing iterative denoising within each temporal block and conditioning on prior blocks via cross‑block KV caching, preserving intra‑block quality while enabling scalable long‑horizon synthesis.

Video generation models are increasingly deployed as World Models for interactive domains such as game generation (e.g., GameGen‑X, Matrix Game) and robotics simulation, where they provide policy generation and data augmentation. Maintaining a persistent world state and supporting real‑time interaction remains challenging, prompting two research strands: implicit memory mechanisms (e.g., LongLive’s KV caching) and explicit 3D memory mechanisms, with recent hybrid approaches like MosaicMem and Inspatio World combining both.

Existing video‑generation benchmarks (VBench, VBench 2.0) cover visual quality, motion, and semantic consistency, while newer world‑model benchmarks evaluate physical law adherence and simulation fidelity. However, none provide unified coverage across gaming, robotics, and general scene generation, nor do they assess interactive functionality—a gap that the proposed WorldOlympiad benchmark aims to fill.

**Table 1.** Comparison of existing benchmarks across evaluation metrics and video tasks.

Defining Physical Faithfulness

This section defines how physical, geometric, and interaction fidelity are measured for video world models.

The evaluation suite splits fidelity into three orthogonal tracks—physical, geometric, and interaction—each scored by a dedicated judge and then averaged for a final leaderboard rank.

Physical Faithfulness checks whether a generated video obeys basic physics rules such as gravity, buoyancy, compression, and impact.

Is Physical Faithfulness just a measure of visual realism?

No. It specifically evaluates rule compliance—whether the observed motion or deformation follows the underlying physical law, independent of how aesthetically pleasing the frame looks.

3D Spatial Consistency quantifies how faithfully the static scene geometry reconstructed from a video matches the ground‑truth layout.

Why not collapse the three geometry signals into a single depth error?

Each signal captures a distinct failure mode: $S_{\text{recon}}$ checks static layout, $S_{\text{meta}}$ checks cross‑view coherence, and $S_{\text{traj}}$ checks dynamic camera motion. A single depth error would miss many of these aspects.

Interaction Quality measures whether a chunk‑wise generated video follows its textual instruction, transitions smoothly between chunks, and remains globally coherent.

How does the CLIP component differ from the MLLM judge?

CLIP computes a cosine similarity between frame and caption embeddings, offering a fast proxy for semantic alignment. The MLLM judge, by contrast, reasons over the entire chunk or video, scoring visual quality, text alignment, and temporal smoothness.
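To make the CLIP side of this comparison concrete, here is a minimal sketch of averaging frame-caption cosine similarities over one chunk. The embeddings are toy vectors chosen by hand; no real CLIP encoder is called, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def clip_chunk_score(frame_embeddings: Sequence[Sequence[float]],
                     caption_embedding: Sequence[float]) -> float:
    """Mean cosine similarity between each sampled frame and the chunk caption."""
    sims = [cosine_similarity(f, caption_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)


# Toy 2-D stand-ins for embeddings: one frame matches the caption, one is orthogonal.
frames = [[1.0, 0.0], [0.0, 1.0]]
caption = [1.0, 0.0]
print(clip_chunk_score(frames, caption))  # 0.5
```

The score depends only on per-frame embeddings, which is why it is fast but blind to temporal structure; that part is left to the MLLM judge.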

Run an MLLM to detect the most relevant moving or deforming entities in the generated video.

Apply SAM3 to obtain object‑centric masks and trajectories for those entities.

A relevance judge checks whether the target physical phenomenon appears in the ground‑truth reference video; irrelevant metrics are discarded.

A compliance judge compares the generated video against the reference, outputs a pass/fail decision, a confidence score, and a short explanation.

Average compliance scores within each subset (mechanics, thermodynamics, material).

Average the subset scores that apply (at most three) to obtain the overall physical fidelity $S_{\text{phys}}$.

As a worked example, consider a video of a ball falling into water. The relevance judge confirms both gravity and buoyancy are present in the ground‑truth video.

The compliance judge sees the ball accelerate downward (gravity pass) and the water surface rise slightly (buoyancy pass) and returns confidence 0.92 for each.

Mechanics subset average = (1 + 0.92)/2 = 0.96; thermodynamics subset is not applicable and omitted.

Overall $S_{\text{phys}}$ = 0.96 (since only mechanics contributes).

The example shows how the pipeline discards irrelevant subsets and aggregates only the applicable rule scores.
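The two-level averaging in this example can be sketched in a few lines. This is an illustrative reading of the steps above, not the benchmark's code: `physical_score` is a hypothetical name, and how a pass/fail decision and a confidence value become a single compliance score is not specified in the source, so the per-rule scores are taken as given inputs.

```python
from statistics import mean
from typing import Dict, List, Optional


def physical_score(subset_scores: Dict[str, List[float]]) -> Optional[float]:
    """Average compliance scores within each subset, then across subsets.

    Subsets with no applicable rules (empty lists) are skipped, as in the
    worked example where only mechanics contributes.
    """
    per_subset = [mean(scores) for scores in subset_scores.values() if scores]
    if not per_subset:
        return None  # no physical rule was relevant to this video
    return mean(per_subset)


# Per-rule scores copied from the worked example above.
scores = {
    "mechanics": [1.0, 0.92],   # gravity, buoyancy
    "thermodynamics": [],       # not applicable, omitted
    "material": [],             # not applicable, omitted
}
print(round(physical_score(scores), 2))  # 0.96
```

Skipping empty subsets rather than scoring them as zero keeps a video from being penalized for phenomena that never appear in its reference.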

Render a Gaussian‑Splat video $\bar{V}$ from the generated frames (up to $N\le32$ frames).

If object masks are available, remove foreground Gaussians so the 3‑D judge focuses on static geometry.

Run Depth Anything 3 to recover a Gaussian scene $G$ and per‑frame extrinsics $E_i$ and intrinsics $K_i$.

Compute reconstruction score $S_{\text{recon}}$ by feeding the rendered static view $\hat{V}_{\text{GS}}$ to the calibrated MLLM judge.

Compute meta‑view score $S_{\text{meta}}$ by feeding the diagnostic image $\hat{I}_{\text{meta}}$ to the same judge.

Recover camera trajectories $\{\hat{T}_i\}$, align them to the reference, and compute trajectory score $S_{\text{traj}}$ via the adaptive aggregation $\text{A}_{\text{motion}}$.

Average the three bounded scores to obtain $S_{3D}= \tfrac{1}{3}(S_{\text{recon}}+S_{\text{meta}}+S_{\text{traj}})$.

Divide the generated video into $T$ chunks $\{v_i\}$ and sample $m_i$ frames per chunk (default $m_i=8$).

Encode each sampled frame and its chunk caption with CLIP, compute cosine similarity, and average to obtain $S_{\text{clip}}$.

Clamp $S_{\text{clip}}$ to $[0,1]$ using thresholds $\tau_{\min}=0.20$, $\tau_{\max}=0.40$ to get $\tilde{S}_{\text{clip}}$.

Query the MLLM at three levels: chunk scores $a_i$, transition scores $b_i$, and a global score $g$, each clipped to $[0,5]$.

Aggregate to $S_{\text{chunk}}$, $S_{\text{trans}}$, $S_{\text{global}}$ and then to $S_{\text{mllm}} = \tfrac{1}{3}(S_{\text{chunk}}+S_{\text{trans}}+S_{\text{global}})$.

Combine the two components: $S_{\text{interact}} = (1-\lambda)S_{\text{mllm}} + \lambda\tilde{S}_{\text{clip}}$ with $\lambda=0.1$.

Finally, average the three core tracks: $S_{\text{all}} = \tfrac{1}{3}(S_{\text{phys}} + S_{3D} + S_{\text{interact}})$.
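The interaction and overall formulas above can be put together in a short sketch. Several details are assumptions rather than facts from the source: the CLIP score is rescaled linearly between $\tau_{\min}$ and $\tau_{\max}$ before clamping, the 0 to 5 MLLM ratings are divided by 5 to reach $[0,1]$, and chunk and transition ratings are combined by a plain mean. All function names and input values are hypothetical.

```python
from typing import Sequence


def normalize_clip(s_clip: float, tau_min: float = 0.20, tau_max: float = 0.40) -> float:
    """Rescale between the thresholds (assumed linear), then clamp to [0, 1]."""
    scaled = (s_clip - tau_min) / (tau_max - tau_min)
    return min(1.0, max(0.0, scaled))


def _normalized_mean(ratings: Sequence[float], max_rating: float = 5.0) -> float:
    """Clip each rating to [0, max_rating] and map the mean to [0, 1] (assumption)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    clipped = [min(max_rating, max(0.0, r)) for r in ratings]
    return sum(clipped) / (len(clipped) * max_rating)


def mllm_score(chunk: Sequence[float], trans: Sequence[float], global_rating: float) -> float:
    """S_mllm: equal-weight mean of chunk, transition, and global scores."""
    s_chunk = _normalized_mean(chunk)
    s_trans = _normalized_mean(trans)
    s_global = _normalized_mean([global_rating])
    return (s_chunk + s_trans + s_global) / 3.0


def interaction_score(s_mllm: float, s_clip_norm: float, lam: float = 0.1) -> float:
    """S_interact = (1 - lambda) * S_mllm + lambda * normalized CLIP score."""
    return (1.0 - lam) * s_mllm + lam * s_clip_norm


def overall_score(s_phys: float, s_3d: float, s_interact: float) -> float:
    """S_all: equal-weight mean of the three core tracks."""
    return (s_phys + s_3d + s_interact) / 3.0


# Hypothetical ratings for a three-chunk video (two chunk boundaries).
s_mllm = mllm_score(chunk=[4, 5, 3], trans=[4, 4], global_rating=4)
s_interact = interaction_score(s_mllm, normalize_clip(0.30))
print(round(s_mllm, 2), round(s_interact, 2))  # 0.8 0.77
```

With $\lambda = 0.1$ the MLLM judge carries 90% of the interaction score, so the CLIP term can shift it by at most 0.1.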

Benchmark Results

Key performance shifts across video world models and the remaining gaps.

LingBot‑World attains the highest overall score, surpassing the runner‑up by a modest margin.

Table 3 shows LingBot‑World with an All score of 0.683 versus Cosmos‑Predict‑2.5’s 0.671.

OpenWorldLib is the benchmark harness that runs each video‑generation pipeline on the same set of prompts and feeds the outputs to the WorldOlympiad evaluator.

Current models struggle significantly with long‑range physical consistency despite high visual quality.

Failure Modes and Qualitative Analysis

Human rankings closely match WorldOlympiad’s automatic scores, confirming alignment with human judgment.

WorldOlympiad repeatedly exposes three core ways generated videos break realism: implausible physics, broken 3‑D geometry, and incoherent interactions.

**Figure 6.** Representative WorldOlympiad case studies detected by the benchmark. The upper examples show high-quality generations that preserve the intended physical behavior, scene structure, or interaction state, while the lower examples show typical failure cases with visible rule violations, geometric inconsistency, or interaction drift.

**Table 13.** Representative case studies and the corresponding diagnostic signals.

Human preference rankings are highly consistent with WorldOlympiad’s automatic scores.

Spearman correlation $\rho$ = 0.95 across eight annotated models (Table 4).

Detailed Physical Diagnostics

Prompt templates define the judges that produce domain‑wise physical scores.

The benchmark supplies a family of prompt templates that drive four evaluation pipelines: dynamic‑object extraction, physical consistency, interaction quality, and 3D reconstruction. Each pipeline is implemented as a specialized MLLM judge that returns structured JSON scores.

**Table 5.** Judge-related prompt families used by WorldOlympiad.

A.1 Dynamic‑Object Extraction asks the MLLM to return the fewest (≤3) moving or deforming foreground nouns, discarding static background and ambiguous entities.

**Table 6.** Detailed WorldOlympiad scores on the same-scene subset across gaming, robotics, and general domains. All is the equal-weight average of Physical, 3D Cons., and Interact.

A.2 Physical judges first filter applicable physics questions (relevance) and then verify compliance against the reference video, requiring plausible motion, material identity, and temporal order.

**Table.** Performance evaluation of various pipelines across Gaming, Robotics, and General domains, measured by metrics including Gravitation, Buoyancy, Compression, Impact, Melting, Sublimation, Vaporization, Condensation, Deposition, Freezing, Color, Solubility, Hardness, and Combined score.

A.3 Interaction judges evaluate each generated chunk, the transition between adjacent chunks, and the stitched full video for visual quality, caption alignment, and long‑range consistency.

**Table 8.** Physical question pass rates on the same-scene subset.

Case Studies and Diagnostic Signals

Fine-grained interaction, geometry, and case‑study analyses reveal where world models break.

Tables 9 and 10 break down interaction and geometry performance on the same‑scene subset. The interaction scores (Chunk, Trans., Global, Long Range, Global Text, CLIP) measure caption following, boundary smoothness, and long‑range consistency, while the geometry scores ($S_{\text{recon}}$, $S_{\text{meta}}$, $S_{\text{traj}}$, 3D Cons.) assess reconstruction fidelity, meta‑view quality, camera‑trajectory alignment, and overall spatial coherence.

**Figure 7.** Robotics case study from WorldOlympiad. The example visualizes how the benchmark diagnoses physical interaction, object-state consistency, and temporal coherence in robotics world-model rollouts.

In the robotics scenario the benchmark checks whether the arm reaches the correct object, respects gravity (no floating parts), and maintains consistent object states across frames; failures appear as misplaced bounding boxes or abrupt scene changes.

**Figure.** A comparison of video generation results based on the prompt: "Along the way, they pass by patches of red flowers and several NPCs". The top row (marked with a green thumbs-up) shows successful generation where the camera moves forward, revealing red flowers and NPCs. The bottom row (marked with a red thumbs-down) shows unsuccessful generation where the scene remains static or fails to incorporate the requested elements.

For gaming rollouts the benchmark expects the avatar to follow the prompt (“drive the vehicle forward”) while preserving scene layout; failures include sudden perspective jumps, missing background, or inconsistent vehicle orientation.

**Figure.** Prompt: The camera begins with a view of an ancient, open-air amphitheater at night

Open‑domain videos are judged on three axes: physical plausibility (e.g., a frisbee must follow a ballistic arc), 3D consistency (stable background and object geometry), and interaction fidelity (continuous temporal evolution rather than static loops).

**Figure 8.** Gaming case study from WorldOlympiad. The example highlights how interactive game rollouts expose action-following, scene-state preservation, and cross-chunk transition failures.

Aggregating five annotators over 560 prompts yields 2,800 pairwise judgments. The resulting Spearman rank correlation between the human preference ranking and the automatic WorldOlympiad ranking is $\rho=0.95$, which confirms that the benchmark’s metrics align closely with human judgments; the two rankings disagree on only two adjacent model pairs.
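As a sanity check on these figures, the sketch below computes Spearman's $\rho$ for two tie-free rankings of eight models that differ by two adjacent swaps. The rankings are hypothetical placeholders, not the paper's data, and the swaps are assumed to involve four distinct models.

```python
from typing import Sequence


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman's rho for two tie-free rankings of the same n items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical ranks for eight models; positions 1-2 and 6-7 are swapped.
automatic = [1, 2, 3, 4, 5, 6, 7, 8]
human = [2, 1, 3, 4, 5, 7, 6, 8]
print(round(spearman_rho(automatic, human), 2))  # 0.95
```

Each adjacent swap adds 2 to the sum of squared rank differences, so two swaps give $1 - 6 \cdot 4 / (8 \cdot 63) \approx 0.952$, which matches the reported value.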
