Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA, Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova

Cosmos 3 unifies language, vision, audio, and action into a single Mixture-of-Transformers architecture for Physical AI.

How can a single unified model architecture jointly process and generate language, vision, audio, and action sequences to serve as a foundation for Physical AI?

Physical AI agents currently rely on a fragmented stack of specialized models—Vision-Language Models for reasoning, video generators for simulation, and action models for control—which is computationally inefficient and prevents shared world representations. Cosmos 3 introduces a unified Mixture-of-Transformers architecture that treats language, image, video, audio, and action as a single, interleaved token stream, using separate parameter towers for reasoning and generation while sharing a common latent space. This omnimodal approach establishes a new state-of-the-art across understanding and generation tasks, enabling a single model to function as a VLM, world simulator, and policy model without architectural changes.

Paper Primer

The core mechanism is a dual-tower Mixture-of-Transformers (MoT) design that routes autoregressive reasoning tokens and diffusion-based generation tokens through independent parameter sets. These towers interact via a dual-stream joint attention mechanism, allowing the generator to condition its output on the reasoner's latent state without the reasoner being updated by the generator.

To handle diverse physical embodiments, the model maps heterogeneous control inputs—such as robot gripper states or vehicle steering—into a unified action interface using domain-specific projection layers. This allows the model to learn shared geometric priors across different physical domains while maintaining embodiment-specific semantics.

Cosmos 3 achieves state-of-the-art performance across a diverse suite of Physical AI benchmarks.

The model was ranked as the top open-source Text-to-Image and Image-to-Video model by Artificial Analysis and the leading policy model by RoboArena at the time of the report, indicating top-tier performance across both generative and embodied reasoning tasks.

Why use a Mixture-of-Transformers (MoT) instead of a single monolithic transformer?

The MoT architecture separates the distinct computational requirements of autoregressive reasoning (which requires causal attention) and iterative denoising generation (which benefits from full bidirectional attention), allowing the model to optimize for both understanding and synthesis without parameter interference.

How does the model handle the different temporal resolutions of video, audio, and action data?

The model uses Absolute Temporal Modulation, which aligns tokens from different modalities onto a shared physical temporal axis by scaling the temporal index increment based on the specific sampling rate (TPS) of each modality.

Researchers can now replace fragmented pipelines with a single, omnimodal backbone that learns shared world dynamics, significantly simplifying the development of embodied agents that require both reasoning and physical interaction.

Abstract

Introducing a unified omnimodal model to close the gap between fragmented modality‑specific systems.

Current Physical AI pipelines stitch together separate vision, language, and action models, incurring costly data pipelines and inconsistent representations. Cosmos 3 proposes a single Mixture‑of‑Transformers backbone that can ingest and emit any combination of language, image, video, audio, or action tokens, eliminating the need for task‑specific model composition.

The Need for Omnimodal World Models

Physical AI needs a unified, scalable model to overcome real‑world training bottlenecks.

Training Physical AI agents directly in the real world is prohibitively slow, costly, and risky, so researchers rely on simulated environments to obtain safe, scalable learning.

Physical AI agents perceive, reason about, and act within the physical world, which requires both accurate perception and forward‑looking planning.

Existing pipelines treat perception and simulation as separate modules—Vision‑Language Models (VLMs) for understanding, video‑generation or forward‑dynamics models for simulation, and Vision‑Language‑Action or World‑Action models for action prediction—leading to fragmented, compute‑heavy systems.

Cosmos 3 unifies these pillars in a single Mixture‑of‑Transformers backbone, jointly modeling language, image, video, audio, and action so the same network can switch between understanding, generation, and world‑action modes without architectural changes.

**Figure 1.** Cosmos 3 serves as a general-purpose backbone for Physical AI. By jointly modeling language, image, video, audio, and action for both understanding and generation, Cosmos 3 serves as a single network architecture for various model classes, including vision-language models, image generation models, audio-visual generation models, policy or world-action models, forward dynamics models, and inverse dynamics models.

**Figure 2.** Cosmos 3 offers a strong starting point for training Physical AI agents. Cosmos 3 can be post-trained on target data for distinct applications without architectural modifications. In this paper, we demonstrate how we post-train Cosmos 3 for better synthetic data generation (Sec. 4.2.3 and Sec. 4.2.4) and better robot policy (Sec. 4.2.5). In the future, we expect Cosmos 3 to play an essential role in generating high-quality, complex environments for training Physical AI agents.

The primary bottleneck for Physical AI agents is the cost and risk of real‑world training; a unified, omnimodal backbone like Cosmos 3 offers a scalable route to acquire both understanding and generation in simulation.

The Mixture-of-Transformers Architecture

Cosmos 3 packs multimodal inputs into a shared token stream and processes them with a dual‑tower transformer.

Cosmos 3 embeds language, vision, audio, and action into a common latent space and feeds the packed sequence to a Mixture‑of‑Transformers backbone that separates reasoning from generation.

MoT splits the transformer into a Reasoner tower for autoregressive tokens and a Generator tower for diffusion tokens, letting each specialize while sharing a common attention matrix.

How does MoT differ from a single transformer that processes all tokens together?

In a single transformer every token shares the same feed‑forward and attention parameters, so the model must learn a compromise between causal prediction and bidirectional denoising. MoT gives each task its own parameters, preserving the optimal inductive bias for each while still allowing the two streams to exchange information through the shared attention matrix.

The Reasoner tower consumes the autoregressive subsequence, producing next‑token predictions for language and vision while respecting causality.

Why not reuse the same parameters for both AR and diffusion tokens?

AR prediction benefits from a causal bias that prevents information leakage, whereas diffusion denoising needs full context to reconstruct corrupted tokens. Sharing parameters would force a trade‑off that harms both tasks; separate towers keep each bias intact while still letting the diffusion tokens attend to the AR context through the shared attention operator.

The Generator tower handles the diffusion subsequence, iteratively denoising noisy tokens to reconstruct images, video, audio, and actions.

Why does the Generator need bidirectional attention while the Reasoner stays causal?

The Generator must condition on the entire context (the full AR subsequence as well as every other diffusion token) to correctly reconstruct a noisy latent; bidirectional attention provides that full view. The Reasoner, by contrast, must generate tokens sequentially, so a causal mask is required to keep it from seeing tokens it has not yet produced.

Action tokens encode the physical transition between consecutive video states as a compact pseudo‑action built from relative pose and grasp components.

Compute $z = W^{\text{in}}_{\text{robot}}\, x$ → a 64‑dimensional latent vector (each entry is a linear combination of the 9 input components).

The Generator tower processes $z$ through several attention layers, producing an updated latent $z'$.

Decode $x' = W^{\text{out}}_{\text{robot}}\, z' + b^{\text{out}}_{\text{robot}}$, yielding a reconstructed translation of $(0.099, -0.051, 0.001)$ m and negligible rotation.

Even with a tiny toy example, the linear projections learn to preserve the essential geometric information while allowing the shared backbone to operate on a uniform dimensionality.

Why use a domain‑specific linear projection instead of a single shared one?

Different embodiments have heterogeneous action vector lengths and semantics (e.g., a vehicle has only ego pose, a robot has ego + effector + grasp). Separate projections let each domain map its raw vector into the common 64‑dim latent space without wasting capacity on padding or discarding information.
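The idea of per-domain input projections can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random weights stand in for learned layers, the 9-component robot vector follows the toy example above, and the second domain name and its 15-component length are hypothetical.

```python
import random

LATENT_DIM = 64  # shared latent width stated in the text

# Raw action-vector lengths per domain. "robot": 9 follows the toy example;
# "other_embodiment": 15 is a hypothetical value chosen only to differ.
RAW_DIMS = {"robot": 9, "other_embodiment": 15}


def make_projection(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned linear layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, x):
    """Plain matrix-vector product: one output entry per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


rng = random.Random(0)
W_in = {name: make_projection(dim, LATENT_DIM, rng) for name, dim in RAW_DIMS.items()}

robot_action = [0.1, -0.05, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
other_action = [0.0] * 15

robot_latent = project(W_in["robot"], robot_action)
other_latent = project(W_in["other_embodiment"], other_action)
print(len(robot_latent), len(other_latent))  # both 64
```

Each domain keeps its own matrix shape, so no raw vector has to be padded or truncated before it reaches the shared backbone.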

3D MRoPE assigns each token a triplet $(t, h, w)$ so that attention can respect temporal and spatial proximity across modalities.

Language token $i=0$: set $t=h=w=0$.

Video patch tokens $i=1\ldots4$: $t=1$ for all, $h,w$ take values $(0,0), (0,1), (1,0), (1,1)$.

Audio token $i=5$: $t=2$, $h=w=0$.

Action token $i=6$: $t=3$, $h=w=0$.

Even with a minimal example, the triplet encoding cleanly separates modalities: only the video tokens share spatial coordinates, while all others collapse to a single point in the $h\!-\!w$ plane.
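The coordinate assignment in this toy sequence can be reproduced with a short sketch. It assumes each segment simply takes the next temporal index, as in the example; the modality offset and FPS scaling described elsewhere are deliberately left out, and the function name is hypothetical.

```python
def assign_mrope(segments):
    """Assign (t, h, w) triplets to a packed sequence.

    `segments` is a list of (modality, height, width). Each segment takes the
    next temporal index (an assumption matching the toy example). Non-visual
    modalities use height = width = 1, so they collapse to h = w = 0.
    """
    coords = []
    for t, (modality, height, width) in enumerate(segments):
        for h in range(height):
            for w in range(width):
                coords.append((modality, t, h, w))
    return coords


sequence = [("language", 1, 1), ("video", 2, 2), ("audio", 1, 1), ("action", 1, 1)]
coords = assign_mrope(sequence)
for token_index, entry in enumerate(coords):
    print(token_index, entry)
```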

How does 3D MRoPE differ from the standard 1D RoPE used for pure language models?

Standard RoPE encodes a single scalar position per token, which suffices when only temporal order matters. 3D MRoPE adds two spatial axes, allowing the model to distinguish tokens that are temporally aligned but spatially separated (e.g., different image patches), and to treat modalities with only temporal information as points in the spatial plane.

Token arrangement follows a simple rule: the autoregressive subsequence appears first, then the diffusion subsequence; within each subsequence, modalities are ordered vision → audio → action, and clean conditioning tokens precede their noisy counterparts.

Generation modes are defined by which tokens are treated as clean versus noisy. Forward dynamics denoises vision conditioned on clean actions; inverse dynamics denoises actions conditioned on clean vision; the policy mode denoises both simultaneously.
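Because the modes differ only in which modalities are noised, they can be expressed as a small lookup. The sketch below is illustrative; the mode keys and function name are hypothetical labels for the three modes named above.

```python
# Modalities that are noised (and therefore denoised) in each mode.
# Anything else present in the sample stays clean and acts as conditioning.
NOISY_BY_MODE = {
    "forward_dynamics": {"vision"},
    "inverse_dynamics": {"action"},
    "policy": {"vision", "action"},
}


def split_clean_noisy(mode, modalities=("vision", "action")):
    """Return (clean, noisy) modality lists for a generation mode."""
    noisy = NOISY_BY_MODE[mode]
    clean = [m for m in modalities if m not in noisy]
    return clean, [m for m in modalities if m in noisy]


for mode in NOISY_BY_MODE:
    print(mode, split_clean_noisy(mode))
```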

**Figure 5.** Mixture-of-Transformers (MoT) architecture of Cosmos 3. Left: a single transformer operates on one token sequence comprising the autoregressive (AR) and diffusion (DM) subsequences: AR carries discrete text tokens and, optionally, ViT-encoded vision tokens, ending with <EOS> and a begin-of-generation token <BOG>, while DM carries continuous tokens from their respective encoders, noise-perturbed during training. Here we visualize all input tokens as noisy for simplicity; for generation modes such as image-to-video or video transfer, clean conditioning tokens precede the noisy targets within DM; see Sec. 2.2.2. Within each transformer block, AR tokens and DM tokens are processed by independent LayerNorms and MLPs (all co-initialized from a pre-trained VLM) and meet only at a shared self-attention operator. Let Q, K, and V be query, key, and value vectors in attention, where the subscript indicates which tower it is in. `Q_AR` attends causally over `K_AR`, `V_AR` only, while `Q_DM` attends bidirectionally over the concatenated [`K_AR`; `K_DM`] and [`V_AR`; `V_DM`]. In this way, diffusion is conditioned on the AR context, while AR remains autoregressively self-contained. Outputs are next-token predictions for Reasoner and denoised tokens for Generator (trained in practice with a flow-matching objective predicting velocity; we show the clean target here for clarity). Right: the attention mask, causal for AR and full for diffusion.
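The attention pattern described in the caption (causal for AR queries over AR keys, full for DM queries over the concatenated AR and DM keys) can be written as a boolean mask. This is a minimal sketch with a hypothetical function name; it builds only the mask, not the attention computation itself.

```python
def build_mot_mask(n_ar, n_dm):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens 0..n_ar-1 form the AR subsequence; the remaining n_dm tokens form
    the DM (diffusion) subsequence, which follows it in the packed sequence.
    """
    n = n_ar + n_dm
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_ar:
                # AR query: causal, which also restricts it to AR keys.
                mask[q][k] = k <= q
            else:
                # DM query: bidirectional over [AR; DM].
                mask[q][k] = True
    return mask


mask = build_mot_mask(n_ar=3, n_dm=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The first three printed rows form a lower triangle and never reach the DM columns, which is why the AR stream stays self-contained while the DM stream is conditioned on it.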

**Figure 3.** **Unified action representation.** We map heterogeneous embodiment controls into compact action vectors built from shared geometric components. Ego and effector motions are encoded as relative-pose pseudo-actions using 3D translation and 6D rotation (an over-parameterized rotation representation by Zhou et al. (2019), as the degree of freedom of rotation is 3), while grasp states directly encode the current manipulation state, such as fingertip positions for hands or gripper open/close values for robots. Domain-aware input and output projections handle heterogeneous action-vector lengths while preserving the shared semantic space.

**Figure 4.** Action sequence configurations. For a video-action data sample, Cosmos 3 constructs different training modes by varying which tokens are clean and which are noisy. The diagram shows a local temporal window in which action tokens lie between adjacent video tokens: $a_t$ connects $v_{t-1}$ to $v_t$, and $a_{t+1}$ connects $v_t$ to $v_{t+1}$. Forward dynamics mode denoises vision tokens conditioned on clean action tokens; inverse dynamics mode denoises action tokens conditioned on clean vision tokens; and video-action (policy) mode denoises both vision and action tokens. Language and special tokens are omitted for compactness.

**Figure 6.** Illustrative coordinate assignment under 3D MRoPE. Left: A packed token sequence containing language, video (two frames, 2 x 2 spatial grid each), audio, and action tokens. Each token receives a (t, h, w) triplet. Language tokens use t = h = w; video tokens vary on all three axes; action and audio tokens use temporal coordinates only (h = w = 0). A modality offset k separates the text and vision temporal ranges. Right: FPS modulation maps frame indices to scaled temporal positions so that equal real-world durations occupy equal position ranges at 16, 24, and 30 FPS, where 24 FPS is our base frame-per-second.
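One simple way to obtain the property described in the caption is to scale each frame's temporal increment by the ratio of the base frame rate to the clip's frame rate. The formula below is an assumed form chosen to satisfy that property, not a confirmed detail of the paper, and it ignores any temporal compression applied by the tokenizer.

```python
BASE_FPS = 24.0  # base frame rate named in the Figure 6 caption


def temporal_position(frame_index, fps, base_fps=BASE_FPS):
    """Scale the per-frame increment by base_fps / fps (assumed form)."""
    return frame_index * (base_fps / fps)


for fps in (16, 24, 30):
    # Position reached after one second of footage at this frame rate.
    print(fps, temporal_position(frame_index=fps, fps=fps))
```

Under this assumption, one second of video spans about the same position range at 16, 24, and 30 FPS.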

Table 2 lists the three Cosmos 3 variants (Edge, Nano, Super) and their key hyperparameters, illustrating how the same MoT design scales from on‑device to datacenter sizes.

Data Composition and Synthetic Generation

The section details the progressive multi‑stage data curricula for Reasoner and Generator, highlighting synthetic data’s role in long‑tail physical scenarios.

The Reasoner and Generator pathways share a transformer backbone but consume distinct data streams. The Reasoner learns vision‑language understanding, while the Generator learns multimodal synthesis.

SDG‑PhyxSim injects procedurally generated physical scenes into the training mix so the model sees rare interactions that real data rarely cover.

At $t=0$, the cube rests on the ramp; the sphere is above the ramp.

Apply gravity: the sphere falls 0.5 m (distance = ½ $g t^2$) reaching the ramp at $t=0.32$ s.

Collision response pushes the cube 0.2 m up the ramp while the sphere bounces back.

Record RGB frame, depth map, and action vector (forces on each object) for each timestep.

After 1 s the scene stabilizes; the final action vector shows zero net forces.

Even this minimal setup produces a rich multimodal sample (video, depth, segmentation, forces) that real datasets rarely capture, illustrating how synthetic data fills long‑tail physical gaps.
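The impact time in the free-fall step follows from solving $d = \tfrac{1}{2} g t^2$ for $t$. A quick check, assuming standard gravity of 9.81 m/s² (the text only writes $g$):

```python
import math

G = 9.81  # m/s^2, assumed value for g


def fall_time(distance_m, g=G):
    """Time to fall `distance_m` from rest: t = sqrt(2 d / g)."""
    return math.sqrt(2.0 * distance_m / g)


t_hit = fall_time(0.5)
print(f"{t_hit:.2f} s")  # 0.32 s
```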

The Reasoner curriculum starts with a broad vision‑language pre‑training mixture (≈22 M samples) dominated by OCR and 2D grounding, then shifts to a fine‑tuning mix (≈2.2 M samples) where video‑text reaches 50 % to boost spatiotemporal reasoning.

**Figure 7.** Cosmos 3 Reasoner data composition by capability category. We summarize the curated data mixture used to train Cosmos 3 Reasoner across the pre-training and supervised fine-tuning stages. The mixture contains 22.0M pre-training samples and 2.2M supervised fine-tuning samples spanning image-text, video-text, and text-only categories, with each ring showing the relative contribution of major capability streams such as OCR, visual question answering, reasoning, captioning, grounding, and instruction tuning.

The Generator follows a three‑stage curriculum: massive image/video pre‑training, mid‑training that adds high‑quality and synthetic data plus control‑signal video transfer, and post‑training for domain‑specialized experts.

Progressively augment the generation data pool so the model first learns generic visual synthesis, then acquires domain‑specific control and physical realism.

**Figure 8.** Generator data curriculum. Each row is a training mode; each column is a training stage. Colored cells show the number of training samples used at that stage; gray cells (—) indicate the mode is not active. The video row covers text-to-video, image-to-video, and video-to-video continuation; V2V uses clean conditioning-video prefixes and noisy future-video targets. Action and video transfer data are first introduced during mid-training. Mid-training yields the base Cosmos3-Nano and Cosmos3-Super models (shown between the Mid-training and Post-training columns), which then enter post-training. Post-training is conducted independently for each modality, yielding the specialized models listed on the right: Cosmos3-Super-Text2Image, Cosmos3-Super-Image2Video, and Cosmos3-Nano-Policy-DROID. We note that these specialized models share the exact same architecture with their corresponding mid-train models.

Action data is organized into four pillars—egocentric motion, autonomous vehicle, robotics, and camera motion—totaling 8.4 M episodes and 61.3 K hours.

**Figure 9: Action data distribution.** Hours are aggregated over the four main action-data pillars in the final curated action mid-training set, which contains 8.4M episodes and 61.3K hours.

Table 3 breaks down modality counts (Image‑text 18.8 M, Video‑text 1.08 M, Text‑only 0.13 M). Table 4 details robotics data per embodiment, with AgiBot contributing the most hours (4.37 K).

How does SDG‑PhyxSim differ from generic synthetic image generators?

SDG‑PhyxSim produces fully synchronized multimodal streams (RGB, depth, segmentation, and explicit action vectors) derived from a physics simulation, whereas typical image generators only output static pictures without any underlying dynamics or control signals.

Synthetic data, injected via SDG‑PhyxSim, is essential for covering long‑tail physical scenarios that real collections cannot provide.

Training Phases and Curriculum

Cosmos 3 first builds a multimodal Reasoner, then transfers it to a Generator trained with a progressive curriculum.

Cosmos 3 is trained in two phases. First a multimodal Reasoner is built on large‑scale image–text and video–text data; its weights then seed a Generator that learns to synthesize pixels, audio, and actions through a progressive curriculum.

Start with broad multimodal pre‑training, then gradually add higher‑resolution video, action, and transfer data so the model’s capabilities grow step by step.

Why not train the Generator on a single resolution and then upscale?

Training on a single resolution forces the model to learn a narrow distribution of pixel statistics; the progressive schedule preserves diversity across scales and lets the same parameters learn resolution‑specific shift values, which improves fidelity at all tiers.

Pre‑training: jointly train a language model, a ViT encoder, and a multimodal projector on image–text and video–text corpora for two epochs using a no‑replacement sampler.

Enforce a maximum sequence length of 16 k tokens (2048 image tokens, 8192 video tokens) to keep inference latency low.

Apply square‑root normalized per‑token loss weighting to balance short and long sequences.

Fine‑tune: supervised training on a curated Physical AI mixture with a 1:4 pre‑training‑to‑SFT sampling budget, plus an 800 K instruction‑following subset.

Use AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$), weight decay 0.1, cosine decay to 0.1× peak after 1 000 warm‑up steps, and gradient clipping at norm 1.0.
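The learning-rate schedule can be sketched as a plain function of the step. The 1 000 warm-up steps and the 0.1× floor come from the text; the linear warm-up shape, the peak learning rate, and the total step count are assumptions picked for illustration.

```python
import math


def learning_rate(step, peak_lr, total_steps, warmup_steps=1000, floor_ratio=0.1):
    """Linear warm-up (assumed shape) followed by cosine decay to a floor."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = floor_ratio * peak_lr
    return floor + (peak_lr - floor) * cosine


# Hypothetical peak learning rate and training length.
PEAK_LR, TOTAL_STEPS = 1e-4, 10_000
for step in (0, 999, 1000, 5500, 10_000):
    print(step, learning_rate(step, PEAK_LR, TOTAL_STEPS))
```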

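With the image and video budgets fully used (2 048 + 8 192 = 10 240 tokens), the remaining 5 760 tokens of the 16 000‑token cap are occupied by text and positional embeddings.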

During the forward pass the model processes the image tokens first, then interleaves video tokens, respecting the 16 k overall cap.

Gradient clipping at norm 1.0 ensures that any single sample’s loss cannot produce an exploding update, even with the long video segment.

This cap forces the model to learn compact representations for long video streams, which later benefits the Generator’s multi‑resolution training.

Train simultaneously on 256 p, 480 p, and 720 p streams, packing heterogeneous sequences into a fixed 74 000‑token window so the model never sees padding.

Why does the 720 p tier use a larger shift value (s = 5) than the 256 p tier?

Higher resolution images contain more high‑frequency content; a larger shift pushes the diffusion process farther into the noisy regime, giving the model more opportunity to learn fine‑grained reconstruction.

Domain specialization: ingest 15.6 M high‑quality image samples and 74.7 M curated video clips covering robotics, driving, and physics.

Multimodal integration: keep the clean‑prefix/noisy‑target formulation while adding action tokens and control‑signal tokens to the diffusion subsequence.

Action data: train the model to predict future video conditioned on action sequences, and to infer actions from observed trajectories.

Video‑transfer data: provide clean control signals (edge, depth, segmentation) as inputs and denoise the corresponding target video.

Scale the action loss by a factor of 10 to balance its magnitude against visual losses.

For example, a raw action loss of 0.02 multiplied by 10 yields 0.20, comparable to the visual loss magnitude.

Both modalities then contribute gradients of similar scale during back‑propagation.

Gradient clipping at norm 1.0 caps the combined update, preventing instability.

Without scaling, the action signal would be drowned out, leading to poor action generation despite strong visual performance.
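The effect of the scaling factor is easy to see numerically. In this sketch the factor of 10 comes from the text, while the plain weighted sum and the loss values are illustrative assumptions.

```python
ACTION_LOSS_WEIGHT = 10.0  # scaling factor stated in the text


def combined_loss(visual_loss, action_loss, action_weight=ACTION_LOSS_WEIGHT):
    """Weighted sum of the two losses (the plain sum is an assumption)."""
    return visual_loss + action_weight * action_loss


# Hypothetical raw loss values for one training step.
visual, action = 0.20, 0.02
print("unscaled action share:", action / (visual + action))
print("scaled action share:  ", ACTION_LOSS_WEIGHT * action / combined_loss(visual, action))
```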

Blend a clean latent with Gaussian noise, then train a denoiser to predict the constant velocity that would move the noisy sample back to the clean target.

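With a clean latent $x_0 = 2$, a noise sample $\epsilon = 5$, and noise level $\sigma = 0.7$, compute $x_\sigma = (1-\sigma)\,x_0 + \sigma\,\epsilon = 0.3\times2 + 0.7\times5 = 0.6 + 3.5 = 4.1$.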

The denoiser’s target velocity is $\epsilon=5$, so the loss is $(v_\theta - 5)^2$ masked on non‑conditioning tokens.

For $t=0.5$, the shifted noise level $s(0.5)=0.5/(1+0.5)=0.333$, confirming a moderate noise regime.

Even a modest $\sigma$ injects enough noise to make the denoising task non‑trivial, while the shifted schedule ensures later steps focus on finer details.

Why mask the MSE loss on conditioning tokens instead of training the model to reconstruct them?

Conditioning tokens already provide the correct information; penalizing them would force the model to waste capacity learning an identity mapping, reducing its ability to model the truly unknown noisy portion.
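The blending step and the masked loss can be sketched together. This is a minimal illustration under stated assumptions: the linear blend matches the arithmetic in the worked example, the regression target is passed in by the caller rather than fixed by the sketch, and both function names are hypothetical.

```python
def blend(clean, noise, sigma):
    """Linear interpolation between clean latents and Gaussian noise."""
    return [(1.0 - sigma) * c + sigma * n for c, n in zip(clean, noise)]


def masked_mse(prediction, target, is_conditioning):
    """Mean squared error over non-conditioning tokens only."""
    errors = [
        (p - t) ** 2
        for p, t, cond in zip(prediction, target, is_conditioning)
        if not cond
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Values from the worked example: clean = 2, noise = 5, sigma = 0.7.
x_sigma = blend([2.0], [5.0], 0.7)
print(x_sigma)

# The first token is clean conditioning, so its error is ignored.
loss = masked_mse([1.0, 4.0], [0.0, 5.0], [True, False])
print(loss)  # 1.0
```

Dropping conditioning tokens from the average means gradients come only from positions the model actually has to denoise.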

**Figure 10.** Left: Multi-resolution training and sequence packing. The three resolution tiers (256p, 480p, 720p) differ in their maximum frame budget, eligible source material, and rectified-flow noise-shift value; variable-length sequences from different tiers are packed together to fill a fixed 74,000-token context window, maximizing GPU utilization without padding. Right: Data mixture used in generator pre-training. We use joint image-video training, with videos sampled 80% of the time and images the remaining 20%. Within each split, we train at multiple resolutions: 256p, 480p, and 720p. For video batches, we additionally sample uniformly among three conditioning modes—text-to-video, image-to-video, and video-to-video. The exact data mixture is shown in the right panel.

**Table 5.** **Image/Video Model Specifications.** Supported configurations for image and video modalities. Each row shows the FPS range, frame counts (video only), and image/video dimensions (w, h) for the five supported aspect ratios at each resolution.

The table outlines the distribution of training streams and their corresponding modes or conditioning techniques, along with their respective shares.

Infrastructure and Data Loading

This section describes the unified infrastructure and the Joint Data‑Loader that packs multimodal samples efficiently.

Heterogeneous multimodal samples produce token counts that vary by two orders of magnitude, causing massive padding waste, workload imbalance, and NCCL timeouts in naïve distributed training.

The loader unifies per‑stream queues into a single rank‑synchronous batch, greedily packing samples until a strict token budget is hit, then using a short look‑ahead to fill any leftover capacity.

Start with an empty packed sequence (0 tokens used), a 12‑token budget, and a cap of $N_{\text{max}} = 4$ samples per packed sequence.

Take the first sample (length 5) → packed tokens = 5 (still ≤ 12).

Next sample (length 3) fits → packed tokens = 8.

Third sample (length 5) would exceed the budget (8 + 5 > 12), so it is moved to the look‑aside buffer.

Scan forward: the next sample in Stream A is length 5 (still too large) → skip.

Switch to Stream B (because Stream A is exhausted); first sample length 2 fits → packed tokens = 10.

Next Stream B sample length 6 exceeds remaining budget (10 + 6 > 12) → look‑aside.

Final Stream B sample length 1 fits → packed tokens = 11 (≤ 12) and sample count = 4 (reaches $N_{\text{max}}$).

Iteration ends; look‑aside buffer now holds the skipped samples (the two length‑5 samples from A and the length‑6 sample from B), which are returned to the head of their respective stream buffers for the next iteration.

Look‑ahead packing salvages otherwise wasted token capacity: in the example it raises the packed total from 8 to 11 of the 12 budgeted tokens without breaking deterministic rank‑synchrony.
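A greedy packer with a look-aside buffer can be sketched as follows. This is a simplified single-process illustration, not the Joint Data-Loader itself: it tracks only sample lengths, scans each stream to its end rather than using a bounded look-ahead, ignores rank synchronization, and uses illustrative lengths chosen so that some samples overflow a 12-token budget.

```python
from collections import deque


def pack_batch(streams, token_budget, max_samples):
    """Greedily pack sample lengths from `streams` (dict: name -> deque).

    Samples that do not fit go to a per-stream look-aside list and are
    returned to the head of their stream once that stream has been scanned.
    """
    packed, used = [], 0
    for name, queue in streams.items():
        skipped = []  # look-aside buffer for this stream
        while queue and len(packed) < max_samples:
            length = queue.popleft()
            if used + length <= token_budget:
                packed.append((name, length))
                used += length
            else:
                skipped.append(length)
        # Put skipped samples back at the head, preserving original order.
        queue.extendleft(reversed(skipped))
        if len(packed) >= max_samples:
            break
    return packed, used


# Illustrative sample lengths for two hypothetical streams.
streams = {"A": deque([5, 3, 5, 5]), "B": deque([2, 6, 1])}
packed, used = pack_batch(streams, token_budget=12, max_samples=4)
print(packed, used)
print({name: list(queue) for name, queue in streams.items()})
```

Stopping at the first sample that does not fit would leave this batch at 8 tokens; continuing the scan brings it to 11 while the skipped samples stay first in line for the next batch.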

Questions & answers

What is the main contribution of Cosmos 3?

Cosmos 3 introduces a single Mixture-of-Transformers (MoT) backbone that treats language, image, video, audio, and action as a single interleaved token stream, replacing the fragmented stack of specialized models—VLMs, video generators, and action models—previously required for Physical AI pipelines.

What problem does Cosmos 3 address?

Current Physical AI pipelines stitch together separate vision, language, and action models, which is computationally inefficient, creates inconsistent representations, and requires costly data pipelines. Cosmos 3 eliminates this fragmentation by unifying all modalities in one backbone.

Why is a unified omnimodal model important for Physical AI?

Training Physical AI agents directly in the real world is prohibitively slow, costly, and risky, so agents rely on simulated environments. A unified backbone like Cosmos 3 allows the same network to handle understanding, generation, and world-action modes, simplifying the development of embodied agents.

How does the Mixture-of-Transformers (MoT) architecture work?

MoT uses a dual-tower design that routes autoregressive (AR) reasoning tokens and diffusion-based generation tokens through independent parameter sets, while the two towers interact via a dual-stream joint attention mechanism so the generator can condition on the reasoner's latent state without the reasoner being updated by the generator.

Why does Cosmos 3 use separate parameter towers instead of a single monolithic transformer?

AR reasoning requires causal attention to prevent information leakage during sequential token generation, while diffusion denoising benefits from full bidirectional attention to reconstruct corrupted tokens. Sharing parameters would force a harmful trade-off between these two inductive biases; separate towers preserve each bias while still letting the diffusion stream attend to the AR stream through the shared attention operator.

How does Cosmos 3 handle different physical embodiments such as robots versus vehicles?

The model uses domain-specific linear projection layers that map heterogeneous control inputs—such as a vehicle's ego pose or a robot's ego, effector, and grasp states—into a common 64-dimensional latent space, allowing the model to learn shared geometric priors while maintaining embodiment-specific semantics.

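How does Cosmos 3 handle the different temporal resolutions of video, audio, and action data?

The model uses Absolute Temporal Modulation, which aligns tokens from different modalities onto a shared physical temporal axis by scaling the temporal index increment based on the specific sampling rate (TPS) of each modality.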

What is 3D MRoPE and how does it differ from standard 1D RoPE?

Standard RoPE encodes a single scalar position per token, sufficient when only temporal order matters. 3D MRoPE adds two spatial axes, allowing the model to distinguish tokens that are temporally aligned but spatially separated (e.g., different image patches) and to treat modalities with only temporal information as points in the spatial plane.

What generation modes does Cosmos 3 support?

Cosmos 3 supports three generation modes defined by which tokens are treated as clean versus noisy: forward dynamics (denoising vision conditioned on clean actions), inverse dynamics (denoising actions conditioned on clean vision), and policy mode (denoising both vision and actions simultaneously).

What data does Cosmos 3 use for training?

The Reasoner is trained on approximately 22 million samples for pre-training (dominated by OCR and 2D grounding) and approximately 2.2 million samples for fine-tuning. The Generator follows a three-stage curriculum including image/video pre-training, mid-training with synthetic data, and domain-specialized post-training. Action data totals 8.4 million episodes and 61,300 hours across four pillars: egocentric motion, autonomous vehicle, robotics, and camera motion.

What is SDG-PhyxSim and why is it used?

SDG-PhyxSim is a physics simulation-based synthetic data generator that produces fully synchronized multimodal streams including RGB, depth, segmentation, and explicit action vectors. Unlike generic image generators that output only static pictures, SDG-PhyxSim provides underlying dynamics and control signals, and is used to cover long-tail physical scenarios that real-world data collections cannot provide.

How is Cosmos 3 evaluated?

The Reasoner is evaluated on 48 benchmarks grouped into General, Robotics, Smart Infrastructure, and Driving domains using VLMEvalKit, comparing Edge, Nano, and Super variants against open-source and closed-source competitors. The Generator is assessed across five evaluation families: Image Generation (UniGenBench, CVTG, HPSv3, Aesthetic V2), Video Generation (PAIBench-G, RBench, Physics-IQ), Audio-Visual Generation (Cosmos-SoundBench), Transfer Generation (PAIBench-C, AVBench-C), and Action Generation.

What are the three model variants of Cosmos 3?

Cosmos 3 comes in three variants (Edge, Nano, and Super), which scale the same MoT design from on-device to datacenter sizes, as detailed in Table 2 of the paper. Exact parameter counts for each variant are not reproduced here.

What key results does Cosmos 3 achieve?

The paper claims Cosmos 3 establishes a new state-of-the-art across understanding and generation tasks, enabling a single model to function as a VLM, world simulator, and policy model. Specific numerical benchmark scores are reported in Table 10 and the related evaluation sections of the paper; individual metric values are not reproduced here.

How does Cosmos 3 compare to prior omnimodal and Physical AI models?

Prior omnimodels such as Unified-IO 2, Chameleon, and GPT-4o focus on text-image or general media generation and leave physical dynamics and action conditioning underexplored. Cosmos 3 extends the omnimodal paradigm by pairing an autoregressive reasoner tower with a diffusion generator tower and explicitly incorporating action as a first-class modality alongside audio.

How does Cosmos 3 relate to prior action modeling systems like RT-2, GAIA-1, and VPT?

Prior systems specialize in one action modeling mode—GAIA-1 and DriveDreamer for forward dynamics, VPT for inverse dynamics, and RT-1/RT-2 and OpenVLA for policy learning. Cosmos 3 unifies all three as conditioning patterns of a single multimodal backbone without requiring separate architectures.
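One way to picture "conditioning patterns of a single backbone" is as a table that says which streams are supplied clean and which are noised and generated. The sketch below is illustrative only: the stream names (`observed_frames`, `future_frames`, `actions`), the `Mode` enum, and `denoise_mask` are hypothetical and do not come from the paper, and the policy row assumes the current observation is the only visual conditioning.

```python
from enum import Enum


class Mode(Enum):
    FORWARD_DYNAMICS = "forward_dynamics"
    INVERSE_DYNAMICS = "inverse_dynamics"
    POLICY = "policy"


# For each mode: streams given clean (conditioning) vs. streams that are denoised.
# Stream names are hypothetical labels chosen for this sketch.
CONDITIONING = {
    # Known actions in, future observations out.
    Mode.FORWARD_DYNAMICS: {
        "condition": {"observed_frames", "actions"},
        "denoise": {"future_frames"},
    },
    # Before-and-after frames in, the actions that explain the transition out.
    Mode.INVERSE_DYNAMICS: {
        "condition": {"observed_frames", "future_frames"},
        "denoise": {"actions"},
    },
    # Observations and actions are denoised together.
    Mode.POLICY: {
        "condition": {"observed_frames"},
        "denoise": {"future_frames", "actions"},
    },
}


def denoise_mask(mode, streams):
    """Return one boolean per stream: True if that stream is noised and generated."""
    targets = CONDITIONING[mode]["denoise"]
    return [name in targets for name in streams]


streams = ["observed_frames", "future_frames", "actions"]
for mode in Mode:
    print(mode.value, denoise_mask(mode, streams))
```

Under these assumptions the three modes differ only in the mask, not in the model that consumes it, which is the sense in which no separate architecture is required.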

What are the limitations or open challenges acknowledged by Cosmos 3?

The text summarized here does not enumerate limitations explicitly. It does acknowledge that existing pipelines are fragmented and that long-tail physical scenarios require synthetic data coverage via SDG-PhyxSim, which implies that real-world data alone is insufficient. Failure modes, safety considerations, and generalization bounds are not discussed.

How does the training infrastructure handle heterogeneous multimodal data?

Heterogeneous multimodal samples produce token counts that vary by two orders of magnitude, causing padding waste, workload imbalance, and NCCL timeouts in naive distributed training. The paper introduces a Joint Data-Loader that sustains high GPU utilization while respecting the multimodal token budget and preserving deterministic distributed semantics.
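The provided text does not describe how the Joint Data-Loader packs samples. As a rough illustration of token-budget packing in general, the following sketch uses first-fit-decreasing bin packing, a standard heuristic swapped in here as an assumption; the function names and the example token counts are hypothetical.

```python
def pack_by_token_budget(token_counts, budget):
    """Greedy first-fit-decreasing packing (an assumed heuristic, not the paper's).

    Returns a list of bins; each bin is a list of sample indices whose
    token counts sum to at most `budget`.
    """
    # Sort by size (largest first), breaking ties by index so the result is reproducible.
    order = sorted(range(len(token_counts)), key=lambda i: (-token_counts[i], i))
    bins = []   # sample indices per bin
    loads = []  # tokens already used per bin
    for i in order:
        n = token_counts[i]
        if n > budget:
            raise ValueError(f"sample {i} has {n} tokens, over budget {budget}")
        for b, used in enumerate(loads):
            if used + n <= budget:
                bins[b].append(i)
                loads[b] += n
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([i])
            loads.append(n)
    return bins


def padding_fraction(token_counts, bins, budget):
    """Fraction of the allocated token slots that would be padding."""
    used = sum(token_counts[i] for b in bins for i in b)
    return 1.0 - used / (len(bins) * budget)


# Hypothetical token counts spanning roughly two orders of magnitude.
counts = [64, 4096, 512, 8192, 128, 2048]
bins = pack_by_token_budget(counts, budget=8192)
print(bins)
print(padding_fraction(counts, bins, 8192))
```

For these example counts the packing needs two bins with about 8% padding, whereas padding every sample to the longest one (8192 tokens) would waste about 69% of the slots. Sorting on `(token count, index)` makes the packing reproducible for a given input.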

Who produced Cosmos 3 and where was it published?

The paper is titled 'Cosmos 3: Omnimodal World Models for Physical AI' and is available on arXiv at arxiv.org/abs/2606.02800. It is credited to NVIDIA, followed by a long list of individual contributors that begins with Aditi, Niket Agarwal, and Arslan Ali. No conference venue is specified.

How does Cosmos 3 handle prompt input for generation tasks?

Cosmos 3 uses a prompt-upsampling pipeline, described in Appendix B, in which LoRA-fine-tuned Qwen3-VL-8B models translate sparse user-level requests into the full structured JSON schemas required by the Generator. Separate upsamplers handle text-to-image, image-to-video, and general Reasoner-based upsampling.

Key terms

Physical AI: AI systems designed to perceive, reason about, and act within the physical world, typically encompassing embodied agents such as robots and autonomous vehicles.
Mixture-of-Transformers (MoT): An architecture that routes different types of tokens (e.g., reasoning vs. generation) through separate sets of transformer parameters while allowing information exchange via shared attention, avoiding the parameter interference of a single monolithic transformer.
Omnimodal: Capable of jointly processing and generating multiple modalities—such as language, image, video, audio, and action—within a single unified model.
Autoregressive (AR) reasoning: A token generation approach where each output token is predicted sequentially based on all previously generated tokens, requiring causal (left-to-right) attention to prevent information leakage.
Diffusion-based generation: A generative approach that learns to iteratively denoise a corrupted (noisy) representation back into a clean output, benefiting from bidirectional attention to access full context during reconstruction.
Dual-stream joint attention: An attention mechanism that lets the diffusion generator tower condition on the AR reasoner tower's latent state while the two towers keep separate parameter sets; the reasoner is not updated by the generator.
Absolute Temporal Modulation: A technique that aligns tokens from modalities with different sampling rates (e.g., video at 30 fps vs. audio at a higher rate) onto a single shared physical time axis by scaling each modality's temporal index increment by its tokens-per-second (TPS) rate.
3D MRoPE (3D Multimodal Rotary Position Embedding): An extension of standard 1D rotary position encoding that adds two spatial axes, enabling the model to encode both temporal position and spatial location (height and width) of tokens such as image patches.
RoPE (Rotary Position Embedding): A positional encoding method for transformers that encodes token position by rotating query and key vectors, allowing the model to represent relative distances between tokens.
Forward dynamics: A generation mode in which the model predicts future visual observations conditioned on known action inputs.
Inverse dynamics: A generation mode in which the model infers the actions that caused a given visual transition, conditioned on observed before-and-after frames.
Policy mode: A generation mode in which the model simultaneously denoises both visual observations and actions, effectively acting as a full policy that maps goals to actions.
SDG-PhyxSim: A physics simulation-based synthetic data generator used by Cosmos 3 that produces synchronized multimodal streams—including RGB, depth, segmentation, and explicit action vectors—to cover long-tail physical scenarios.
Vision-Language Model (VLM): A model that combines visual perception with language understanding, enabling tasks such as image captioning, visual question answering, and scene description.
Vision-Language-Action (VLA) model: A model that extends VLMs to also predict physical actions, enabling embodied agents to act based on visual and language inputs.
Joint Data-Loader: A custom data-loading infrastructure in Cosmos 3 that packs heterogeneous multimodal samples to respect a token budget, maintain GPU utilization, and preserve deterministic distributed training semantics.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that inserts small trainable low-rank matrices into a pre-trained model, allowing task-specific adaptation without updating all original weights.
VLMEvalKit: An evaluation toolkit used in the paper to benchmark vision-language model performance across multiple standardized tasks and datasets.
Prompt upsampling: A pipeline step in Cosmos 3 that automatically expands sparse user prompts into detailed structured JSON schemas required by the Generator, using a fine-tuned language model.
Causal attention: An attention mechanism in which each token can only attend to itself and previous tokens, preventing the model from seeing future information during sequential generation.
Bidirectional attention: An attention mechanism in which each token can attend to all other tokens in the sequence, providing full context useful for reconstruction tasks like diffusion denoising.
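
The causal, bidirectional, and dual-stream attention entries above can be combined into a single boolean mask. The sketch below is a minimal illustration, assuming reasoner tokens come first in the sequence and generator tokens follow, and assuming information flows one way from the reasoner to the generator as the Paper Primer describes; `joint_attention_mask` is a hypothetical name, not the paper's implementation.

```python
def joint_attention_mask(n_reason, n_gen):
    """Build mask[q][k], True when query token q may attend to key token k.

    Layout assumption: reasoner tokens occupy positions [0, n_reason),
    generator tokens occupy [n_reason, n_reason + n_gen).
    """
    n = n_reason + n_gen
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_reason:
                # Reasoner query: causal. Since k <= q < n_reason,
                # it never attends to generator tokens.
                mask[q][k] = k <= q
            else:
                # Generator query: attends to every reasoner token and,
                # bidirectionally, to every generator token.
                mask[q][k] = True
    return mask


for row in joint_attention_mask(n_reason=3, n_gen=2):
    print("".join("1" if allowed else "0" for allowed in row))
```

With three reasoner tokens and two generator tokens, the top-left block is lower-triangular (causal), the top-right block is all zeros (the reasoner never reads generator tokens), and the bottom two rows are all ones (the generator sees the full context).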
