Cosmos 3: Omnimodal World Models for Physical AI
NVIDIA, Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova
Cosmos 3 unifies language, vision, audio, and action into a single Mixture-of-Transformers architecture for Physical AI.
How can a single unified model architecture jointly process and generate language, vision, audio, and action sequences to serve as a foundation for Physical AI?
Physical AI agents currently rely on a fragmented stack of specialized models—Vision-Language Models for reasoning, video generators for simulation, and action models for control—which is computationally inefficient and prevents shared world representations. Cosmos 3 introduces a unified Mixture-of-Transformers architecture that treats language, image, video, audio, and action as a single, interleaved token stream, using separate parameter towers for reasoning and generation while sharing a common latent space. This omnimodal approach establishes a new state-of-the-art across understanding and generation tasks, enabling a single model to function as a VLM, world simulator, and policy model without architectural changes.
Paper Primer
The core mechanism is a dual-tower Mixture-of-Transformers (MoT) design that routes autoregressive reasoning tokens and diffusion-based generation tokens through independent parameter sets. These towers interact via a dual-stream joint attention mechanism, allowing the generator to condition its output on the reasoner's latent state without the reasoner being updated by the generator.
To handle diverse physical embodiments, the model maps heterogeneous control inputs—such as robot gripper states or vehicle steering—into a unified action interface using domain-specific projection layers. This allows the model to learn shared geometric priors across different physical domains while maintaining embodiment-specific semantics.
Cosmos 3 achieves state-of-the-art performance across a diverse suite of Physical AI benchmarks.
The model was ranked as the top open-source Text-to-Image and Image-to-Video model by Artificial Analysis and the leading policy model by RoboArena at the time of the report, reflecting top-tier performance across both generative and embodied reasoning tasks.
Why use a Mixture-of-Transformers (MoT) instead of a single monolithic transformer?
The MoT architecture separates the distinct computational requirements of autoregressive reasoning (which requires causal attention) and iterative denoising generation (which benefits from full bidirectional attention), allowing the model to optimize for both understanding and synthesis without parameter interference.
How does the model handle the different temporal resolutions of video, audio, and action data?
The model uses Absolute Temporal Modulation, which aligns tokens from different modalities onto a shared physical temporal axis by scaling the temporal index increment based on the specific sampling rate (TPS) of each modality.
Researchers can now replace fragmented pipelines with a single, omnimodal backbone that learns shared world dynamics, significantly simplifying the development of embodied agents that require both reasoning and physical interaction.
Abstract
Introducing a unified omnimodal model to close the gap between fragmented modality‑specific systems.
Current Physical AI pipelines stitch together separate vision, language, and action models, incurring costly data pipelines and inconsistent representations. Cosmos 3 proposes a single Mixture‑of‑Transformers backbone that can ingest and emit any combination of language, image, video, audio, or action tokens, eliminating the need for task‑specific model composition.
The Need for Omnimodal World Models
Physical AI needs a unified, scalable model to overcome real‑world training bottlenecks.
Training Physical AI agents directly in the real world is prohibitively slow, costly, and risky, so researchers rely on simulated environments to obtain safe, scalable learning.
Physical AI agents perceive, reason about, and act within the physical world, which requires both accurate perception and forward-looking planning.
Existing pipelines treat perception and simulation as separate modules—Vision‑Language Models (VLMs) for understanding, video‑generation or forward‑dynamics models for simulation, and Vision‑Language‑Action or World‑Action models for action prediction—leading to fragmented, compute‑heavy systems.
Cosmos 3 unifies these pillars in a single Mixture‑of‑Transformers backbone, jointly modeling language, image, video, audio, and action so the same network can switch between understanding, generation, and world‑action modes without architectural changes.
**Figure 1.** Cosmos 3 serves as a general-purpose backbone for Physical AI. By jointly modeling language, image, video, audio, and action for both understanding and generation, Cosmos 3 serves as a single network architecture for various model classes, including vision-language models, image generation models, audio-visual generation models, policy or world-action models, forward dynamics models, and inverse dynamics models.
**Figure 2.** Cosmos 3 offers a strong starting point for training Physical AI agents. Cosmos 3 can be post-trained on target data for distinct applications without architectural modifications. In this paper, we demonstrate how we post-train Cosmos 3 for better synthetic data generation (Sec. 4.2.3 and Sec. 4.2.4) and better robot policy (Sec. 4.2.5). In the future, we expect Cosmos 3 to play an essential role in generating high-quality, complex environments for training Physical AI agents.
The primary bottleneck for Physical AI agents is the cost and risk of real‑world training; a unified, omnimodal backbone like Cosmos 3 offers a scalable route to acquire both understanding and generation in simulation.
The Mixture-of-Transformers Architecture
Cosmos 3 packs multimodal inputs into a shared token stream and processes them with a dual‑tower transformer.
Cosmos 3 embeds language, vision, audio, and action into a common latent space and feeds the packed sequence to a Mixture‑of‑Transformers backbone that separates reasoning from generation.
MoT splits the transformer into a Reasoner tower for autoregressive tokens and a Generator tower for diffusion tokens, letting each specialize while the two meet at a shared self-attention operator.
How does MoT differ from a single transformer that processes all tokens together?
In a single transformer every token shares the same feed-forward and attention parameters, so the model must learn a compromise between causal prediction and bidirectional denoising. MoT gives each task its own parameters, preserving the appropriate inductive bias for each while still allowing the two streams to exchange information through the shared self-attention operator.
The Reasoner tower consumes the autoregressive subsequence, producing next‑token predictions for language and vision while respecting causality.
Why not reuse the same parameters for both AR and diffusion tokens?
AR prediction benefits from a causal bias that prevents information leakage, whereas diffusion denoising needs full context to reconstruct corrupted tokens. Sharing parameters would force a trade-off that harms both tasks; separate towers keep each bias intact while still letting diffusion tokens attend to the autoregressive context through the shared attention.
The Generator tower handles the diffusion subsequence, iteratively denoising noisy tokens to reconstruct images, video, audio, and actions.
Why does the Generator need bidirectional attention while the Reasoner stays causal?
The Generator must condition on the entire context (including future AR tokens) to correctly reconstruct a noisy latent; bidirectional attention provides that full view. The Reasoner, by contrast, must generate tokens sequentially, so a causal mask is required to avoid cheating.
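The masking rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `mot_attention_mask` and the convention that AR tokens come first in the packed sequence are assumptions made for the example (the ordering follows the token-arrangement rule stated later in this section).

```python
def mot_attention_mask(n_ar: int, n_dm: int) -> list[list[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical helper): the first n_ar positions are
    autoregressive (AR) tokens, the next n_dm positions are diffusion (DM) tokens.
    """
    n = n_ar + n_dm
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_ar:
                # AR queries: causal, and restricted to AR keys only
                # (k <= q < n_ar guarantees the key is an AR token).
                mask[q][k] = k <= q
            else:
                # DM queries: full (bidirectional) view of AR and DM keys.
                mask[q][k] = True
    return mask


mask = mot_attention_mask(n_ar=3, n_dm=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is lower-triangular in the AR block, empty in the AR-to-DM block, and full in the DM rows, which is why the Generator can condition on the Reasoner while the Reasoner stays self-contained.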
Action tokens encode the physical transition between consecutive video states as a compact pseudo‑action built from relative pose and grasp components.
Compute $z = W^{\text{in}}_{\text{robot}}\, x$ → a 64‑dimensional latent vector (each entry is a linear combination of the 9 input components).
The Generator tower processes $z$ through several attention layers, producing an updated latent $z'$.
Decode $x' = W^{\text{out}}_{\text{robot}}\, z' + b^{\text{out}}_{\text{robot}}$, yielding a reconstructed translation of $(0.099, -0.051, 0.001)$ m and negligible rotation.
Even with a tiny toy example, the linear projections learn to preserve the essential geometric information while allowing the shared backbone to operate on a uniform dimensionality.
Why use a domain‑specific linear projection instead of a single shared one?
Different embodiments have heterogeneous action vector lengths and semantics (e.g., a vehicle has only ego pose, a robot has ego + effector + grasp). Separate projections let each domain map its raw vector into the common 64‑dim latent space without wasting capacity on padding or discarding information.
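A rough sketch of domain-specific projections, assuming plain linear maps with random weights. The 64-dimensional latent and the 9-component robot vector come from the example above; the `hand` embodiment with 24 components, the helper names, and the initialization are hypothetical and only serve to show two different input lengths landing in the same latent width.

```python
import random

LATENT_DIM = 64                             # shared latent width from the text
ACTION_DIMS = {"robot": 9, "hand": 24}      # "hand" size is a made-up example


def make_weights(out_dim: int, in_dim: int, rng: random.Random) -> list[list[float]]:
    """Random (untrained) weight matrix, scaled by 1/sqrt(in_dim)."""
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


rng = random.Random(0)
W_IN = {d: make_weights(LATENT_DIM, n, rng) for d, n in ACTION_DIMS.items()}
W_OUT = {d: make_weights(n, LATENT_DIM, rng) for d, n in ACTION_DIMS.items()}
B_OUT = {d: [0.0] * n for d, n in ACTION_DIMS.items()}


def encode_action(domain: str, x: list[float]) -> list[float]:
    """z = W_in[domain] @ x: raw action vector -> shared latent."""
    if len(x) != ACTION_DIMS[domain]:
        raise ValueError(f"{domain} expects {ACTION_DIMS[domain]} components")
    return matvec(W_IN[domain], x)


def decode_action(domain: str, z: list[float]) -> list[float]:
    """x' = W_out[domain] @ z + b_out[domain]: shared latent -> raw action vector."""
    return [v + b for v, b in zip(matvec(W_OUT[domain], z), B_OUT[domain])]


z_robot = encode_action("robot", [0.1, -0.05, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
z_hand = encode_action("hand", [0.0] * 24)
print(len(z_robot), len(z_hand), len(decode_action("robot", z_robot)))
```

Both embodiments produce latents of the same width, so the shared backbone never needs to know which raw action format it started from.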
3D MRoPE assigns each token a triplet $(t, h, w)$ so that attention can respect temporal and spatial proximity across modalities.
Language token $i=0$: set $t=h=w=0$.
Video patch tokens $i=1\ldots4$: $t=1$ for all, $h,w$ take values $(0,0), (0,1), (1,0), (1,1)$.
Audio token $i=5$: $t=2$, $h=w=0$.
Action token $i=6$: $t=3$, $h=w=0$.
Even with a minimal example, the triplet encoding cleanly separates modalities: only the video tokens share spatial coordinates, while all others collapse to a single point in the $h\!-\!w$ plane.
How does 3D MRoPE differ from the standard 1D RoPE used for pure language models?
Standard RoPE encodes a single scalar position per token, which suffices when only temporal order matters. 3D MRoPE adds two spatial axes, allowing the model to distinguish tokens that are temporally aligned but spatially separated (e.g., different image patches), and to treat modalities with only temporal information as points in the spatial plane.
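The toy coordinate assignment above can be reproduced with a short sketch. It follows only the simplified rules of that example (each segment advances the temporal index, video patches share a frame's `t`), and it deliberately ignores the modality offset and FPS modulation; `assign_positions` and the segment format are hypothetical names chosen for illustration.

```python
def assign_positions(segments):
    """Assign (t, h, w) triplets to a packed sequence.

    segments: list of (modality, spec) where spec is a token count for
    "language", "audio", and "action", or (frames, height, width) for "video".
    """
    positions = []
    t = 0
    for modality, spec in segments:
        if modality == "video":
            frames, height, width = spec
            for _ in range(frames):
                # All patches of one frame share t and differ in (h, w).
                for h in range(height):
                    for w in range(width):
                        positions.append((t, h, w))
                t += 1
        elif modality == "language":
            for _ in range(spec):
                # Language tokens use t = h = w.
                positions.append((t, t, t))
                t += 1
        elif modality in ("audio", "action"):
            for _ in range(spec):
                # Temporal-only modalities collapse to h = w = 0.
                positions.append((t, 0, 0))
                t += 1
        else:
            raise ValueError(f"unknown modality: {modality}")
    return positions


toy = [("language", 1), ("video", (1, 2, 2)), ("audio", 1), ("action", 1)]
for index, triplet in enumerate(assign_positions(toy)):
    print(index, triplet)
```

Running it on the toy sequence yields seven triplets: one for the language token, four sharing `t = 1` for the video patches, and one each for audio and action.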
Token arrangement follows a simple rule: the autoregressive subsequence appears first, then the diffusion subsequence; within each subsequence, modalities are ordered vision → audio → action, and clean conditioning tokens precede their noisy counterparts.
Generation modes are defined by which tokens are treated as clean versus noisy. Forward dynamics denoises vision conditioned on clean actions; inverse dynamics denoises actions conditioned on clean vision; the policy mode denoises both simultaneously.
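One way to picture the three modes is as a lookup from mode name to the set of modalities that receive noise. The sketch below is illustrative only; the mode keys and the `noise_mask` helper are hypothetical names, and language and special tokens are left out.

```python
# Which modalities are noised (and therefore denoised) in each mode.
NOISY_MODALITIES = {
    "forward_dynamics": {"vision"},          # clean actions -> predict vision
    "inverse_dynamics": {"action"},          # clean vision -> predict actions
    "policy": {"vision", "action"},          # denoise both jointly
}


def noise_mask(mode: str, modalities: list[str]) -> list[bool]:
    """Return True for tokens that are noisy targets, False for clean conditioning."""
    noisy = NOISY_MODALITIES[mode]
    return [m in noisy for m in modalities]


sequence = ["vision", "action", "vision", "action"]
for mode in NOISY_MODALITIES:
    print(mode, noise_mask(mode, sequence))
```

The same data sample therefore yields three different training tasks without any change to the network, only to which tokens are perturbed.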
**Figure 5.** Mixture-of-Transformers (MoT) architecture of Cosmos 3. Left: a single transformer operates on one token sequence comprising the autoregressive (AR) and diffusion (DM) subsequences: AR carries discrete text tokens and, optionally, ViT-encoded vision tokens, ending with <EOS> and a begin-of-generation token <BOG>, while DM carries continuous tokens from their respective encoders, noise-perturbed during training. Here we visualize all input tokens as noisy for simplicity; for generation modes such as image-to-video or video transfer, clean conditioning tokens precede the noisy targets within DM; see Sec. 2.2.2. Within each transformer block, AR tokens and DM tokens are processed by independent LayerNorms and MLPs (all co-initialized from a pre-trained VLM) and meet only at a shared self-attention operator. Let Q, K, and V be query, key, and value vectors in attention, where the subscript indicates which tower it is in. `Q_AR` attends causally over `K_AR`, `V_AR` only, while `Q_DM` attends bidirectionally over the concatenated [`K_AR`; `K_DM`] and [`V_AR`; `V_DM`]. In this way, diffusion is conditioned on the AR context, while AR remains autoregressively self-contained. Outputs are next-token predictions for Reasoner and denoised tokens for Generator (trained in practice with a flow-matching objective predicting velocity; we show the clean target here for clarity). Right: the attention mask, causal for AR and full for diffusion.
**Figure 3.** **Unified action representation.** We map heterogeneous embodiment controls into compact action vectors built from shared geometric components. Ego and effector motions are encoded as relative-pose pseudo-actions using 3D translation and 6D rotation (an over-parameterized rotation representation by Zhou et al. (2019), as the degree of freedom of rotation is 3), while grasp states directly encode the current manipulation state, such as fingertip positions for hands or gripper open/close values for robots. Domain-aware input and output projections handle heterogeneous action-vector lengths while preserving the shared semantic space.
**Figure 4.** Action sequence configurations. For a video-action data sample, Cosmos 3 constructs different training modes by varying which tokens are clean and which are noisy. The diagram shows a local temporal window in which action tokens lie between adjacent video tokens: $a_t$ connects $v_{t-1}$ to $v_t$, and $a_{t+1}$ connects $v_t$ to $v_{t+1}$. Forward dynamics mode denoises vision tokens conditioned on clean action tokens; inverse dynamics mode denoises action tokens conditioned on clean vision tokens; and video-action (policy) mode denoises both vision and action tokens. Language and special tokens are omitted for compactness.
**Figure 6.** Illustrative coordinate assignment under 3D MRoPE. Left: A packed token sequence containing language, video (two frames, 2 x 2 spatial grid each), audio, and action tokens. Each token receives a (t, h, w) triplet. Language tokens use t = h = w; video tokens vary on all three axes; action and audio tokens use temporal coordinates only (h = w = 0). A modality offset k separates the text and vision temporal ranges. Right: FPS modulation maps frame indices to scaled temporal positions so that equal real-world durations occupy equal position ranges at 16, 24, and 30 FPS, where 24 FPS is our base frame-per-second.
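The FPS modulation in Figure 6 can be sketched under the assumption that it is a simple linear rescaling of the frame index by the ratio of the base rate to the actual rate. The caption only states the property that equal durations map to equal position ranges; the exact formula and the name `temporal_position` are assumptions.

```python
BASE_FPS = 24.0  # base frame rate stated in the Figure 6 caption


def temporal_position(frame_index: int, fps: float, base_fps: float = BASE_FPS) -> float:
    """Map a frame index to a position on a shared temporal axis.

    Assumed rule: one second of content always spans `base_fps` position units,
    whatever the source frame rate.
    """
    return frame_index * base_fps / fps


# One second of video at each rate ends at the same temporal position.
for fps in (16, 24, 30):
    print(fps, temporal_position(frame_index=fps, fps=fps))
```

Under the same assumption, audio and action tokens could be placed on this axis by substituting their own tokens-per-second rate for `fps`.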
Table 2 lists the three Cosmos 3 variants (Edge, Nano, Super) and their key hyperparameters, illustrating how the same MoT design scales from on‑device to datacenter sizes.
Data Composition and Synthetic Generation
This section details the progressive multi-stage data curricula for the Reasoner and the Generator, highlighting the role of synthetic data in long-tail physical scenarios.
The Reasoner and Generator pathways share a transformer backbone but consume distinct data streams. The Reasoner learns vision‑language understanding, while the Generator learns multimodal synthesis.
SDG‑PhyxSim injects procedurally generated physical scenes into the training mix so the model sees rare interactions that real data rarely cover.
At $t=0$, the cube rests on the ramp; the sphere is above the ramp.
Apply gravity: the sphere falls 0.5 m (distance = ½ $g t^2$) reaching the ramp at $t=0.32$ s.
Collision response pushes the cube 0.2 m up the ramp while the sphere bounces back.
Record RGB frame, depth map, and action vector (forces on each object) for each timestep.
After 1 s the scene stabilizes; the final action vector shows zero net forces.
Even this minimal setup produces a rich multimodal sample (video, depth, segmentation, forces) that real datasets rarely capture, illustrating how synthetic data fills long‑tail physical gaps.
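The free-fall timing in the example can be checked directly from $d = \tfrac{1}{2} g t^2$. The sketch assumes standard gravity $g = 9.81\ \text{m/s}^2$, which the text does not state explicitly; `fall_time` is a hypothetical helper.

```python
import math

G = 9.81  # m/s^2, assumed standard gravity


def fall_time(height_m: float, g: float = G) -> float:
    """Time to fall `height_m` from rest: solve d = 0.5 * g * t^2 for t."""
    return math.sqrt(2.0 * height_m / g)


t_hit = fall_time(0.5)
print(f"{t_hit:.2f} s")
```

With a 0.5 m drop this gives roughly 0.32 s, matching the contact time quoted in the walkthrough.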
The Reasoner curriculum starts with a broad vision‑language pre‑training mixture (≈22 M samples) dominated by OCR and 2D grounding, then shifts to a fine‑tuning mix (≈2.2 M samples) where video‑text reaches 50 % to boost spatiotemporal reasoning.
**Figure 7.** Cosmos 3 Reasoner data composition by capability category. We summarize the curated data mixture used to train Cosmos 3 Reasoner across the pre-training and supervised fine-tuning stages. The mixture contains 22.0M pre-training samples and 2.2M supervised fine-tuning samples spanning image-text, video-text, and text-only categories, with each ring showing the relative contribution of major capability streams such as OCR, visual question answering, reasoning, captioning, grounding, and instruction tuning.
The Generator follows a three‑stage curriculum: massive image/video pre‑training, mid‑training that adds high‑quality and synthetic data plus control‑signal video transfer, and post‑training for domain‑specialized experts.
Progressively augment the generation data pool so the model first learns generic visual synthesis, then acquires domain‑specific control and physical realism.
**Figure 8.** Generator data curriculum. Each row is a training mode; each column is a training stage. Colored cells show the number of training samples used at that stage; gray cells (—) indicate the mode is not active. The video row covers text-to-video, image-to-video, and video-to-video continuation; V2V uses clean conditioning-video prefixes and noisy future-video targets. Action and video transfer data are first introduced during mid-training. Mid-training yields the base Cosmos3-Nano and Cosmos3-Super models (shown between the Mid-training and Post-training columns), which then enter post-training. Post-training is conducted independently for each modality, yielding the specialized models listed on the right: Cosmos3-Super-Text2Image, Cosmos3-Super-Image2Video, and Cosmos3-Nano-Policy-DROID. We note that these specialized models share the exact same architecture with their corresponding mid-train models.
Action data is organized into four pillars—egocentric motion, autonomous vehicle, robotics, and camera motion—totaling 8.4 M episodes and 61.3 K hours.
**Figure 9: Action data distribution.** Hours are aggregated over the four main action-data pillars in the final curated action mid-training set, which contains 8.4M episodes and 61.3K hours.
Table 3 breaks down modality counts (Image‑text 18.8 M, Video‑text 1.08 M, Text‑only 0.13 M). Table 4 details robotics data per embodiment, with AgiBot contributing the most hours (4.37 K).
How does SDG‑PhyxSim differ from generic synthetic image generators?
SDG‑PhyxSim produces fully synchronized multimodal streams (RGB, depth, segmentation, and explicit action vectors) derived from a physics simulation, whereas typical image generators only output static pictures without any underlying dynamics or control signals.
Synthetic data, injected via SDG‑PhyxSim, is essential for covering long‑tail physical scenarios that real collections cannot provide.
Training Phases and Curriculum
Cosmos 3 first builds a multimodal Reasoner, then transfers it to a Generator trained with a progressive curriculum.
Cosmos 3 is trained in two phases. First a multimodal Reasoner is built on large‑scale image–text and video–text data; its weights then seed a Generator that learns to synthesize pixels, audio, and actions through a progressive curriculum.
Start with broad multimodal pre‑training, then gradually add higher‑resolution video, action, and transfer data so the model’s capabilities grow step by step.
Why not train the Generator on a single resolution and then upscale?
Training on a single resolution forces the model to learn a narrow distribution of pixel statistics; the progressive schedule preserves diversity across scales and lets the same parameters learn resolution‑specific shift values, which improves fidelity at all tiers.
Pre‑training: jointly train a language model, a ViT encoder, and a multimodal projector on image–text and video–text corpora for two epochs using a no‑replacement sampler.
Enforce a maximum sequence length of 16 k tokens (2048 image tokens, 8192 video tokens) to keep inference latency low.
Apply square‑root normalized per‑token loss weighting to balance short and long sequences.
Fine‑tune: supervised training on a curated Physical AI mixture with a 1:4 pre‑training‑to‑SFT sampling budget, plus an 800 K instruction‑following subset.
Use AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$), weight decay 0.1, cosine decay to 0.1× peak after 1 000 warm-up steps, and gradient clipping at norm 1.0.
Of the 16 000-token budget, the 2 048 image tokens and 8 192 video tokens account for 10 240; the remaining 5 760 tokens are occupied by text and positional embeddings.
During the forward pass the model processes the image tokens first, then interleaves video tokens, respecting the 16 k overall cap.
Gradient clipping at norm 1.0 ensures that any single sample’s loss cannot produce an exploding update, even with the long video segment.
This cap forces the model to learn compact representations for long video streams, which later benefits the Generator’s multi‑resolution training.
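The learning-rate schedule listed in the steps above (1 000 warm-up steps, then cosine decay to 0.1× the peak) might look like the following. The linear shape of the warm-up, the peak learning rate, and the total step count are not given in the text and are placeholders; `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step: int, peak_lr: float, total_steps: int,
          warmup_steps: int = 1000, final_ratio: float = 0.1) -> float:
    """Linear warm-up to peak_lr, then cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)


PEAK_LR = 1e-4        # placeholder value, not from the source
TOTAL_STEPS = 50_000  # placeholder value, not from the source
for step in (0, 500, 1000, 25_500, 50_000):
    print(step, lr_at(step, PEAK_LR, TOTAL_STEPS))
```

The schedule reaches the peak exactly at the end of warm-up and settles at one tenth of the peak at the final step.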
Train simultaneously on 256p, 480p, and 720p streams, packing heterogeneous sequences into a fixed 74,000-token window so the model never sees padding.
Why does the 720p tier use a larger shift value (s = 5) than the 256p tier?
Higher resolution images contain more high‑frequency content; a larger shift pushes the diffusion process farther into the noisy regime, giving the model more opportunity to learn fine‑grained reconstruction.
Domain specialization: ingest 15.6 M high‑quality image samples and 74.7 M curated video clips covering robotics, driving, and physics.
Multimodal integration: keep the clean‑prefix/noisy‑target formulation while adding action tokens and control‑signal tokens to the diffusion subsequence.
Action data: train the model to predict future video conditioned on action sequences, and to infer actions from observed trajectories.
Video‑transfer data: provide clean control signals (edge, depth, segmentation) as inputs and denoise the corresponding target video.
Scale the action loss by a factor of 10 to balance its magnitude against visual losses.
Multiplying the action loss by 10 yields 0.20, comparable to the visual loss magnitude.
After scaling, both loss terms contribute gradients of comparable magnitude during back-propagation.
Gradient clipping at norm 1.0 caps the combined update, preventing instability.
Without scaling, the action signal would be drowned out, leading to poor action generation despite strong visual performance.
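The loss balancing above amounts to a weighted sum. In this sketch the raw action loss of 0.02 is inferred from the stated scaled value of 0.20, while the visual loss of 0.25 is a made-up number used only to show the comparison; `total_loss` is a hypothetical helper.

```python
ACTION_LOSS_WEIGHT = 10.0  # scaling factor stated in the text


def total_loss(visual_loss: float, action_loss: float,
               action_weight: float = ACTION_LOSS_WEIGHT) -> float:
    """Combine visual and action losses, up-weighting the action term."""
    return visual_loss + action_weight * action_loss


raw_action = 0.02   # inferred: 0.20 / 10
raw_visual = 0.25   # hypothetical value for illustration
print("scaled action term:", ACTION_LOSS_WEIGHT * raw_action)
print("combined loss:", total_loss(raw_visual, raw_action))
```

Without the weight, the action term would be an order of magnitude smaller than the visual term and would contribute little to the gradient.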
Blend a clean latent with Gaussian noise, then train a denoiser to predict the constant velocity that would move the noisy sample back to the clean target.
Compute $x_\sigma = 0.3\times2 + 0.7\times5 = 0.6 + 3.5 = 4.1$.
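With $x_0 = 2$, $\epsilon = 5$, and $\sigma = 0.7$, the linear blend $x_\sigma = (1-\sigma)\,x_0 + \sigma\,\epsilon$ has constant velocity $\epsilon - x_0 = 5 - 2 = 3$ along the path, so the denoiser's target is $3$ and the loss is $(v_\theta - 3)^2$, masked to non-conditioning tokens.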
For $t=0.5$, the shifted noise level $s(0.5)=0.5/(1+0.5)=0.333$, confirming a moderate noise regime.
Even a modest $\sigma$ injects enough noise to make the denoising task non‑trivial, while the shifted schedule ensures later steps focus on finer details.
Why mask the MSE loss on conditioning tokens instead of training the model to reconstruct them?
Conditioning tokens already provide the correct information; penalizing them would force the model to waste capacity learning an identity mapping, reducing its ability to model the truly unknown noisy portion.
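The blending step and the masked loss can be sketched as follows. The values $x_0 = 2$, $\epsilon = 5$, and $\sigma = 0.7$ are read off the arithmetic in the worked example; the helper names are hypothetical, and the regression target is left to the caller rather than fixed here.

```python
def blend(x0: float, eps: float, sigma: float) -> float:
    """Linear interpolation between a clean latent x0 and noise eps."""
    return (1.0 - sigma) * x0 + sigma * eps


def masked_mse(pred: list[float], target: list[float],
               is_conditioning: list[bool]) -> float:
    """Mean squared error over non-conditioning tokens only."""
    errors = [(p - t) ** 2
              for p, t, cond in zip(pred, target, is_conditioning)
              if not cond]
    return sum(errors) / len(errors) if errors else 0.0


x_sigma = blend(x0=2.0, eps=5.0, sigma=0.7)
print(x_sigma)

# The second token is clean conditioning, so its (large) error is ignored.
loss = masked_mse(pred=[1.0, 9.0], target=[1.5, 0.0], is_conditioning=[False, True])
print(loss)
```

Masking means a conditioning token can carry any prediction without affecting the gradient, so capacity is spent only on the tokens that actually need denoising.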
**Figure 10.** Left: Multi-resolution training and sequence packing. The three resolution tiers (256p, 480p, 720p) differ in their maximum frame budget, eligible source material, and rectified-flow noise-shift value; variable-length sequences from different tiers are packed together to fill a fixed 74,000-token context window, maximizing GPU utilization without padding. Right: Data mixture used in generator pre-training. We use joint image-video training, with videos sampled 80% of the time and images the remaining 20%. Within each split, we train at multiple resolutions: 256p, 480p, and 720p. For video batches, we additionally sample uniformly among three conditioning modes—text-to-video, image-to-video, and video-to-video. The exact data mixture is shown in the right panel.
**Table 5.** **Image/Video Model Specifications.** Supported configurations for image and video modalities. Each row shows the FPS range, frame counts (video only), and image/video dimensions (w, h) for the five supported aspect ratios at each resolution.
The table outlines the distribution of training streams and their corresponding modes or conditioning techniques, along with their respective shares.
Infrastructure and Data Loading
This section describes the unified infrastructure and the Joint Data-Loader that packs multimodal samples efficiently.
Heterogeneous multimodal samples produce token counts that vary by two orders of magnitude, causing massive padding waste, workload imbalance, and NCCL timeouts in naïve distributed training.
The loader unifies per‑stream queues into a single rank‑synchronous batch, greedily packing samples until a strict token budget is hit, then using a short look‑ahead to fill any leftover capacity.
Start with an empty packed sequence (0 tokens used).
Take the first sample (length 5) → packed tokens = 5 (still ≤ 12).
Next sample (length 3) fits → packed tokens = 8.
Third sample (length 4) would exceed the budget (8 + 4 > 12), so it is moved to the look‑aside buffer.
Scan forward: the next sample in Stream A is length 4 (still too large) → skip.
Switch to Stream B (because Stream A is exhausted); first sample length 2 fits → packed tokens = 10.
Next Stream B sample length 6 exceeds remaining budget (10 + 6 > 12) → look‑aside.
Final Stream B sample length 1 fits → packed tokens = 11 (≤ 12) and sample count = 4 (reaches $N_{\text{max}}$).
Iteration ends; the look-aside buffer now holds the skipped samples (the two length-4 samples from A and the length-6 sample from B), which are returned to the head of their respective stream buffers for the next iteration.
Look-ahead packing salvages otherwise wasted token capacity: here it raises utilization from 8 to 11 of the 12 budgeted tokens, leaving a single idle token, without breaking deterministic rank-synchrony.