Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen-VLA unifies manipulation, navigation, and egocentric action modeling into a single foundation model.

How can we unify diverse robotic manipulation, navigation, and vision-language tasks into a single generalist model that generalizes across different robot embodiments?

Embodied AI is currently fragmented, with specialized models for specific robots or tasks that fail to generalize across different environments and embodiments. Qwen-VLA addresses this by using a shared vision-language backbone and a flow-matching action decoder that treats all embodied tasks as a unified action-and-trajectory prediction problem. The model achieves state-of-the-art performance across diverse benchmarks, including 97.9% on LIBERO and 76.9% average success in real-world OOD manipulation.

Paper Primer

The core mechanism is a staged training recipe that separates action-prior compression from visual grounding. By first training the action decoder on text-to-action (T2A) sequences without images, the model learns a structured motor prior before it ever sees a visual observation.

To handle heterogeneous robot platforms, the model uses embodiment-aware prompt conditioning. A textual description of the robot's configuration and control convention is prepended to the input, allowing a single set of model weights to adapt to different action dimensions and control frequencies.

Qwen-VLA achieves high-performance generalization across diverse task families and robot embodiments.

The model was evaluated on LIBERO, Simpler-WidowX, RoboTwin, R2R, and RxR benchmarks, alongside real-world ALOHA experiments. It attains 97.9% on LIBERO, 73.7% on Simpler-WidowX, and 76.9% average OOD success in real-world ALOHA manipulation.

Why use a staged training recipe instead of training the whole model end-to-end from the start?

The VLM backbone and the randomly initialized action decoder enter training in asymmetric states. Staged training prevents noisy gradients from a fresh decoder from destabilizing the pretrained backbone and allows the decoder to learn action structure before attempting visual grounding.

How does the model handle different robot platforms without using separate output heads?

The model uses a unified action-and-trajectory prediction space where all control signals are represented as sequences of real-valued vectors. Embodiment-aware prompts inform the model of the specific control convention, and a binary mask prevents padded dimensions from influencing the gradient.

Researchers can now treat manipulation, navigation, and trajectory prediction as a single unified task, enabling cross-embodiment transfer and scaling embodied learning through large-scale joint pretraining.

Abstract and Overview

We present Qwen-VLA, a single model that unifies manipulation, navigation, and vision‑language tasks.

Current embodied AI relies on separate models for each task, which limits generalization. Qwen‑VLA unifies perception, reasoning, and continuous action in a single vision‑language‑action model, using embodiment‑aware prompt conditioning to specify robot morphology.

When each capability—manipulation, navigation, or visual reasoning—is handled by its own specialized model, the overall system cannot share knowledge across tasks or robot forms.

Isn't it enough to just ensemble the specialized models?

Ensembling still keeps the models isolated; it cannot create shared representations or a common action prior, whereas Qwen‑VLA learns a single set of parameters that jointly encode perception, language, and motor control, enabling transfer across tasks and embodiments.

**Figure 1.** Overview of Qwen-VLA, a unified embodied model trained on mixed manipulation, navigation, and vision-language understanding data to generate both robot actions and textual responses.

Qwen‑VLA demonstrates that a single VLA model can serve as a generalist policy for many robotic tasks.

Introduction and Contributions

We frame the fragmentation of embodied intelligence and outline our unified VLA solution.

Embodied intelligence systems are fragmented: each robot platform, task family, or dataset typically receives a dedicated model, preventing knowledge transfer across embodiments. This specialization hampers scaling because the same visual‑language reasoning that fuels large‑scale VLM pretraining cannot be leveraged for diverse control problems.

Our approach unifies manipulation, navigation, and egocentric action modeling within a single Vision‑Language‑Action (VLA) architecture. By conditioning on a textual description of the robot platform (the embodiment‑aware prompt) and sharing a common action‑and‑trajectory space, Qwen‑VLA can ingest heterogeneous supervision while preserving a single inference interface.

Compute per‑trajectory size: $7$ joints × $200$ steps × $4$ bytes = $5{,}600$ bytes.

Scale to batch of $64$: $5{,}600$ bytes × $64 = 358{,}400$ bytes ≈ $360$ KB.

For a $20$‑DoF, $1{,}000$‑step trajectory: $20$ × $1{,}000$ × $4$ bytes = $80{,}000$ bytes per sample.

Extrapolate to a large batch (e.g., $2{,}500$ samples): $80{,}000$ bytes × $2{,}500 = 200{,}000{,}000$ bytes $\approx 200$ MB.

This toy calculation counts only the raw action tensors. Activations and attention maps computed over sequences of this length multiply that footprint many times over, which is why a naïve implementation can exhaust GPU memory. This motivates a compressed action prior and a unified decoder that can handle diverse dimensionalities without materializing the full map.
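The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper names `trajectory_bytes` and `batch_bytes` are illustrative, and 4 bytes per value assumes float32 storage as in the example.

```python
def trajectory_bytes(dof: int, steps: int, bytes_per_value: int = 4) -> int:
    """Raw size of one action trajectory stored as a dense array."""
    return dof * steps * bytes_per_value


def batch_bytes(dof: int, steps: int, batch_size: int, bytes_per_value: int = 4) -> int:
    """Raw size of a batch of equally sized trajectories."""
    return trajectory_bytes(dof, steps, bytes_per_value) * batch_size


small = batch_bytes(dof=7, steps=200, batch_size=64)
large = batch_bytes(dof=20, steps=1000, batch_size=2500)
print(f"7-DoF, 200 steps, 64 samples: {small:,} bytes")
print(f"20-DoF, 1,000 steps, 2,500 samples: {large:,} bytes")
```

Only the raw action values are counted here; model activations and attention maps are not part of this estimate.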

Problem Formulation

We cast diverse embodied tasks as a single conditional prediction problem.

Embodied tasks such as manipulation, navigation, and trajectory forecasting all share a common bottleneck: they must translate language and visual cues into concrete future actions. A single model that treats each task as a conditional sequence prediction can exploit this commonality.

All embodied tasks are expressed as “given the current visual observation, language instruction, and embodiment description, predict the next $H$ steps of the appropriate action or trajectory.”

Step 1: Encode the image and instruction, concatenate with the embodiment prompt, producing a conditioning vector $c$.

Step 2: Feed $c$ into the decoder; the first token predicts the end‑effector pose for grasping (position $(0.45, 0.12, 0.03)$ m).

Step 3: Condition on the first predicted pose and generate the second token – the lift motion (position $(0.45, 0.12, 0.15)$ m).

Step 4: Condition on the lift pose and generate the third token – the place pose (position $(0.80, 0.30, 0.03)$ m).

Even though the underlying action space differs (joint angles vs Cartesian poses), the same three‑step decoding process works because all actions are expressed in a common latent representation.

How does this unified prediction differ from a standard language‑to‑action model that only takes text as input?

Standard models ignore the visual observation $\mathbf{o}_t$ and thus cannot ground language in the current scene. Our formulation explicitly conditions on $\mathbf{o}_t$, so the same instruction “pick the red block” will produce different actions depending on where the red block actually appears in the image.

We use a multimodal transformer that interleaves visual tokens with text tokens, allowing a single sequence to capture both modalities from the first layer.

Why not keep visual and language streams separate and fuse them later?

Separate streams require an additional cross‑attention layer that scales quadratically with the combined sequence length. Early interleaving lets the model share the same attention matrix throughout, cutting both compute and latency while still learning rich multimodal interactions.

A DiT‑style action expert starts from a noisy action chunk and learns the velocity that carries it toward the clean trajectory, yielding precise trajectories after a few Euler integration steps.

How is flow‑matching different from ordinary diffusion‑based action generation?

Ordinary diffusion predicts the denoised action after many stochastic reverse steps, which is slow at inference. Flow‑matching directly learns the deterministic velocity field, so the expert can produce a clean action after just a few deterministic Euler steps, dramatically reducing latency.
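The few-step deterministic sampling described above can be sketched as follows. This is an illustration, not the paper's decoder: `toy_velocity` is a hypothetical stand-in for the learned velocity field (it always points straight at a fixed target action), and `euler_sample` is an assumed name for a plain forward-Euler integrator running from t = 0 (noise) to t = 1 (action).

```python
import random

TARGET = [0.5, -0.2, 0.1]  # hypothetical clean action used by the toy field


def toy_velocity(x, t):
    """Stand-in for a learned velocity field: straight-line flow to TARGET."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET, x)]


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so toy_velocity never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
action = euler_sample(toy_velocity, noise, num_steps=4)
print(action)
```

Because the toy field follows a straight path, four Euler steps already land on the target up to rounding. A learned field is only approximately straight, so the number of steps becomes a trade-off between accuracy and latency.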

Embodiment-aware Prompt Conditioning

We unify diverse robot actions into a single masked tensor conditioned on an embodiment prompt.

Multiple robot embodiments each have distinct hardware and control conventions, which makes training a single model difficult. A unified representation must respect these differences without proliferating separate output heads.

We prepend a short textual description of the robot’s hardware and control settings to each example so the model knows which embodiment it must act for.

Tokenize the prompt → a sequence of 23 tokens.

Pass tokens through the VLM backbone → hidden states H₁,…,H₂₃.

Obtain the noisy action chunk (e.g., two 4‑dim vectors) → A₁, A₂.

Concatenate H₁…H₂₃ with A₁, A₂ → combined input I.

Feed I to the DiT → model predicts the next action chunk conditioned on the embodiment.

The prompt supplies all necessary embodiment details; the model never needs a separate robot‑ID embedding because the full description is already encoded in the token stream.

Why not just add a one‑hot robot identifier instead of a full textual prompt?

Because a one‑hot ID conveys only the robot’s identity, not its configuration (e.g., single vs. dual arms, presence of a waist or mobile base) or control frequency. The full prompt embeds these details as natural language, allowing the backbone to reason about them jointly.
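To make the idea concrete, here is a minimal sketch of how an embodiment prompt might be assembled and prepended to an instruction. The source does not give the prompt wording, so the `EmbodimentSpec` fields, the template in `build_prompt`, and the two robot entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbodimentSpec:
    # Every field and value below is an illustrative assumption.
    robot: str
    arms: str        # e.g. "single-arm" or "dual-arm"
    control: str     # e.g. "delta end-effector pose"
    action_dim: int
    control_hz: int


def build_prompt(spec: EmbodimentSpec) -> str:
    """Render the embodiment description as plain text."""
    return (
        f"Robot: {spec.robot}. Configuration: {spec.arms}. "
        f"Control: {spec.control}, {spec.action_dim} dimensions "
        f"at {spec.control_hz} Hz."
    )


def build_input(spec: EmbodimentSpec, instruction: str) -> str:
    """Prepend the embodiment prompt to the task instruction."""
    return f"{build_prompt(spec)}\nInstruction: {instruction}"


arm_a = EmbodimentSpec("arm_a", "single-arm", "delta end-effector pose", 7, 10)
bimanual_b = EmbodimentSpec("bimanual_b", "dual-arm", "absolute joint positions", 14, 50)

print(build_input(arm_a, "pick the red block"))
print(build_input(bimanual_b, "pick the red block"))
```

The same instruction yields two different input strings, which is what lets one set of weights distinguish control conventions without a robot-ID embedding.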

All robot actions, whether manipulation or navigation, are expressed as a fixed‑size tensor $Y$ where only the first c channels carry real values and the rest are zero‑padded; a binary mask M marks which entries are valid.

Y = [[0.5, -0.2, 0, 0], [0.6, -0.1, 0, 0], [0.4, -0.3, 0, 0]] (3 × 4 matrix).

Mask M = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]] indicating the first two channels are valid at every time step.

During loss computation, only the entries where M = 1 are considered; the zero‑padded columns are ignored.

The same Y/M pair can be used for a navigation sample with c = 3 channels by filling the third column and leaving the fourth zero‑padded.

The mask cleanly separates real signal from padding, so a single set of parameters can learn across heterogeneous action spaces without interference.
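The padding and masking scheme above can be sketched in plain Python. The function names `pad_actions` and `masked_mse` are illustrative, and mean squared error is assumed only for demonstration; the point is that entries with M = 0 never contribute to the loss.

```python
def pad_actions(actions, max_dim):
    """Zero-pad each action vector to max_dim and return (Y, M)."""
    Y, M = [], []
    for a in actions:
        if len(a) > max_dim:
            raise ValueError("action has more channels than max_dim")
        pad = max_dim - len(a)
        Y.append(list(a) + [0.0] * pad)
        M.append([1] * len(a) + [0] * pad)
    return Y, M


def masked_mse(pred, target, mask):
    """Mean squared error over entries where mask == 1 only."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# The 3 x 4 example from the text: two valid channels, two padded.
Y, M = pad_actions([[0.5, -0.2], [0.6, -0.1], [0.4, -0.3]], max_dim=4)

# A prediction that is wrong only in the padded channels incurs no loss.
pred = [row[:2] + [9.0, 9.0] for row in Y]
print(masked_mse(pred, Y, M))
```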

Why not allocate separate output heads for each robot type instead of using a mask?

Separate heads would require a different parameter set per embodiment, exploding model size and preventing knowledge sharing. The mask lets one shared head predict all channels while ignoring irrelevant dimensions, preserving capacity and enabling transfer across embodiments.

Training Objectives

The model learns via a joint objective balancing continuous action flow-matching and multimodal next-token prediction.

To unify cognitive reasoning with motor control, we train the model end-to-end using a joint objective. This approach balances continuous action generation—which requires precise spatial coordination—with vision-language understanding, which preserves the model's reasoning capabilities during embodied co-training.

Instead of predicting absolute action values, the model learns to predict the "velocity" required to transform noise into a clean action trajectory. This treats action generation as a continuous flow, allowing the model to refine complex motor sequences through iterative integration.

To prevent the model from losing its reasoning skills while learning motor control, we maintain a standard next-token prediction loss on auxiliary vision-language data. This $L_{\text{vl}}$ objective acts as a regularizer, ensuring the backbone remains grounded in perception and language even under heavy embodied co-training.
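One common way to write such a joint objective is sketched below. The source does not spell out the formula, so the linear noise-to-action path, the weight $\lambda$, and the symbols $\epsilon$ (noise), $a$ (clean action chunk), $\tau$ (flow time), and $v_\theta$ (predicted velocity) are assumptions chosen to match the descriptions above; $c$ is the conditioning and $M$ the validity mask.

$$
x_\tau = (1-\tau)\,\epsilon + \tau\,a, \qquad \epsilon \sim \mathcal{N}(0, I), \quad \tau \in [0, 1]
$$

$$
L_{\text{fm}} = \mathbb{E}_{\tau,\epsilon}\left\| M \odot \big( v_\theta(x_\tau, \tau, c) - (a - \epsilon) \big) \right\|^2, \qquad L = L_{\text{fm}} + \lambda\, L_{\text{vl}}
$$

Under this assumed path the regression target $a - \epsilon$ is simply the derivative of $x_\tau$ with respect to $\tau$, which is the "velocity" that carries noise to a clean action.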

Two-Stage Pretraining Recipe

Two pretraining stages first give the model a language‑driven action prior and then ground it in vision; supervised fine‑tuning and reinforcement learning follow as Stages III and IV.

Jointly training a fresh motor decoder with a heavily pretrained vision‑language backbone is like pairing a rookie apprentice with a veteran master: the apprentice’s noisy gradients can destabilize the master before it has learned any useful skill.

Freeze the vision‑language backbone and force the DiT decoder to reconstruct full‑dimensional action trajectories from only the compressed language and embodiment prompt.

How does T2A differ from ordinary language‑to‑action imitation learning?

In standard imitation, the model sees both language and visual observations while learning to mimic actions. T2A removes the visual channel entirely, so the decoder must infer the full motion solely from the compressed description, producing a language‑indexed prior rather than a direct imitation mapping.

Step 1: The decoder embeds the three words and the prompt, producing a 6‑dimensional latent vector.

Step 2: Using flow‑matching, the decoder predicts the first joint pair (0, 0) from the latent vector.

Step 3: The latent vector is updated with the predicted joint pair, and the decoder predicts the next pair (10, 5).

Step 4: Repeating the update yields the full sequence [(0, 0), (10, 5), (20, 10), (30, 15)].

The decoder learns to generate a coherent trajectory solely from the language description, establishing a reusable prior that can later be conditioned on visual cues.

Unfreeze both backbone and decoder, then train on heterogeneous visual‑language‑action data so the decoder’s language‑driven prior is anchored to real visual observations.

Why can’t we simply start joint training with a randomly initialized decoder?

A freshly initialized decoder would have to learn both the language‑driven prior and visual grounding simultaneously, producing noisy gradients that destabilize the already‑pretrained backbone. CPT avoids this because it begins only after T2A has established a solid prior.

Step 1: The pretrained backbone maps the 4‑pixel image to an 8‑dimensional visual embedding.

Step 2: The language encoder produces a 6‑dimensional instruction embedding.

Step 3: The decoder concatenates visual, language, and prompt embeddings, then predicts the first joint angle.

Step 4: The loss compares the predicted angle to the ground‑truth (e.g., 15° for “push” and 30° for “lift”).

Step 5: Back‑propagation updates both backbone and decoder, aligning visual features with the language‑driven action prior.

CPT quickly aligns visual cues with the already‑structured action prior, so the model learns to ground language in vision without re‑learning the prior from scratch.

Stage I (T2A): Freeze the VLM, train the DiT decoder on language + embodiment prompts only, building a language‑indexed action prior.

Stage II (CPT): Unfreeze both modules, train on joint visual‑language‑action data to ground the prior in images.

Stage III (SFT): Branch from the CPT checkpoint into two tracks—multi‑task fine‑tuning on diverse simulated tasks and teleoperation fine‑tuning on real‑robot data.

Stage IV (RL): Starting from the multi‑task SFT checkpoint, apply reinforcement learning with sparse success rewards in a single simulator to boost closed‑loop task performance.

**Figure 2.** Training recipe of Qwen-VLA. Stage I (T2A) trains the DiT action decoder to reconstruct actions from text alone, building a structured action prior without visual input. Stage II (CPT) unfreezes both modules to ground this prior in visual observations. Stage III (SFT) branches into multi-task and real-robot tracks, and Stage IV (RL) optimizes closed-loop task success via environment rewards.

By first learning a language‑driven action prior and then grounding it, the recipe avoids the instability of naïve joint training while preserving the capacity to adapt to visual inputs later.

Pretraining Data Composition

We assemble a heterogeneous pretraining mixture and condition each example with an embodiment prompt.

The pretraining corpus blends eight data families, spanning robot manipulation, human egocentric demos, navigation, synthetic simulation, and several auxiliary vision‑language sources, to give the model broad embodied perception and action generation.

**Table 1.** Pretraining data mixture composition.

We keep each dataset’s native action format and prepend a short prompt that tells the model which robot, arm configuration, and control convention the example follows—like tagging each book with its shelf label so a single retrieval algorithm can fetch items from many shelves without re‑training.

Step 1: Sample a minibatch of 10 examples. Draw 7 from A, 2 from B, 1 from C according to the proportions.

Step 2: Attach prompts—“A‑Manip”, “B‑Ego”, “C‑Sim”—to the respective examples.

Step 3: Feed the batch through the shared encoder; the prompt tokens steer the decoder toward the correct action distribution.

Step 4: Compute loss on each example using its native action format (e.g., delta‑EEF for A, absolute joint angles for B).

Step 5: Back‑propagate gradients; the shared backbone updates to respect both visual and prompt cues.

The mixture lets the model learn a single set of parameters that can switch instantly between disparate action conventions simply by changing the prompt.
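The minibatch construction in the steps above can be sketched as follows. The 70/20/10 weights are inferred from the 7/2/1 split in the example, and the example pools and function names are hypothetical placeholders.

```python
import random

# Assumed toy setup: three sources with a 70/20/10 mixture.
WEIGHTS = {"A": 70, "B": 20, "C": 10}
PROMPTS = {"A": "A-Manip", "B": "B-Ego", "C": "C-Sim"}
POOLS = {name: [f"{name}_example_{i}" for i in range(100)] for name in WEIGHTS}


def allocate(batch_size, weights):
    """Split batch_size across sources in proportion to integer weights."""
    total = sum(weights.values())
    counts = {k: batch_size * w // total for k, w in weights.items()}
    remainders = {k: batch_size * w % total for k, w in weights.items()}
    leftover = batch_size - sum(counts.values())
    # Hand any leftover slots to the sources with the largest remainders.
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[k] += 1
    return counts


def sample_minibatch(batch_size, rng):
    """Draw examples per source and tag each with its embodiment prompt."""
    batch = []
    for name, n in allocate(batch_size, WEIGHTS).items():
        for example in rng.sample(POOLS[name], n):
            batch.append({"prompt": PROMPTS[name], "example": example})
    rng.shuffle(batch)
    return batch


batch = sample_minibatch(10, random.Random(0))
for item in batch:
    print(item["prompt"], item["example"])
```

Each example keeps its own prompt tag, so the loss for that example can later be computed in its native action format.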

How does this differ from naïvely unifying all actions into a single joint‑space representation?

Unifying forces every dataset to adopt the same coordinate system, discarding dataset‑specific semantics (e.g., delta vs. absolute commands). The prompt‑conditioned approach preserves each dataset’s native format, so the model can exploit the richer supervision each format provides while still learning a common backbone.

**Table 1.** Action representation. Different datasets adopt different action conventions: some provide absolute end-effector poses in Cartesian space (Xie et al., 2026), others provide delta end-effector commands, and others use absolute or relative joint-space control signals. We preserve each dataset's original action format rather than converting to a shared representation, relying on embodiment-aware prompt conditioning to inform the model of the current control convention.

Action Normalization

Normalize heterogeneous action dimensions and align multi‑view observations for robust pretraining.

Action data come from many robots and simulators, each with its own units and range; training a single model on such mismatched scales leads to unstable gradients and poor transfer.

Think of each robot’s joystick being recalibrated so that its full swing maps to the same –1 to 1 interval — the model then sees a uniform control space regardless of the underlying hardware.

How does this differ from ordinary min‑max scaling that uses the absolute min and max?

Min‑max scaling would map the absolute extremes to –1 and 1, so a single outlier could shrink the effective dynamic range for the majority of samples. Quantile scaling ignores the tails, keeping the bulk of the distribution well‑conditioned while still bounding the output.

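Suppose that for action dimension $k$ the 1st and 99th percentiles are $q_{k}^{01}=0$ and $q_{k}^{99}=10$, and that two raw values are $a_1=2$ and $a_2=8$. Compute the range: $q_{k}^{99}-q_{k}^{01}=10-0=10$.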

Shift and divide: for $a_1$, $(2-0)/10=0.2$; for $a_2$, $(8-0)/10=0.8$.

Stretch to $[-1,1]$: $\tilde{a}_1=2\cdot0.2-1=-0.6$, $\tilde{a}_2=2\cdot0.8-1=0.6$.

Clip (both already lie in $[-1,1]$), so final normalized values are $-0.6$ and $+0.6$.

The mapping preserves the ordering (2 < 8 ⇒ –0.6 < +0.6) while compressing the extreme tails, yielding a stable range for downstream learning.
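The worked example can be reproduced with a short function. This is a minimal sketch assuming the 1st and 99th percentiles (here 0 and 10) have already been estimated per action dimension; `quantile_normalize` is an illustrative name.

```python
def quantile_normalize(a, q_lo, q_hi):
    """Map [q_lo, q_hi] linearly onto [-1, 1] and clip values outside it."""
    if q_hi <= q_lo:
        raise ValueError("q_hi must be greater than q_lo")
    scaled = 2.0 * (a - q_lo) / (q_hi - q_lo) - 1.0
    return max(-1.0, min(1.0, scaled))


for raw in (2.0, 8.0, 50.0):
    print(raw, "->", round(quantile_normalize(raw, q_lo=0.0, q_hi=10.0), 3))
```

The third input, an outlier at 50, is clipped to the upper bound instead of stretching the range for every other sample, which is the advantage over min-max scaling.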

Language instructions are merged from original annotations and model‑generated captions, then filtered for consistency with the observed motion; mismatched pairs are discarded to guarantee reliable language‑action alignment.

Each image token is wrapped by view‑specific boundary tokens—e.g., <|ego|> … <|/ego|>—so the VLM backbone knows whether the pixel stream comes from an egocentric head camera or a wrist‑mounted lens.
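A minimal sketch of the wrapping step is shown below. Only the `<|ego|>` tag appears in the text; the `wrist` tag name, the placeholder image tokens, and the function names are assumptions for illustration.

```python
def wrap_view(image_tokens, view):
    """Surround one camera's image tokens with view-specific boundary tokens."""
    return [f"<|{view}|>", *image_tokens, f"<|/{view}|>"]


def build_visual_sequence(views):
    """Concatenate wrapped token runs for every camera, in insertion order."""
    sequence = []
    for view, tokens in views.items():
        sequence.extend(wrap_view(tokens, view))
    return sequence


sequence = build_visual_sequence({
    "ego": ["img_0", "img_1"],   # egocentric head camera
    "wrist": ["img_2"],          # hypothetical tag for a wrist-mounted camera
})
print(sequence)
```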

We run a multi‑stage cleaning pipeline that removes trajectories with corrupted frames, near‑zero‑variance motions, or abnormal episode lengths; for datasets lacking explicit actions we synthesize pseudo‑actions via finite differences on proprioceptive states.
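The pseudo-action step can be illustrated with first-order finite differences. The text does not state the exact differencing scheme, so the forward difference used here (next state minus current state) and the name `pseudo_actions` are assumptions.

```python
def pseudo_actions(states):
    """First-order finite differences: action_t = state_{t+1} - state_t."""
    return [
        [nxt - cur for cur, nxt in zip(s_t, s_next)]
        for s_t, s_next in zip(states, states[1:])
    ]


# Toy proprioceptive states (e.g. two joint positions over three timesteps).
states = [[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]]
print(pseudo_actions(states))
```

A trajectory of T states yields T - 1 pseudo-actions, so the final observation has no paired action under this scheme.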

Egocentric human demonstrations provide abundant, diverse manipulation experience that far exceeds the limited scope of teleoperated robot trajectories, giving the model richer priors for downstream tasks.

Our egocentric corpus aggregates four public sources: (1) Ego4D/EPIC‑KITCHENS processed by VITRA, (2) EgoDex captured with Apple Vision Pro (829 h), (3) EgoVerse (1,300 h, 1,965 tasks), and (4) Xperience with synchronized depth and motion capture.

For each hand we predict a future SE(3) wrist transform (translation + axis‑angle rotation) and ten PCA‑derived eigengrasp coefficients that capture the dominant modes of 45‑dimensional joint pose. That gives 3 + 3 + 10 = 16 dimensions per hand, or 32 action dimensions per timestep for both hands.

Synthetic Simulation Data

We synthesize diverse VLA and language‑action data by randomizing scenes, tasks, and dynamics, then segmenting trajectories.

Existing real‑world demonstrations are limited in coverage and costly to collect. To obtain controllable, diverse supervision we build a large‑scale synthetic pipeline that produces both vision‑language‑action (VLA) and language‑only action data.

We simulate robot manipulation scenes, randomize visual and dynamic factors, roll out many successful trajectories, and then split each trajectory into subtask chunks so the model learns both high‑level instructions and low‑level action primitives.

How does this synthetic data differ from real‑world robot demonstrations?

Real demonstrations are scarce and fixed in appearance; synthetic data can be generated at massive scale, with systematic randomization of visual, geometric, and control factors, and with guaranteed success via motion‑planning checks, giving the model exposure to a far broader distribution of scenes and dynamics.

Scene 1: objects placed at positions (0,0) and (0.2,0). Random lighting set to bright; camera at front.

Trajectory 1: planner computes a collision‑free path, rolls out 50 Hz joint states, records end‑effector pose reaching the cube, grasping, moving to the cylinder, and releasing.

Subtask segmentation splits the trajectory after the grasp event, yielding a “pick” subtrajectory and a “place” subtrajectory.

Scene 2: same objects but swapped initial positions, lighting dim, camera side view. Trajectory 2 follows analogous steps with different joint angles.

Even this tiny setup shows how randomizing visual conditions and breaking full motions into subtask chunks gives the model multiple learning signals: a full‑task example for sequencing and isolated primitives for fine‑grained action grounding.
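The subtask segmentation in the example can be sketched as a split at the first grasp event. The `gripper_closed` flag and the trajectory layout are hypothetical; a real pipeline would take the event from simulator state.

```python
def split_after_grasp(trajectory):
    """Split a trajectory after the first step where the gripper closes."""
    for i, step in enumerate(trajectory):
        if step["gripper_closed"]:
            return trajectory[: i + 1], trajectory[i + 1 :]
    return trajectory, []  # no grasp found: keep the trajectory whole


trajectory = [
    {"t": 0, "gripper_closed": False},  # approach
    {"t": 1, "gripper_closed": False},
    {"t": 2, "gripper_closed": True},   # grasp event
    {"t": 3, "gripper_closed": True},   # transport
    {"t": 4, "gripper_closed": False},  # release
]

pick, place = split_after_grasp(trajectory)
print([s["t"] for s in pick], [s["t"] for s in place])
```

The full trajectory and the two chunks can all be kept as training samples, matching the full-task and isolated-primitive signals described above.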

**Figure.** Examples of short-horizon and long-horizon robotic manipulation tasks.

Navigation and VL Data

We enrich pretraining with diverse vision‑language sources, notably a two‑stage fine‑grained caption pipeline.

Coarse task labels (e.g., “pick up the bowl”) leave the model guessing the exact motion, which hampers policy learning. To resolve this ambiguity we inject richer vision‑language supervision that ties language to every low‑level action. The section’s core contribution is a two‑stage pipeline that turns raw videos into dense, step‑by‑step captions.

The pipeline first extracts a coarse action sketch, then refines it frame‑by‑frame into a detailed script that enumerates primitives, contacts, and trajectories.

How does this pipeline differ from ordinary video captioning?

Standard video captioning produces a single sentence describing the overall event. Our pipeline produces a sequence of 13‑dimensional, frame‑aligned steps, explicitly grounding each primitive to visual evidence and correcting view‑specific errors, which is essential for learning precise embodied policies.

Stage 1 outputs the coarse action “pick up cup” and identifies the cup as the target object.

Stage 2 processes frame t₁: detects the gripper approaching the cup’s right side → “approach cup from right”.

Stage 2 processes frame t₂: gripper contacts the cup’s handle → “grasp handle”.

Stage 2 processes frame t₃: lifts and places the cup on the table center → “lift cup” then “place cup at table center”.

The fine‑grained script disambiguates left/right approaches and reveals the exact contact point, information that a single coarse label cannot provide.

Beyond manipulation, we augment pretraining with autonomous‑driving VQA data (2.4 %) to teach temporal scene understanding, surround‑view spatial reasoning, language‑grounded localization, and planning‑aware reasoning. Each source is converted to a unified conversational format with frame and view tags.

**Table.** Comparison between coarse and fine-grained labels for a robotic task.

We also incorporate 2.5 % spatial grounding data (2D bounding boxes) to teach precise object localization, and a 3.4 % general vision‑language mixture (captioning, VQA, OCR, instruction following, referring expression) that is up‑weighted toward video‑centric, spatial‑relation, and 3D‑aware supervision to prevent catastrophic forgetting.

Post-Training and SFT

Specialized post‑training turns a generalist model into a precise controller.

Pretraining gives Qwen‑VLA‑Base broad cross‑task knowledge but falls short on the millimetre‑scale precision required for closed‑loop robot control. The post‑training pipeline injects task‑specific accuracy without discarding the generalist foundation.
