Robots Need More Than VLAs & World Models

Elis Karcini, Faisal Mehrban, Quang Nguyen, Mac Schwager, Arash Ajoudani, Cesar Cadena, Jan Peters, Marco Hutter, Haitham Bou-Ammar

Generalist robotics requires moving beyond policy scaling to a grounding-centric pipeline that converts unstructured physical data into robot-usable supervision.

Why is scaling Vision-Language-Action (VLA) models and world models insufficient for achieving generalist robot intelligence?

Current robot learning relies on "robot-native" data—trajectories where actions, task labels, and rewards are already explicitly defined. This makes scaling expensive, as every new task requires curated, embodiment-specific demonstrations. The authors argue that the field must shift to a "grounding-centric" pipeline: a stack of mechanisms that transform abundant, unstructured physical experience—like human video or simulation rollouts—into robot-usable actions, contacts, and rewards. The paper identifies four missing pillars for this transition: data interfaces for autolabelling, embodiment interfaces for motion retargeting, world-model interfaces for 3D reasoning, and reward interfaces for task-progress inference.

Paper Primer

The current "robot-native" paradigm is a bottleneck because it treats physical experience as a commodity that must be pre-formatted for a specific body. The authors propose that Vision-Language-Action (VLA) models are not the entire solution, but rather one layer in a larger stack that depends on upstream grounding of dynamics, rewards, and embodiment constraints.

The central scaling limit in robotics is the grounding step, not the policy architecture.

Existing VLA models (e.g., RT-2, OpenVLA) show that performance scales with data diversity, but they remain dependent on supervision that has already been expressed in the coordinate system of a robot-learning problem. While robot-native datasets have grown to over 1 million trajectories, they remain tiny compared to the vast, unlabelled physical behaviour present in the world.

Why is the current "robot-native" approach insufficient for generalist robotics?

Robot-native data requires every trajectory to be physically executable, tied to a specific body, and manually labelled with task semantics or rewards, making it prohibitively expensive to scale to the diversity of the real world.

What is the "grounding-centric" pipeline, and how does it differ from current methods?

Instead of collecting robot-specific demonstrations, the grounding-centric pipeline uses mechanisms to extract robot-usable signals (like latent actions or reward functions) from passive sources like human video, internet-scale data, and simulation.

The Scaling Bottleneck in Robotics

Robot intelligence stalls because scaling VLA models alone cannot turn raw physical data into robot‑usable supervision.

Generalist robot intelligence is often framed as a policy‑scaling problem: collect more robot demonstrations, train larger Vision‑Language‑Action (VLA) models, and expect broader generalisation. We argue that this framing is incomplete because the real bottleneck is the lack of mechanisms that turn abundant, unstructured physical behaviour into grounded robot supervision.

Scaling VLA models alone cannot close the gap to generalist robot intelligence; the missing piece is a way to ground raw physical experience into robot‑usable signals.

Vision‑Language‑Action models combine visual perception, language understanding, and action generation into a single network that predicts robot commands from image‑text pairs.

World models learn predictive dynamics of the physical environment, enabling a robot to simulate outcomes of imagined actions before execution.

We identify four missing components that would close this grounding gap: (1) data interfaces that autolabel unstructured behaviour, (2) embodiment interfaces that retarget human motion to robot actions, (3) world‑model interfaces that provide physics‑grounded 3‑D reasoning, and (4) reward interfaces that infer task progress from video and language. Shifting the robotics pipeline from a robot‑data‑centric to a grounding‑centric design is therefore the key research direction.

**Figure 1.** Next-generation robotics will come from advances that go well beyond scaling Vision-Language-Action (VLA) models.

The field must move from pure scaling of VLA models to architectural and supervision innovations that ground physical experience.

Limits of Robot-Native Supervision

Robot-native supervision underpins current generalist robot policies but also limits scaling.

Robot‑native supervision—collecting trajectories where observations are already paired with robot actions—has driven the most impressive gains in generalist robot policies.

Robot-native supervision is data already expressed in the robot's own coordinate system: observations paired with embodiment-specific actions, task labels, or success signals. A learning algorithm can consume it directly, without an additional grounding step.

How does robot‑native supervision differ from generic imitation learning that uses human videos?

Robot‑native supervision provides actions that are executable by the target robot, whereas generic imitation learning from human videos only supplies visual cues; the latter lacks a direct mapping from observed motion to robot‑specific motor commands and therefore requires an additional grounding step.

To clarify the landscape, we organise supervision approaches into three camps: (1) robot-native trajectories, in which observations are already paired with robot-specific actions, task labels, or success signals; (2) passive videos (human or internet) that provide visual or linguistic cues but lack explicit robot actions; and (3) simulation or learned world-model rollouts that synthesise new robot trajectories.
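The difference between the three camps can be made concrete with a small data-structure sketch. It is only an illustration: the names `Source`, `Episode`, `missing_signals`, and `needs_grounding` are hypothetical and do not come from the paper, and the fields are a simplified guess at what a learner might expect.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Sequence


class Source(Enum):
    """Hypothetical tags for the three supervision camps."""
    ROBOT_NATIVE = "robot_native"        # observations paired with robot actions
    PASSIVE_VIDEO = "passive_video"      # human or internet video, no actions
    SYNTHETIC = "synthetic_rollout"      # simulation or world-model rollouts


@dataclass
class Episode:
    source: Source
    observations: Sequence[str]          # placeholder frame identifiers
    actions: Optional[Sequence[Sequence[float]]] = None  # robot commands, if any
    task_label: Optional[str] = None
    success: Optional[bool] = None


def missing_signals(ep: Episode) -> List[str]:
    """List the supervision fields that are absent from an episode."""
    fields = {"actions": ep.actions, "task_label": ep.task_label, "success": ep.success}
    return [name for name, value in fields.items() if value is None]


def needs_grounding(ep: Episode) -> bool:
    """An episode needs a grounding step if any supervision field is missing."""
    return bool(missing_signals(ep))


# A passive video carries only observations; a robot demonstration carries everything.
video = Episode(Source.PASSIVE_VIDEO, ["frame0", "frame1"])
demo = Episode(Source.ROBOT_NATIVE, ["frame0"], actions=[[0.1, 0.2]],
               task_label="pick", success=True)

print(missing_signals(video))  # every supervision field is absent
print(needs_grounding(demo))   # nothing is missing, so no grounding step
```

In this framing, the grounding problem amounts to filling in the `None` fields of the second camp, rather than collecting more episodes of the first.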

Components for Physical Intelligence

We map the four missing components of physical intelligence and compare their roles.

The field map treats each missing component as a distinct “camp” and asks how they differ in input, output, and core function.

Physical intelligence is built from four complementary modules that turn raw, heterogeneous experience into robot‑usable supervision.

Data interface: a system that aligns heterogeneous streams (video, motion capture, tactile, robot logs, language) and extracts robot-relevant labels such as object states, contacts, task phases, latent actions, and rewards.

Embodiment interface: a mapping that converts latent physical actions into embodiment-specific robot commands while preserving the task-relevant physical effect.

World-model interface: a predictive model that estimates future object-centric states, contacts, forces, and constraints conditioned on actions and goals.

Reward interface: a closed-loop system that monitors execution, grounds task-conditioned rewards, and feeds the resulting supervision back into the other three components.
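One way to picture how the four components described above might fit together is as four small interfaces wired into a single grounding loop. The sketch below is an assumption-laden illustration, not the authors' design: every class and method name (`DataInterface`, `retarget`, `ground`, and so on) is hypothetical, the toy implementations use a one-dimensional state with made-up numbers, and the feedback from the reward component to the other three is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple

State = Dict[str, float]


class DataInterface(Protocol):
    def autolabel(self, frames: List[str]) -> List[List[float]]: ...


class EmbodimentInterface(Protocol):
    def retarget(self, latent_action: List[float]) -> List[float]: ...


class WorldModelInterface(Protocol):
    def predict(self, state: State, command: List[float]) -> State: ...


class RewardInterface(Protocol):
    def progress(self, state: State, goal: str) -> float: ...


# --- Toy stand-ins so the sketch runs; none of these model real physics. ---

class ToyData:
    def autolabel(self, frames: List[str]) -> List[List[float]]:
        # Pretend every frame implies the same 1-D latent displacement.
        return [[0.25] for _ in frames]


@dataclass
class ToyEmbodiment:
    gain: float = 2.0    # made-up scale from latent units to robot units
    limit: float = 1.0   # made-up actuator limit

    def retarget(self, latent_action: List[float]) -> List[float]:
        # Scale the latent action, then clamp it to the actuator limit.
        return [max(-self.limit, min(self.limit, self.gain * a)) for a in latent_action]


class ToyWorld:
    def predict(self, state: State, command: List[float]) -> State:
        # Integrate the command into a single object coordinate.
        return {"x": state["x"] + command[0]}


@dataclass
class ToyReward:
    target: float = 1.0

    def progress(self, state: State, goal: str) -> float:
        # Fraction of the distance to the target covered, clipped to [0, 1].
        # The goal string is ignored in this toy version.
        return max(0.0, min(1.0, state["x"] / self.target))


def ground(frames: List[str], goal: str, data: DataInterface, body: EmbodimentInterface,
           world: WorldModelInterface, reward: RewardInterface) -> List[Tuple[List[float], float]]:
    """Turn unlabelled frames into (robot command, reward) supervision pairs."""
    state: State = {"x": 0.0}
    pairs = []
    for latent in data.autolabel(frames):
        command = body.retarget(latent)          # latent action -> robot command
        state = world.predict(state, command)    # imagined next state
        pairs.append((command, reward.progress(state, goal)))
    return pairs


pairs = ground(["f0", "f1"], "push block", ToyData(), ToyEmbodiment(), ToyWorld(), ToyReward())
print(pairs)
```

Using structural `Protocol` types keeps the four components swappable: any object with the right method can stand in for a component, which matches the idea of each one being a separate research problem.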

The research agenda must move beyond scaling VLA models toward building the four grounding modules that turn raw physical experience into robot‑usable supervision.
