Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

Humanoid-GPT scales motion tracking to 2 billion frames using a causal Transformer to achieve zero-shot generalization.

How can we scale humanoid motion tracking to achieve zero-shot generalization across diverse, high-dynamic motions?

Current humanoid trackers rely on shallow MLPs trained on small datasets, forcing a trade-off: a model either excels at agile in-domain motions or generalizes to unseen motions, but rarely both. Humanoid-GPT replaces these limited architectures with a causal Transformer trained on a 2-billion-frame corpus, distilling specialized motion experts into a single generalist policy. This scaling approach establishes a new performance frontier, enabling robust zero-shot tracking of complex, dynamic motions on real hardware without task-specific fine-tuning.

Paper Primer

The core mechanism hinges on a two-stage pipeline: first, training multiple Reinforcement Learning (RL) experts on motion clusters defined by Harmonic Motion Embedding (HME), then distilling these experts into a unified causal Transformer. The Transformer acts as a sequence model that takes proprioceptive states and reference poses as tokens, using causal attention to predict motor targets in real-time.

The HME representation is like a musical score analyzer: it extracts periodic frequencies and amplitudes from raw motion data to group similar movements, ensuring the model learns from a balanced distribution rather than just the most frequent motion styles.

Humanoid-GPT achieves superior zero-shot generalization and tracking precision compared to MLP-based baselines.

In simulation, the largest model (Humanoid-GPT-L) achieves a 92.58% success rate on unseen motions, while the best MLP baseline lags by 30% in tracking precision (MPKPE). The training corpus is over 200× larger than prior tracker datasets, totaling 2 billion frames.

Scaling analysis reveals that while MLP and TCN architectures saturate quickly as data increases, the Transformer-based Humanoid-GPT continues to improve, confirming that the architecture is the primary bottleneck for scaling humanoid control.

Why use a Transformer instead of the standard MLP controllers used in previous humanoid trackers?

MLP capacity saturates as data grows, whereas the Transformer's causal attention structure scales cleanly with both model size and data volume, allowing it to capture long-horizon dependencies that MLPs miss.

What is the role of the "Harmonic Motion Embedding" (HME) in this system?

HME provides a way to cluster motions by their periodic characteristics, enabling the authors to sample training data in a distribution-balanced way so that rare but important behaviors are not drowned out by common motion styles.

The authors demonstrate that scaling is not just about adding more data; it requires a fundamental shift in architecture (to causal Transformers) and data processing (to balanced, diversity-aware sampling) to avoid the performance plateaus seen in previous work.

For researchers in embodied AI, this paper shifts the paradigm from "curated small-scale imitation" to "large-scale generative sequence modeling," suggesting that humanoid control can follow the same scaling laws as language and vision models.

Introduction and Motivation

We expose the agility‑generalization trade‑off in humanoid tracking and motivate scaling with a causal Transformer.

Current humanoid motion trackers are shallow MLPs trained on tiny datasets, which forces a trade‑off between agility and zero‑shot generalization. When a tracker excels at fast, dynamic motions it typically collapses on unseen styles, and vice‑versa. Scaling data and model capacity promises to break this trade‑off.

Shallow MLP trackers map recent observations to joint targets using a few fully‑connected layers, which limits both capacity and temporal context.

How do shallow MLP trackers differ from the causal Transformer used in Humanoid‑GPT?

MLPs provide a fixed, shallow mapping and cannot capture long‑range temporal dependencies, whereas the causal Transformer attends over a window of past observations with self‑attention, and its capacity scales with model size and data.

**Figure 1.** Overview of Humanoid-GPT, featuring large-scale data integration, a causal transformer architecture, and comparative scaling metrics against existing methods, alongside demonstrations of real-time teleoperation and zero-shot physical tasks.

The agility‑generalization trade‑off in humanoid control can be broken by scaling data and model capacity with a causal Transformer.

Related Work

Key prior datasets and trackers that shape scalable motion modeling.

Large‑scale motion datasets have become essential for learning generalizable human motion tracking. Early datasets offered high‑quality but studio‑constrained motions, limiting diversity. Recent video‑based reconstruction and synthetic generation have expanded coverage, adding physically consistent motions with contact modeling, joint constraints, and reduced foot‑sliding, which provide stronger physical priors.

Physics‑based tracking aims to produce temporally coherent, dynamically feasible whole‑body control from reference motions. Early works established the paradigm by coupling imitation with contact‑aware stability in simulation. Subsequent pipelines extended these ideas to real‑world deployment on specific platforms.

Recent efforts such as GMT use a Mixture‑of‑Experts with adaptive sampling, while UniTracker adopts a CVAE‑based teacher‑student framework to broaden coverage. SONIC scales an MLP controller to 100 M frames, but MLP capacity saturates as data grows. HumanPlus introduces a Transformer controller trained with standard PPO, yet it does not exploit the parallelism advantage inherent to Transformers.

Humanoid‑GPT reframes tracking as GPT‑style sequence modeling, distilling hundreds of RL experts into a causal Transformer trained on 2 B frames. This yields strong zero‑shot tracking and generalization where similarly sized MLPs plateau. The approach leverages Transformer scalability to overcome the agility‑generalization trade‑off of shallow MLP trackers.

Data Curation and HME

We curate a diverse, physically consistent motion corpus and embed it with a harmonic representation.

Constructing a high‑quality motion dataset is essential for zero‑shot humanoid tracking because existing collections either lack sufficient action categories or contain physically implausible poses. We therefore merge AMASS, LAFAN1, MotionMillion, and PHUMA, retarget every clip to the 29‑DoF Unitree‑G1 skeleton, discard sequences with object interactions, and apply uniform time‑warping to enlarge the corpus by roughly five times.
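The time-warping step can be illustrated with a minimal sketch. It assumes a clip is a list of joint-angle frames and that warping means resampling at a constant speed factor with linear interpolation. The function name `time_warp` and the five speed factors are hypothetical; the source only states that the corpus grows by roughly five times.

```python
from typing import List, Sequence

Frame = List[float]  # one pose: a flat list of joint angles


def time_warp(clip: Sequence[Frame], speed: float) -> List[Frame]:
    """Resample a clip at a constant speed factor using linear interpolation.

    speed > 1 plays the motion faster (fewer frames); speed < 1 slows it down.
    """
    if len(clip) < 2 or speed <= 0:
        raise ValueError("need at least two frames and a positive speed")
    last = len(clip) - 1
    n_out = int(last / speed) + 1  # output frame count at the same frame rate
    warped = []
    for i in range(n_out):
        src = min(i * speed, last)  # fractional index into the source clip
        lo = int(src)
        hi = min(lo + 1, last)
        frac = src - lo
        warped.append([(1 - frac) * a + frac * b for a, b in zip(clip[lo], clip[hi])])
    return warped


# Hypothetical speed factors (an assumption, not taken from the paper).
SPEEDS = (0.5, 0.75, 1.0, 1.25, 1.5)

clip = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4], [0.3, 0.6], [0.4, 0.8]]
augmented = [time_warp(clip, s) for s in SPEEDS]
```

Each source clip yields one variant per speed factor, so five factors give five clips of different lengths. A real pipeline would presumably interpolate rotations properly rather than blending raw angles.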

HME compresses an entire motion sequence into a short vector that captures the periodic joint behavior, making it easy to compare and cluster motions.

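Consider a toy clip with two joints and two amplitude estimates per joint: 0.4 and 0.6 for joint 1, 0.8 and 0.9 for joint 2. Compute the per‑joint mean amplitude: $(0.4+0.6)/2 = 0.5$ for joint 1, $(0.8+0.9)/2 = 0.85$ for joint 2.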

Compute the per‑joint amplitude standard deviation: $\sqrt{[(0.4-0.5)^2+(0.6-0.5)^2]/2}=0.1$ for joint 1, $\sqrt{[(0.8-0.85)^2+(0.9-0.85)^2]/2}=0.05$ for joint 2.

Do the same for frequencies, yielding means $0.95\,$Hz and $0.55\,$Hz and stds $0.05\,$Hz and $0.05\,$Hz.

Concatenate all four statistics into the HME vector $[0.5,0.85,0.1,0.05,0.95,0.55,0.05,0.05]$.

When clustering, this 8‑dimensional vector is compared with others using Euclidean distance.

The HME vector preserves both the typical joint motion (means) and its variability (stds), which a plain pose average would discard.
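The worked example above can be reproduced in a few lines of Python. This is a sketch of the toy statistics only, not the authors' HME implementation. The frequency samples (0.9 and 1.0 Hz for joint 1, 0.5 and 0.6 Hz for joint 2) are assumed values chosen to match the stated means and standard deviations, and `hme_vector` is a hypothetical helper name.

```python
from math import dist
from statistics import fmean, pstdev

# Toy per-joint harmonic parameters. Amplitudes come from the worked example;
# the frequency samples are assumptions consistent with its means and stds.
amplitudes = {"joint_1": [0.4, 0.6], "joint_2": [0.8, 0.9]}
frequencies_hz = {"joint_1": [0.9, 1.0], "joint_2": [0.5, 0.6]}


def hme_vector(amps, freqs):
    """Concatenate amplitude means, amplitude stds, frequency means, frequency stds."""
    joints = sorted(amps)
    vec = [fmean(amps[j]) for j in joints]
    vec += [pstdev(amps[j]) for j in joints]  # population std, as in the example
    vec += [fmean(freqs[j]) for j in joints]
    vec += [pstdev(freqs[j]) for j in joints]
    return vec


embedding = hme_vector(amplitudes, frequencies_hz)

# A second, hypothetical clip with larger joint-1 amplitudes, for comparison.
other = hme_vector({"joint_1": [0.9, 1.1], "joint_2": [0.8, 0.9]}, frequencies_hz)
distance = dist(embedding, other)  # Euclidean distance used for clustering
```

The two clips differ only in the joint-1 amplitude mean, so their distance equals that difference; clips with similar rhythm and amplitude land close together.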

How does HME differ from a simple pose‑averaging embedding?

Pose averaging collapses a clip to a single static skeleton, losing any rhythmic information. HME, by contrast, encodes the sinusoidal amplitudes and frequencies of each joint and also records their variance, so two clips with identical average poses but different motion styles end up in distinct clusters.

**Figure 3.** Comparison of dataset diversity in the HME embedding space. Each bubble represents a dataset; the horizontal and vertical axes denote gstd and log-volume, respectively, and the bubble size reflects the relative number of motion clips. Upper-right bubbles indicate broader coverage and higher diversity.

By clustering the corpus with HME we obtain a set of motion experts, each trained on a coherent subset of clips; this partitioning balances coverage and training efficiency for the downstream Humanoid‑GPT policy.

Training Motion Experts

We train clustered PPO experts with keypoint‑level rewards to cover diverse motions.

Training a single shallow tracker cannot capture the full dynamic range of human motion, so we partition the dataset into clusters and train a dedicated expert for each.

Each expert is a PPO policy that receives the current reference pose and the robot’s privileged state, then outputs joint torques that try to imitate the reference while staying physically stable.

How does a motion expert differ from a vanilla PPO policy trained on a single task?

Besides the usual PPO objective, the expert conditions on a continuously changing reference pose $q_{\text{ref}}$ and receives a dense keypoint‑level reward that jointly penalizes position, rotation, and velocity errors. This forces the policy to track a moving target rather than maximize a static reward.

At time $t$, position residuals are $e_{1,t}=(0.04, -0.02, 0.01)$ m and $e_{2,t}=(-0.03, 0.05, -0.02)$ m.

Velocity residuals are $\dot e_{1,t}=(0.1, -0.05, 0.0)$ m/s and $\dot e_{2,t}=(-0.08, 0.12, -0.04)$ m/s.

Rotation errors are $\theta_{1,t}=0.03$ rad and $\theta_{2,t}=0.07$ rad.

Compute per‑keypoint penalties with weights $w_1=1.0$, $w_2=0.5$ and scales $\alpha_{\text{pos}}=1.0$, $\alpha_{\text{rot}}=0.5$, $\alpha_{\text{vel}}=0.2$: $p_{1}=w_1\big(\alpha_{\text{pos}}\|e_{1,t}\|^2+\alpha_{\text{rot}}\theta_{1,t}^2+\alpha_{\text{vel}}\|\dot e_{1,t}\|^2\big)=1.0(1.0\cdot0.0021+0.5\cdot0.0009+0.2\cdot0.0125)=0.00505$; $p_{2}=0.5(1.0\cdot0.0038+0.5\cdot0.0049+0.2\cdot0.0224)\approx0.00537$.

Sum the penalties: total reward penalty $=p_{1}+p_{2}\approx0.0104$. The expert receives $r_t \approx -0.0104$, encouraging it to reduce the residuals.

The weighted, scaled formulation lets the trainer emphasize critical limbs (higher $w_k$) and prioritize position over rotation or velocity by adjusting the $\alpha$ factors.
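The toy penalty can be evaluated directly, which is a quick way to check the arithmetic. The sketch below uses the residuals from the example together with the weights and scale factors that appear in its formula ($w_1=1.0$, $w_2=0.5$, $\alpha_{\text{pos}}=1.0$, $\alpha_{\text{rot}}=0.5$, $\alpha_{\text{vel}}=0.2$). The dictionary layout and function names are illustrative assumptions, and this quadratic form is the toy version, not necessarily the reward used in training.

```python
ALPHA_POS, ALPHA_ROT, ALPHA_VEL = 1.0, 0.5, 0.2

# Residuals for the two keypoints in the worked example.
keypoints = [
    {"w": 1.0, "e_pos": (0.04, -0.02, 0.01), "e_vel": (0.1, -0.05, 0.0), "theta": 0.03},
    {"w": 0.5, "e_pos": (-0.03, 0.05, -0.02), "e_vel": (-0.08, 0.12, -0.04), "theta": 0.07},
]


def sq_norm(v):
    """Squared Euclidean norm of a 3-vector."""
    return sum(x * x for x in v)


def keypoint_penalty(kp):
    """Weighted sum of squared position, rotation and velocity errors."""
    return kp["w"] * (
        ALPHA_POS * sq_norm(kp["e_pos"])
        + ALPHA_ROT * kp["theta"] ** 2
        + ALPHA_VEL * sq_norm(kp["e_vel"])
    )


penalties = [keypoint_penalty(kp) for kp in keypoints]
reward = -sum(penalties)
print(penalties, reward)
```

Raising a keypoint's weight or one of the scale factors changes its share of the total in proportion, which is how the formulation emphasizes particular limbs or error types.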

**Figure 2.** Overview of Humanoid-GPT. The system consists of three stages: (a) data curation and processing, (b) training PPO-based motion experts on clusters with keypoint-level rewards, and (c) distilling all experts into a single Transformer-based generalist policy via parallel DAgger supervision. The resulting Humanoid-GPT can take unseen or online retargeted motions as reference inputs and track them in a fully zero-shot manner.

Humanoid‑GPT later distills all experts into a single causal Transformer, yielding a zero‑shot tracker that can follow any new motion without further finetuning.

Distillation and Transformer Policy

Distilling multiple motion experts into a single causal Transformer policy via DAgger.

Training a single policy that reproduces the behavior of hundreds of motion experts would normally require many separate rollouts. By framing the problem as sequence modeling, one causal Transformer can receive expert supervision for every timestep of a history window in a single forward pass.

We treat distillation as a sequence‑modeling task so a causal Transformer can learn from many expert timesteps at once, rather than copying a single step per example.

Form the token sequence $[e_{t-2}, e_{t-1}, e_{t}]$ and feed it to the causal Transformer $G_{\theta}$.

The Transformer produces predicted actions $[\tilde{a}_{t-2}, \tilde{a}_{t-1}, \tilde{a}_{t}]$ in a single forward pass.

Stack the expert actions into $\hat{a}_{t-2:t}= (0.2, 0.3, 0.4)$ using the concatenation formula.

Compute the SmoothL1Loss $l = L([\tilde{a}_{t-2},\tilde{a}_{t-1},\tilde{a}_{t}],\; (0.2,0.3,0.4))$.

Back‑propagate $l$ to update $\theta$, completing one DAgger distillation step for the window.

This toy example shows that a single forward pass supplies supervision for three timesteps; with a history window of length $H$ (here $H=3$), the number of required rollouts drops by a factor of $H$.
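The loss computation for one window can be sketched without a deep-learning framework. The function below follows the standard SmoothL1 definition with $\beta=1$ and mean reduction, which matches the defaults of PyTorch's `SmoothL1Loss`. The predicted actions are hypothetical stand-ins for the Transformer outputs, and actions are scalars here only for brevity.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean SmoothL1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


expert_actions = (0.2, 0.3, 0.4)      # expert targets from the toy example
predicted_actions = (0.25, 0.3, 0.1)  # hypothetical Transformer outputs

# One forward pass over the window yields one supervised term per position.
window_loss = smooth_l1(predicted_actions, expert_actions)
supervised_positions = len(expert_actions)  # H = 3 in this example
```

Because every position in the window contributes a term, a window of length $H$ provides $H$ supervised predictions per forward pass.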

How does this DAgger Distillation differ from ordinary behavior cloning?

Behavior cloning trains on isolated state‑action pairs, ignoring the effect of earlier predictions. DAgger Distillation feeds a whole history window into a causal Transformer, supervises every output position with the expert, and thus learns to correct its own predictions over multiple timesteps in one pass.

Experimental Setup

Scaling up data and model size yields steady gains in zero‑shot humanoid tracking.

The paper showed that Humanoid‑GPT distills a billion‑frame motion corpus into a causal Transformer policy, overcoming the agility‑generalization trade‑off of shallow MLP trackers. Here we examine how that scaling translates into zero‑shot tracking performance.

The training reward $R_{\text{kpt}}(t)$ combines position, rotation, velocity, and penalty terms, each penalized exponentially to keep deviations small. Experts are evaluated on root‑pose error, velocity error, and stable‑tracking duration before being admitted to the motion‑prior library.

Unitree‑G1 is the 30‑kg, 29‑DoF humanoid robot used for all real‑world tracking benchmarks in this work.

Increasing both training tokens and model parameters yields consistent improvements in zero‑shot stability (SR) across all backbones.

Table 2 shows SR rising from 76.89 % for a 3‑layer MLP (2 M tokens) to 92.58 % for Humanoid‑GPT‑L (2 B tokens, 80.4 M parameters).

Beyond sheer scale, we probe two orthogonal factors: the diversity of motion data and the architectural choice. Diversity analyses (Section 8) reveal that balanced coverage of locomotion, manipulation, and acrobatics correlates with higher zero‑shot SR. Architectural experiments (Section 9) confirm that the causal Transformer captures long‑horizon dynamics more robustly than TCNs or shallow MLPs.

Data Diversity Analysis

Dataset diversity directly boosts zero‑shot humanoid tracking performance.

Our curated dataset achieves roughly 4–5× larger log‑volume than AMASS, indicating far greater latent coverage.

Measured log‑volume on 10,000 HME embeddings per dataset shows a 4–5× increase over AMASS.

We compare three progressively larger collections—AMASS, AMASS+LAFAN, PHUMA—and our curated dataset, which unifies all four sources, to assess how motion variety influences zero‑shot tracking.

The ≈4–5× log‑volume boost demonstrates that aggregating multiple sources dramatically widens the latent motion manifold, which in turn supplies a more expressive prior for zero‑shot humanoid tracking.

Simulation Evaluation

Zero-shot tracking improves as data and model size increase in simulation.

Scaling the motion corpus to 2 B frames reduces zero‑shot MPJPE to 0.094 rad, the lowest error observed across all configurations.

Figure 7 shows a monotonic decline in MPJPE as data scale grows, reaching 0.094 rad at 2 B frames.

**Table 2.** Comparison of backbone architectures and scaling effects. Larger datasets and higher-capacity Transformers consistently improve stability and zero-shot tracking accuracy across all metrics.

**Figure 7.** Data Scaling up Curve on Zero-shot Performance.

**Figure 8.** Model Scalability Comparison.

Real-World Evaluation

Humanoid‑GPT scales humanoid motion tracking by distilling a motion corpus into a causal Transformer, beating the agility‑generalization trade‑off.

We now assess how the distilled Humanoid‑GPT tracker behaves on a physical Unitree‑G1 robot when faced with motions it has never seen during training.

Humanoid‑GPT‑B reduces MPJPE by up to 30 % compared to the strongest baseline on unseen real‑world motions.

Table 3 shows Humanoid‑GPT‑B achieving 0.6209 mm MPJPE versus 0.7821 mm for the best baseline (Any2Track) on the Gokuraku Joudo clip, a reduction of $(0.7821-0.6209)/0.7821 \approx 21\%$ on this clip.

Our final optimization runs about 5× faster than the TWIST pipeline.

Figure 5 reports a mean latency of 0.39 ms for the C++ COMM implementation versus roughly 2.0 ms for TWIST, a five‑fold speed‑up.

**Figure 4.** Real-world experiments for our Humanoid-GPT. All motions illustrated are excluded from training to verify generalization capability. Our method can track diverse, complex and high-dynamic motion in a zero-shot manner.

**Figure 5.** Comparison of inference latency among different optimization methods. Our final optimization is about 5 times faster than TWIST.

In an online whole‑body teleoperation demo, a live MoCap stream drives the robot in real time; the tracker directly maps actor poses to joint commands, preserving balance and fluidity across squats, turns, and expressive arm gestures.

Additional Visualizations

Humanoid‑GPT delivers sub‑1.5 ms inference latency while scaling to billions of frames and enabling diverse zero‑shot behaviors.

Humanoid‑GPT achieves real‑time control with sub‑1.5 ms inference latency, beating a comparable MLP baseline by 3.5 ms.

Measured end‑to‑end on a single NVIDIA RTX 4090 GPU using the optimized C++ streaming pipeline.

We demonstrate the tracker on a real Unitree‑G1 robot, performing teleoperation tasks such as basketball shooting, collaborative box carrying, and a roll‑over‑and‑stand‑up maneuver, as well as zero‑shot dance routines captured from video and retargeted to the robot’s kinematics.

The formal scaling‑law analysis reveals a monotonic relationship between dataset diversity and generalization, with diminishing marginal gains beyond 200 M tokens, which suggests that model capacity rather than data volume becomes the limiting factor at that point.

Training on token counts $T \in \{2\text{M}, 20\text{M}, 200\text{M}, 2\text{B}\}$ shows that performance improves with more data, yet the gain from 200 M to 2 B tokens is modest.

Comparing a Transformer‑B model to an MLP of similar parameter count, both trained on 2 B tokens, the Transformer continues to improve with training progress while the MLP plateaus early, confirming superior scalability.

On latency, a comparable MLP baseline exceeds 5 ms, violating the strict real‑time budget required for whole‑body humanoid control.

Ablation Studies and Implementation

We detail how component removals affect Humanoid‑GPT performance, confirming each design choice.

Recall that Humanoid‑GPT scales humanoid motion tracking by distilling a billion‑frame corpus into a causal Transformer, sidestepping the agility‑generalization trade‑off of shallow MLP trackers.

**Figure 10.** Ablation studies for Humanoid-GPT.

Coarse clustering (128 experts) mixes heterogeneous motions, degrading teacher tracking, while overly fine clustering (1024 experts) inflates training cost and yields conflicting student guidance; 384 clusters strike the best trade‑off.

Extending the transformer’s history improves accuracy monotonically up to 64 frames, but the quadratic runtime growth forces a practical default of 32 frames.

Increasing the number of DAgger rollout environments to 32 K prevents overfitting to a limited set of reference motions, ensuring the distilled policy sees the full motion distribution.

RL experts are trained with PPO in MuJoCo episodes of 600–1200 frames at $50\,\text{Hz}$; rollouts terminate on falls, joint‑limit violations, or timeout.

Domain randomizations span three categories: terrains (floor friction $U(0.3,2.0)$, max height $U(10.0,16.0)$, noise parameters), external forces (interval $U(0.1,1.0)$, velocity $U(5.0,10.0)$), and physical property changes (DoF friction $U(0.5,2.0)$, armature $U(1.0,1.05)$, torso CoM $U(-0.15,0.15)$, mass $U(-3.0,6.0)$, position jitter $U(-0.05,0.05)$).
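A per-episode sampler for these ranges might look like the sketch below. The uniform bounds are copied from the text; the key names, the omission of units, and the choice to draw every parameter independently once per episode are assumptions rather than details confirmed by the source.

```python
import random

# Uniform ranges copied from the text. Units and whether a value is an
# absolute quantity, an offset, or a scale factor are not stated there.
RANGES = {
    "floor_friction": (0.3, 2.0),
    "terrain_max_height": (10.0, 16.0),
    "push_interval": (0.1, 1.0),
    "push_velocity": (5.0, 10.0),
    "dof_friction": (0.5, 2.0),
    "armature": (1.0, 1.05),
    "torso_com": (-0.15, 0.15),
    "mass": (-3.0, 6.0),
    "position_jitter": (-0.05, 0.05),
}


def sample_randomization(rng):
    """Draw one value per parameter from its uniform range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}


rng = random.Random(0)  # fixed seed so the draw is reproducible
episode_params = sample_randomization(rng)
```

Keeping the ranges in one table makes it easy to widen or narrow a single category when tuning for sim-to-real transfer.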

Table 6 lists reward‑weight coefficients: lower‑body keypoints $w_k=1.5$, upper‑body $w_k=0.75$, position $\alpha_{pos}=1.0$, orientation $\alpha_{rot}=2.0$, linear velocity $\alpha_{vel}=0.03$.

Table 7 specifies DAgger BC hyperparameters: $32768$ environments, batch size $32768$, gradient clipping $1.0$, learning rate $1\times10^{-4}$, $12$ layers, channel dimensions $[256,384,768]$, optimizer AdamW, $200\text{k}$ training iterations.

The t‑SNE plot reveals that our dataset occupies a substantially larger latent region than the combined AMASS and LAFAN collections, indicating richer motion diversity.

Deployment runs on a single NVIDIA RTX 4090 with an Intel Core i9‑14900KF; the model is exported to ONNX (FP32) and optimized with TensorRT, achieving a closed‑loop control frequency of $50\,\text{Hz}$.

Appendix Figures

Illustrates zero-shot tracking across diverse motions and shows the motion dataset composition.

**Figure 6.** Additional Real-world experiments for our Humanoid-GPT. All motions illustrated are excluded from training to verify generalization capability. Our method can track diverse, complex and high-dynamic motion in a zero-shot manner, especially various dance motions.

**Figure 9.** Data distribution visualization.

Conclusion and Future Work

Humanoid‑GPT unifies agile, stable, zero‑shot humanoid motion tracking at scale.

Humanoid‑GPT scales humanoid motion tracking by distilling a diverse, billion‑frame motion corpus into a causal Transformer policy, achieving unified agility, stability, and zero‑shot generalization.

In simulation and on the real Unitree‑G1 robot, the system transfers strongly without any fine‑tuning, enabling reliable real‑time whole‑body imitation.

Future work will incorporate richer modalities such as contacts, vision, or language, and extend the framework to interactive or multi‑agent scenarios.

We also see potential in coupling Humanoid‑GPT with longer‑horizon planning or VLA‑style instruction toward more general‑purpose embodied foundation models.

Science of Scale: to our knowledge, Humanoid‑GPT is the first zero‑shot motion tracker trained on 2 B frames, a dataset over 200× larger than those of prior trackers; scaling required redesigning the reward and retuning key hyperparameters.

Modern Structure: a scalable causal Transformer respects the online constraint of no future observations and scales more efficiently than shallow MLP or non‑causal alternatives.

Balanced Diversity Matters: HME Representation Learning provides diversity‑aware, distribution‑balanced sampling, and we find both diversity and balance are critical for a general tracker.

Scale and Data

We build a 2‑billion‑frame motion corpus and show scaling enables video‑estimated motion to improve tracking.

Scaling the training data to billions of frames forces us to revisit reward design and hyper‑parameter tuning, because tricks that work on modest datasets break at this regime. The resulting corpus, merged from AMASS, LAFAN1, MotionMillion, and PHUMA and enlarged by time‑warping, provides the first systematic evidence that video‑estimated motion can boost zero‑shot tracking when both model capacity and data volume are large.

Balanced Diversity and Results Overview

We outline the causal Transformer design, balanced diversity sampling, and scaling insights that drive Humanoid-GPT.

Motion tracking for control must be causal: the policy cannot see future observations. Existing trackers often rely on non‑causal designs or capacity‑limited MLPs, which stall early. We therefore adopt a scalable Transformer with GPT‑style causal attention that predicts per‑joint PD targets using temporal attention, matching the deployment constraint by construction.
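The causal constraint can be made concrete with a small single-head attention sketch in plain Python. It is a simplification under stated assumptions: queries, keys, and values are the input tokens themselves (no learned projections), and `causal_attention` is a hypothetical name. The output at step $t$ is computed only from tokens $0..t$.

```python
import math


def causal_attention(tokens):
    """Single-head self-attention where position t attends only to positions <= t."""
    d = len(tokens[0])
    outputs = []
    for t, query in enumerate(tokens):
        visible = tokens[: t + 1]  # causal mask: drop all future tokens
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        norm = sum(weights)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, visible)) / norm for i in range(d)]
        )
    return outputs


history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(history)

# Changing the newest token must not alter outputs for earlier timesteps.
perturbed = causal_attention(history[:2] + [[5.0, -3.0]])
```

Because earlier outputs never depend on later tokens, the same network can be trained on whole windows and then run step by step at deployment.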

More data alone does not guarantee better generalization because common motion styles dominate the distribution, drowning out the rare but crucial behaviors in the long tail. To address this, we employ Harmonic Motion Embedding (HME), a representation that quantifies and organizes motion diversity directly from raw data, enabling diversity‑aware, distribution‑balanced sampling during training. Our analysis shows that both diversity and balance are required: without balance, frequent modes are over‑fit; without diversity, capability is capped.
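One simple way to realize distribution-balanced sampling is to pick a cluster uniformly and then a clip uniformly within it. The sketch below contrasts this with naive sampling over all clips. The cluster labels and sizes are made up for illustration, and the source does not state that this exact two-level scheme is the one used.

```python
import random

# Hypothetical clusters (in the paper they would come from HME clustering).
clusters = {
    "walk": [f"walk_{i}" for i in range(100)],         # common style
    "cartwheel": [f"cartwheel_{i}" for i in range(5)],  # rare style
}


def balanced_sample(clusters, n, rng):
    """Pick a cluster uniformly, then a clip uniformly inside that cluster."""
    ids = sorted(clusters)
    return [rng.choice(clusters[rng.choice(ids)]) for _ in range(n)]


def naive_sample(clusters, n, rng):
    """Pick clips uniformly from the pooled corpus, ignoring clusters."""
    pool = [clip for cid in sorted(clusters) for clip in clusters[cid]]
    return [rng.choice(pool) for _ in range(n)]


def rare_fraction(samples):
    return sum(s.startswith("cartwheel") for s in samples) / len(samples)


rng = random.Random(0)
balanced_rare = rare_fraction(balanced_sample(clusters, 10_000, rng))
naive_rare = rare_fraction(naive_sample(clusters, 10_000, rng))
```

Under naive sampling the rare style appears in about 5 of every 105 draws, while the balanced scheme shows it about half the time, so rare behaviors receive comparable training signal.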

Combining the causal Transformer and balanced‑diversity sampling, Humanoid‑GPT markedly improves agility and zero‑shot tracking performance. We also derive a scaling law that relates tracking performance to data scale and model capacity, providing a concrete roadmap for future whole‑body control systems.

Prior work either applies Transformers to limited motion hours or scales MLP‑based policies on hundreds of millions of frames. Humanoid‑GPT is, to our knowledge, the first system that (i) distills a large library of RL motion experts into a single GPT‑style tracker, (ii) trains on a curated 2 B‑frame corpus, and (iii) systematically characterizes how data scale, model scale, and diversity balance jointly govern zero‑shot agile motion tracking on real humanoid hardware.
