LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

LLaVA-OneVision-2 uses codec-stream tokenization to allocate visual tokens based on bit-cost dynamics rather than fixed frame sampling.

How can we replace uniform frame sampling in video-language models with codec-aware tokenization to improve temporal grounding and efficiency?

Current vision-language models reduce video to a set of uniformly sampled frames, which discards continuous motion dynamics and wastes tokens on redundant, predictable visual content. The authors introduce codec-stream tokenization: a method that treats compressed video as a continuous bit-cost stream, using bit-cost spikes and motion residuals to adaptively concentrate visual tokens on event-bearing transitions. This approach enables stable long-video compression and significantly improves temporal grounding, with the 8B model outperforming Qwen3-VL-8B by 44.8 points on the new JumpScore benchmark.

Paper Primer

The core mechanism hinges on treating video as a variable-length stream of bit-cost and motion-residual data. Instead of fixed frame slots, the model uses bit-cost dynamics to define adaptive temporal groups and motion-residual saliency to select the most informative spatial patches for the visual encoder.

Codec-stream tokenization significantly improves temporal grounding in high-frequency, repetitive motion.

On the JumpScore benchmark, which requires localizing cycle boundaries in jump-rope videos, LLaVA-OneVision-2-8B achieves 74.9 mAP, a +44.8-point lead over Qwen3-VL-8B (30.1).

Codec-stream inputs outperform uniform frame sampling under matched token budgets.

In the ablation study, codec-stream inputs yield an average improvement of +9.7 points across temporal grounding benchmarks.

Why does this approach matter for long-video understanding?

Uniform frame sampling often misses short, critical event intervals that fall between sampled frames. By allocating tokens based on bit-cost and motion residuals, the model ensures that high-information transitions are captured without needing to increase the total token budget.

Does this method replace the need for frame sampling entirely?

No. The authors observe that frame sampling remains superior for detail-sensitive queries where decisive cues are static, fine-grained, or spatially small, as dense frame observations better preserve local texture and appearance.

Researchers can now treat video as a predictive stream rather than a static frame sequence, allowing for more efficient token allocation in long-form video tasks without modifying the underlying language model architecture.

Introduction and Motivation

We expose why frame‑centric video models waste compute and propose a stream‑aware token allocation.

Current video‑language models observe video by uniformly sampling frames, which discards the continuous motion information that codecs already encode. This frame‑centric paradigm forces a fixed token budget to be spread thinly over many redundant frames, leading to wasted compute and poor long‑video reasoning.

Instead of allocating tokens by elapsed time, the model follows the video codec’s bit‑cost stream, concentrating tokens where the bit‑cost spikes and motion‑residual cues indicate perceptual change.

Compute the codec’s per‑frame bit‑cost; high‑motion segments (e.g., frames 120‑150) show spikes up to 8 kb, while static segments stay near 1 kb.

Group consecutive frames until the cumulative bit‑cost exceeds a threshold of 32 kb, yielding 200 adaptive groups.

Assign $4$ tokens to each high‑cost group (total $800$ tokens) and distribute the remaining $224$ tokens to auxiliary modalities (e.g., audio, captions).

This toy calculation shows how the token budget is re‑shaped from a uniform per‑frame allocation to a content‑driven distribution that emphasizes perceptual transitions.
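
The toy calculation above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `group_by_bit_cost` is a hypothetical helper, the per-frame costs are made-up values in the 1 kb to 8 kb range mentioned above, and the rule that the frame crossing the threshold stays in its group is an assumption. The 1,024-token total is simply the sum of the 800 and 224 tokens given in the text.

```python
def group_by_bit_cost(bit_costs, threshold):
    """Split frame indices into consecutive groups by cumulative bit-cost.

    Assumption: a group is closed as soon as its running cost exceeds
    `threshold`; the frame that crosses the threshold stays in that group.
    """
    groups, current, running = [], [], 0.0
    for idx, cost in enumerate(bit_costs):
        current.append(idx)
        running += cost
        if running > threshold:
            groups.append(current)
            current, running = [], 0.0
    if current:  # trailing frames that never reached the threshold
        groups.append(current)
    return groups


# Hypothetical stream (kb per frame): static frames near 1 kb, a burst at 8 kb.
bit_costs = [1.0] * 40 + [8.0] * 8 + [1.0] * 20
groups = group_by_bit_cost(bit_costs, threshold=32.0)
group_sizes = [len(g) for g in groups]

# Budget arithmetic from the text: 200 groups x 4 tokens, plus 224 left over.
tokens_for_groups = 200 * 4
total_budget = tokens_for_groups + 224
```

With these made-up costs the groups cover 33, 11, 5, and 19 frames, so the groups around the 8 kb burst are much shorter than the static ones.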

**Figure 1.** Roadmap of video understanding from token compression to codec-aligned perceptual intelligence. The roadmap traces the evolution from early frame/clip sampling and hand-crafted visual features, through heuristic token compression and learned token selection, to the 2026 codec-aligned paradigm represented by LLaVA-OneVision-2.

The core shift is from a frame‑centric view to a stream‑aware observation that allocates tokens where the video codec signals change.

Model Architecture

The architecture unifies codec‑stream video, sampled‑frame video, and image inputs under a shared visual‑token interface, using codec‑stream tokenization and shared attention groups.

Video MLLMs waste compute on frames that add little new information. LLaVA-OneVision-2 solves this by feeding all visual modalities into a single OneVision‑Encoder, then letting the tokenization front‑end decide which patches to keep. The encoder, a lightweight vision‑language connector, and the Qwen3 decoder remain unchanged across inputs.

Instead of allocating a fixed token slot per frame, the model groups patches by their codec bit‑cost and motion‑residual saliency, keeping only the most informative 2×2 blocks and assigning them to adaptive GOP groups that stay visible together in attention.

Compute $S_t$ for each patch; the highest‑scoring patches are (frame 0, block (0,0)) = 0.9 and (frame 4, block (1,1)) = 0.85.

Rank patches within each frame; the top two patches per frame receive ranks $\rho_t=0,1$.

Apply attenuation $A_z = A_{t,i,j}/(1+\lambda\rho_t)$ with $\lambda=0.5$, yielding attenuated scores 0.9, 0.6, 0.85, 0.57, …

Allocate $m_1=2$ P‑canvases to GOP 1 by splitting the cumulative mass curve at 0.5, selecting the top‑scoring non‑duplicate patches from frames 0‑3.

Allocate $m_2=1$ P‑canvas to GOP 2 using the same procedure, ending up with patches from frames 4‑7 that cover the remaining high‑score regions.

Even with a tiny toy video, the adaptive grouping concentrates tokens on frames with high motion‑residual energy while still preserving a balanced temporal coverage.
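
The rank attenuation step in the walkthrough can be checked with a few lines of Python. This is a sketch, not the paper's code: `attenuate_by_rank` is a hypothetical helper, and the raw score of the second-ranked patch in each frame is not given in the text, so equal raw scores within a frame are assumed here because they reproduce the listed values 0.9, 0.6, 0.85, 0.57.

```python
def attenuate_by_rank(frame_scores, lam=0.5):
    """Divide each patch score by (1 + lam * rank); rank 0 is the best patch
    within its own frame."""
    attenuated = {}
    for frame, scores in frame_scores.items():
        ranked = sorted(scores, reverse=True)
        attenuated[frame] = [s / (1.0 + lam * rank) for rank, s in enumerate(ranked)]
    return attenuated


# Assumed raw saliency scores for the top two patches of frames 0 and 4.
raw = {0: [0.9, 0.9], 4: [0.85, 0.85]}
attenuated = attenuate_by_rank(raw, lam=0.5)
rounded = {frame: [round(s, 2) for s in vals] for frame, vals in attenuated.items()}
```

The division by $1+\lambda\rho_t$ penalizes repeated picks from the same frame, which is what lets lower-ranked patches from other frames compete for a slot.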

Adaptive GOP partition and P‑canvas packing (simplified).

**Figure 2. LLaVA-OneVision-2 architecture.** The model unifies codec-stream videos, sampled-frame videos, and native-resolution images under a shared visual-token interface. Codec inputs are encoded as I/P visual canvases, sampled videos as frame-token sequences, and images as spatial visual tokens; all inputs are processed by the OneVision-Encoder. The resulting visual embeddings are combined with text tokens and decoded by a pre-trained autoregressive language model, allowing a single architecture to support video and image understanding.

**Figure 3. Codec-stream tokenization.** P/B packet bit-cost partitions the video into adaptive GOPs; motion and residual signals jointly score spatial saliency; high-score 2x2 patch blocks are selected and packed into compact I/P canvases. Each GOP yields one anchor I-canvas and multiple P-canvases carrying motion-residual evidence, producing merge-aligned visual tokens whose density follows the bit-cost-residual profile of the stream rather than fixed frame slots.

**Figure 4.** Codec-stream grouping by cumulative bit-cost. For example, 448 sampled frames are divided into 13 codec-stream groups under a cumulative bit-cost threshold of 211,461. The top panel shows the frame-level bit-cost contribution in blue, where sharp peaks typically indicate rapid motion, viewpoint changes, or abrupt visual transitions. The bottom panel shows the cumulative bit-cost within each group in orange, which resets after every group boundary and approaches the red threshold before a new group is opened. Green dashed lines mark the resulting codec-stream group boundaries, and the bottom color bands indicate the number of frames covered by each group.

How does codec‑stream tokenization differ from the uniform frame sampling used in prior video MLLMs?

Uniform sampling assigns a fixed token budget to each frame regardless of its visual change, often wasting tokens on static content. Codec‑stream tokenization instead measures the bit‑cost of each GOP, scores motion‑residual saliency per 2×2 patch, and allocates tokens adaptively—high‑motion, high‑residual regions receive more tokens while low‑activity intervals are grouped into longer GOPs, preserving the overall token budget.
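
One way to keep a fixed budget while giving busier groups more canvases is proportional allocation with largest-remainder rounding. The sketch below is a hypothetical rule chosen for illustration; the source does not specify the allocation rule, and `allocate_canvases` and the mass values are assumptions.

```python
def allocate_canvases(gop_mass, total):
    """Split `total` canvases across GOPs in proportion to their saliency mass.

    Hypothetical rule: floor the exact shares, then hand the leftover
    canvases to the GOPs with the largest fractional remainders.
    """
    mass_sum = sum(gop_mass)
    exact = [total * m / mass_sum for m in gop_mass]
    alloc = [int(x) for x in exact]
    leftover = total - sum(alloc)
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - alloc[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


# Made-up masses for two GOPs and a budget of three canvases.
canvas_alloc = allocate_canvases([19, 11], total=3)
```

Whatever the masses are, the allocations always sum to the requested total, so the overall budget is preserved.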

Training Data and Recipe

How the data pipeline evolves from image‑text foundations to codec‑aware long videos.

Standard uniform frame sampling wastes compute on frames that add little new visual information, especially for long videos where most frames are near‑duplicate.

Instead of taking every nth frame, the pipeline groups frames into variable‑length GOPs, scores each GOP by its bit‑cost (motion + residual), and keeps only the most informative patches—much like a news editor keeps the most newsworthy paragraphs while discarding filler.

Rank GOPs by bit‑cost: GOP 3 (20 bits) > GOP 1 (12 bits) > GOP 2 (5 bits) > GOP 4 (3 bits).

Select the top‑2 GOPs (GOP 3 and GOP 1) to stay within the token budget.

Unpack GOP 3 (frames 5‑6) and GOP 1 (frames 1‑2) into patch tokens; discard frames 3‑4 and 7‑8.

Feed the patch tokens from the 4 retained frames to the encoder; the model sees the most dynamic portions of the clip.

Bit‑cost ranking automatically concentrates tokens on high‑motion segments, so the model learns to attend to temporal changes without any hand‑crafted frame‑selection heuristics.
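
The ranking example above maps directly onto a short Python sketch. The bit-costs and the frame ranges of GOP 1 and GOP 3 come from the text; the frame ranges of GOP 2 and GOP 4 are inferred from the discarded frames, and `select_top_gops` is a hypothetical helper rather than part of the released pipeline.

```python
# Bit-cost per GOP, as listed in the example.
gop_bit_cost = {1: 12, 2: 5, 3: 20, 4: 3}
# Frames per GOP; GOP 2 and GOP 4 ranges are inferred from the discarded frames.
gop_frames = {1: [1, 2], 2: [3, 4], 3: [5, 6], 4: [7, 8]}


def select_top_gops(costs, k):
    """Return the ids of the k GOPs with the highest bit-cost, best first."""
    return sorted(costs, key=costs.get, reverse=True)[:k]


kept = select_top_gops(gop_bit_cost, k=2)
kept_frames = sorted(f for g in kept for f in gop_frames[g])
dropped_frames = sorted(f for g in gop_frames if g not in kept for f in gop_frames[g])
```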

**Figure 5.** LLaVA-OneVision-2 data mixtures. (a) Token-volume proportions of the video-caption corpus and (b) the spatial-reasoning corpus used during training. The video-caption mixture contains 104.1B tokens from 7.96M clips spanning four duration buckets, while the spatial mixture aggregates 4M samples drawn from six datasets covering 3D scenes, spatial reasoning, pointing, and referring expressions.

Across all four stages we keep three cross‑stage design choices: a mixed‑batch composition (≈50 % codec‑stream video, 37.5 % uniform video, 12.5 % images), a progressive frame‑budget schedule (30 → 90 → 384 → 768 frames), and a codec‑schedule that activates only in Stage 4.
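
As a rough illustration of the mixed-batch composition, the shares can be turned into per-batch sample counts. The sketch below is an assumption-laden example: the batch size of 16, the helper `batch_counts`, and the one-budget-per-stage mapping in `FRAME_BUDGET_BY_STAGE` are not stated in the source.

```python
from fractions import Fraction

# Approximate batch shares from the text, written as exact fractions.
MIX = {
    "codec_stream_video": Fraction(1, 2),
    "uniform_video": Fraction(3, 8),
    "image": Fraction(1, 8),
}

# Assumed mapping of the 30 -> 90 -> 384 -> 768 schedule onto stages 1 to 4.
FRAME_BUDGET_BY_STAGE = {1: 30, 2: 90, 3: 384, 4: 768}


def batch_counts(batch_size):
    """Number of samples per source for one batch (shares rounded down)."""
    counts = {name: int(share * batch_size) for name, share in MIX.items()}
    # Any remainder from rounding goes to the largest share.
    counts["codec_stream_video"] += batch_size - sum(counts.values())
    return counts


counts_16 = batch_counts(16)
```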

The JumpScore Benchmark

JumpScore provides 189 densely annotated videos to rigorously test sub‑second temporal grounding.

JumpScore comprises 189 in‑the‑wild jump‑rope videos, offering dense sub‑second cycle annotations.

The clips span multiple indoor scenes, camera angles, and capture devices; resolutions are at least 1280×720, with many at 1920×1080, and each clip contains tens of cycles annotated to decimal‑second precision.

**Figure 6.** Representative cycles from the JumpScore benchmark. Four clips, each decomposed into five frames spanning one jump-rope cycle. The first and last frame of every panel are ground-truth cycle starts (rope behind legs). The four panels span warehouse, office, sports-court, and tiled-corridor captures.

Standard benchmarks such as Charades‑STA, ActivityNet, and QVHighlights target one‑shot event localization where adjacent frames are already distinguishable, whereas JumpScore probes the opposite regime: fine‑grained grounding in high‑frequency, visually near‑identical cycles.

Evaluation Results

LLaVA‑OV‑2 trims wasted compute by tokenizing only high‑motion, high‑residual video frames while keeping the token budget unchanged.

We now present the end‑to‑end evaluation of LLaVA‑OV‑2 across three benchmark domains: video understanding, spatial reasoning, and image & document tasks.

LLaVA‑OV‑2‑8B attains the highest average score on the video benchmark suite, improving the overall video domain average by +4.3 points.

62.5 vs. 58.2 (Qwen3‑VL) on the 18‑task video suite (Table 1).

On the spatial‑reasoning suite, LLaVA‑OV‑2‑8B raises the average by +5.3 points.

86.0 vs. 78.0 (Qwen3‑VL) across 11 spatial benchmarks (Table 2).

JumpScore mAP reaches 74.9, a +44.8‑point improvement over the strongest 8B baseline.

74.9 vs. 30.1 (Qwen3‑VL) on the fine‑grained temporal‑localization benchmark.

**Figure 7. Codec-stream inputs versus frame sampling.** Under matched visual settings, codec-stream tokenization improves event-level temporal grounding by following high-bit-cost intervals and high-residual regions rather than uniformly sampled frame slots. We evaluate this effect on temporal grounding benchmarks, long-form video QA, and JumpScore, our fine-grained temporal-localization benchmark for high-frequency repeated motion.

**Table.** Performance comparison of LLaVA-OV-2 against other models across various benchmarks, categorized into Spatial Reasoning and Image & Document tasks.

Ablation Study

Ablations isolate the impact of codec‑stream tokenization versus uniform frame sampling.

We isolate the effect of codec‑stream tokenization by swapping it for uniform frame sampling while keeping the backbone, language model, decoder, prompts, and evaluation protocol fixed.

Codec‑stream inputs raise JumpScore by +17.3 points on average compared with uniform frame sampling.

Measured on the JumpScore benchmark under matched visual settings.

Across three temporal‑grounding benchmarks, codec‑stream tokenization yields a +9.7‑point average improvement.

Scores rise from 35.5 to 45.2 when using codec‑stream inputs.

LLaVA‑OneVision‑2‑8B attains 74.9 JumpScore mAP, surpassing Qwen3‑VL‑8B’s 30.1 by +44.8 points.

Both models were evaluated on the same JumpScore benchmark with identical token budgets.

**Figure 8.** Per-skill ablation on VideoMME-v2 at matched visual settings. Teal bars indicate capabilities where codec-stream inputs outperform frame sampling; coral bars indicate capabilities where frame sampling is stronger.

**Table 1.** Performance comparison across different token budgets for LVBench, VideoMME-L (w/ sub), MLVU-dev, and VideoEval-Pro using Fix and Stream strategies.

Comparing a fixed GOP schedule (Fix) to the adaptive Stream approach reveals that the latter matches or improves performance on all four benchmarks, with the clearest benefits on MLVU‑dev and VideoEval‑Pro.

Related Work

Positioning LLaVA-OneVision-2 among five key research directions.

We situate LLaVA-OneVision-2 within five active research strands. Each strand addresses a distinct limitation of current video MLLMs, and our codec‑stream tokenization can be applied to any of them without altering downstream components.

Uniform frame sampling picks a fixed number of equally‑spaced frames from a video, discarding everything in between.

Large‑scale multimodal models that combine a vision transformer backbone, a visual‑to‑language connector, and an instruction‑tuned LLM. The dominant design samples 8–32 uniformly spaced frames and encodes every patch.

Approaches that reduce redundancy by dropping, merging, or compressing tokens either inside the transformer stack or at the input side. Most operate on model activations; a few exploit codec representations.

Techniques that enable video LLMs to locate events in time, typically by adding timestamp tokens or specialized training objectives while still using uniform frame sampling.

Research focusing on maintaining coherent spatial state across long video horizons, often using point‑based supervision or structured spatial benchmarks.

Systems that combine an LLM with a promptable segmentation backbone (e.g., SAM 2) to track a referred object across frames given a language query.

Across all five strands, codec‑stream tokenization replaces uniform frame sampling with a content‑aware, patch‑level selection derived from the video codec. This substitution is orthogonal to model architecture, training objectives, or downstream tasks, allowing any existing system to benefit without redesign.

Video Tracking Results

Codec‑stream tokenization yields a decisive tracking boost without extra tokens.

LLaVA‑OneVision‑2‑8B attains the highest overall tracking score, beating the previous leader Qwen3‑VL‑8B by +10.2 points.

Table 6 reports 41.0 for LLaVA‑OV‑2‑8B versus 30.8 for Qwen3‑VL‑8B.

**Table 5.** Numerical values for the temporal grounding curves in Figure 7. Each benchmark is reported under matched nominal frame budgets.

**Table 6.** Results.

**Figure 9.** Temporal grounding on TimeLens-Bench. Ten cases from Charades-STA, ActivityNet-Captions, and QVHighlights; per row, mean IoU over five runs.

**Figure 10.** R-VOS on ReasonVOS — “Track the animal moving forward”. The two strips show eight evenly-sampled frames of the input RGB (top) and the corresponding SAM2-derived dense mask (bottom) for a 36-frame clip; the per-frame $(x, y)$ tracking points emitted by our model and used as SAM2 prompts are listed below the strips.

**Figure 11.** R-VOS on Ref-DAVIS17 — “Track a sport car”. Same two-strip layout as Figure 10: the top strip is the input RGB and the bottom strip is the SAM2-derived dense mask, with the model’s per-frame $(x, y)$ points printed below.

Case Studies

Real‑world video tasks demonstrate the model’s token‑efficient temporal grounding.

Temporal grounding refers to locating the exact moment a described event occurs within a video. Codec‑stream tokenization supplies a compact visual token stream that emphasizes high‑motion, high‑residual frames, enabling precise timing without expanding the token budget.

Codec‑stream sampling attains a mean IoU of 0.894 on JumpScore, far surpassing uniform sampling’s 0.116.

On a single 85‑cycle clip, codec‑stream correctly attributes 82 cycles versus 14 for uniform sampling.

The TimeLens‑Bench evaluation shows near‑perfect alignment (IoU ≥ 0.98) across ten diverse queries, confirming that the model can pinpoint events ranging from simple actions to complex multi‑second activities.
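
For readers unfamiliar with the metric, temporal IoU between a predicted and a ground-truth interval can be computed as follows. This is the standard interval-overlap definition, shown as a sketch; the benchmark's exact evaluation script is not reproduced here, and the example intervals are made up.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical prediction shifted by one second against a 4-second event.
example_iou = temporal_iou((2.0, 6.0), (3.0, 7.0))
```

A mean IoU is then the average of this value over all queries.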

On ReasonVOS, the model emits per‑frame (x, y) points that, when fed to SAM2, produce a dense mask with J&F = 0.939 and HOTA = 0.954, demonstrating robust tracking through pose changes and partial occlusions.

On Ref‑DAVIS17, the same pipeline yields J&F = 0.961 and HOTA = 0.963 despite high‑motion outdoor footage, confirming that the approach generalizes to challenging domains.

In two tabletop robot tasks, the number of predicted waypoints changes as the gripper nears the target (9→6 for the apple task, 5→3→5 for the bread task), enabling adaptive execution without re‑training.

**Figure 12.** Real-world trajectory predictions on a robot manipulation setup. Cyan polyline = trajectory order; magenta dots = waypoints; green dot = start; z is a normalised depth.

**Figure 13.** JumpScore validation: uniform vs. codec-stream sampling on a single 85-cycle clip. At matched visual-token budget, codec-stream sampling attributes 82 of 85 cycle starts (mIoU 0.894) versus 14 of 85 for uniform 128-frame sampling (mIoU 0.116); each predicted cycle start is drawn green when it lands within 0.1 s of a ground-truth start and red otherwise.
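
The 0.1 s tolerance used for coloring predictions in Figure 13 can be illustrated with a small matching routine. This is a sketch under assumptions: the greedy one-to-one matching, the helper `count_matched_starts`, and the timestamps are all hypothetical, since the source only states the tolerance.

```python
def count_matched_starts(pred, gt, tol=0.1):
    """Count predicted cycle starts within `tol` seconds of a ground-truth start.

    Assumption: greedy one-to-one matching over time-sorted lists, so each
    ground-truth start can validate at most one prediction.
    """
    pred, gt = sorted(pred), sorted(gt)
    hits, j = 0, 0
    for p in pred:
        # Skip ground-truth starts that are too early to match this prediction.
        while j < len(gt) and gt[j] < p - tol:
            j += 1
        if j < len(gt) and abs(gt[j] - p) <= tol:
            hits += 1
            j += 1
    return hits


# Made-up timestamps in seconds.
gt_starts = [0.50, 1.00, 1.50, 2.00]
pred_starts = [0.52, 1.25, 1.48, 2.30]
matched = count_matched_starts(pred_starts, gt_starts, tol=0.1)
```

Here two of the four hypothetical predictions land inside the tolerance window; the other two would be drawn red.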

**Figure 14.** 2D spatial grounding. Eight examples; predicted point overlaid in red.

**Figure 15.** 3D spatial grounding. Five pick-and-place examples; predicted 3D trajectory overlaid.

Summary and Contributions

The team behind LLaVA‑OneVision‑2 includes core contributors and project leaders.

Contributors (core contributors are set in bold in the original paper): Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu.

Project leaders are Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, and Jiankang Deng.

Codec vs Uniform Sampling Details

Comparison of codec‑stream tokenization against uniform frame sampling across frame budgets.

We compare the codec‑stream input representation to standard uniform frame sampling under matched nominal frame budgets. “Uniform” denotes the conventional evenly spaced frame selection, while “Codec” refers to the codec‑stream tokenization used by LLaVA‑OneVision‑2.

**Table.** Comparison of uniform frame sampling versus codec-stream inputs across different frame budgets for QVHighlights, Charades-STA, and ActivityNet Captions benchmarks.
