Kwai Keye-VL-2.0 Technical Report
Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang
Kwai Keye-VL-2.0 uses sparse attention and multi-teacher distillation to enable 256K-context video reasoning.
How does Keye-VL-2.0 achieve efficient long-video understanding and agentic reasoning using a Mixture-of-Experts architecture?
Multimodal models struggle to process hour-long videos because standard attention mechanisms scale quadratically, forcing developers to choose between aggressive frame subsampling or prohibitive memory costs. The authors integrate DeepSeek Sparse Attention (DSA) into a Mixture-of-Experts (MoE) architecture, allowing the model to dynamically select critical tokens and process 256K-length contexts losslessly. To prevent catastrophic forgetting during multi-task training, they use Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) to consolidate specialized agentic and reasoning capabilities. Keye-VL-2.0-30B-A3B achieves state-of-the-art performance on long-video benchmarks like LongVideoBench and Video-MME-v2 while maintaining high efficiency through a 3B-active-parameter MoE backbone.
Paper Primer
The core mechanism is a two-stage training strategy for sparse attention: a dense warm-up phase aligns the indexer with the model's attention distribution, followed by sparse adaptation where the model learns to rely on dynamically selected evidence. This is like a librarian who first learns the entire catalog, then switches to using a high-speed index to retrieve only the relevant pages for any given query.
Keye-VL-2.0 achieves superior long-context video comprehension compared to larger models.
Performance on the LongVideoBench and Video-MME-v2 benchmarks supports this: Keye-VL-2.0 scores 74.1 on LongVideoBench, outperforming Qwen3.5-35B-A3B (61.6) and InternVL3.5-241B-A28B (53.2).
To resolve the "Multimodal Alignment Dilemma"—where adding new capabilities like tool use degrades foundational reasoning—the authors use MOPD. This system routes student-generated trajectories to one of 13 domain-specific teachers, which provide token-level feedback on the student's on-policy rollouts, ensuring task-specific expertise is distilled without overwriting the base model's reasoning.
Why use a Mixture-of-Experts (MoE) architecture for this specific task?
The MoE design allows the model to maintain a large 30B parameter capacity for reasoning while keeping only 3B parameters active during inference, which is critical for maintaining throughput when processing 256K-length multimodal sequences.
How does this model handle the temporal nature of video compared to standard image-based models?
It treats video frames as independent high-resolution images but prepends a natural-language timestamp to each frame's tokens, allowing the language decoder to perceive temporal order and causality within its native text-processing space.
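The timestamp scheme described above can be sketched as a prompt builder. This is a minimal illustration, not the report's actual tokenization: the `[12.0 seconds]` wording, the `<frame>` placeholder, and the function name `build_video_prompt` are all assumptions.

```python
def build_video_prompt(frame_times, frame_placeholder="<frame>"):
    """Interleave a textual timestamp with a placeholder for each frame's visual tokens.

    The timestamp wording and the placeholder string are hypothetical; the text
    above only states that a natural-language timestamp is prepended per frame.
    """
    lines = []
    for t in frame_times:
        # The timestamp is ordinary text, so the decoder reads temporal order
        # in the same space as the rest of the prompt.
        lines.append(f"[{t:.1f} seconds] {frame_placeholder}")
    return "\n".join(lines)


prompt = build_video_prompt([0.0, 2.0, 4.0])
```

Each frame is still encoded independently; only the surrounding text carries the ordering.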
Researchers can now scale multimodal models to hour-long video inputs without quadratic memory growth, using the provided DSA kernels and MOPD distillation framework to preserve reasoning stability.
Introduction and Performance Overview
Keye‑VL‑2.0 combines MoE and DSA to enable efficient 256K‑token multimodal processing.
Large language models have rapidly expanded into multimodal reasoning, yet existing systems still choke on hour‑level video streams and multi‑task agentic workloads.
Long‑duration video and open‑ended agentic interactions demand both massive context windows and the ability to switch tasks without erasing prior knowledge.
Keye‑VL‑2.0 tackles these problems with two orthogonal advances: Multimodal DeepSeek Sparse Attention (DSA) for linear‑scale context handling, and Cross‑Modal Multi‑Teacher On‑Policy Distillation (MOPD) to keep the MoE backbone from forgetting.
**Figure 1.** Performance Comparison of Keye-VL-2.0-30B-A3B. Our model demonstrates leading capabilities against open-source models (e.g., Qwen3.5-35B-A3B, Qwen3-VL-235B-A22B) and closed-source models (Gemini-3-Flash) across fine-grained temporal localization (ActivityNet, QVHighlights, and Charades under the TimeLens framework) and extreme long-video understanding (LongVideoBench, Video-MME-v2).
Keye‑VL‑2.0 outperforms all compared models on four of six video benchmarks.
Figure 1 shows higher scores for ActivityNet‑TimeLens, QVHighlights‑TimeLens, Charades‑TimeLens, and LongVideoBench.
Keye‑VL‑2.0 delivers practical, high‑quality long‑video understanding and robust agentic capabilities at open‑source scale.
Model Architecture
Encode images at their original resolution, preserving detail while handling variable sizes efficiently.
Standard ViTs resize inputs to a fixed grid, which discards fine details crucial for OCR, document understanding, and video analysis. This loss of resolution hampers downstream performance on tasks where a tiny region determines the answer.
Instead of shrinking an image to a coarse grid, the encoder processes the picture at its native pixel dimensions, preserving both global layout and local details.
Original $2\times2$ embeddings: $p_{00}, p_{01}, p_{10}, p_{11}$.
Interpolation doubles each axis, producing embeddings $p'_{ij}$ for $i,j\in\{0,1,2,3\}$, a $4\times4$ grid that aligns one-to-one with the $4\times4$ patches.
2D RoPE rotates each query $q$ and key $k$ by angles proportional to their $(x,y)$ coordinates, yielding rotated vectors $q^{\text{rot}}$, $k^{\text{rot}}$.
Attention scores are computed as $q^{\text{rot}}\!\cdot\!k^{\text{rot}}$, preserving relative geometry despite the higher resolution.
Patch n’ Pack packs the resulting 16 patch tokens into a single batch entry without padding, feeding FlashAttention for $O(N)$ memory.
Interpolation lets a single set of learned embeddings generalize to any resolution, while 2D RoPE ensures the attention respects the true spatial relationships of the higher‑resolution patches.
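As a rough illustration of the interpolation step, the sketch below bilinearly resamples a $2\times2$ grid of learned position embeddings to a $4\times4$ patch grid. The function name `interpolate_grid`, the one-dimensional toy embeddings, and the corner-aligned sampling convention are assumptions; the report does not specify which interpolation variant is used.

```python
def interpolate_grid(grid, out_h, out_w):
    """Bilinearly resample a grid of embedding vectors (corner-aligned sampling)."""
    in_h, in_w = len(grid), len(grid[0])
    dim = len(grid[0][0])

    def source_coord(i, n_out, n_in):
        # Map an output index to a fractional input coordinate.
        return 0.0 if n_out == 1 else i * (n_in - 1) / (n_out - 1)

    out = []
    for i in range(out_h):
        y = source_coord(i, out_h, in_h)
        y0 = min(int(y), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = source_coord(j, out_w, in_w)
            x0 = min(int(x), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            vec = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out


# Toy 2x2 grid with one-dimensional "embeddings" p00, p01, p10, p11.
grid = [[[0.0], [1.0]], [[2.0], [3.0]]]
resized = interpolate_grid(grid, 4, 4)
```

The four corner values are preserved, and interior positions blend their neighbors, which is why one learned table can serve many resolutions.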
Keep the input image at its native pixel dimensions; split it into non‑overlapping patches.
Interpolate absolute position embeddings to match the patch grid size.
Apply 2D RoPE to queries and keys before attention.
Run multi‑head attention using FlashAttention for efficient memory usage.
Pack variable‑size patch sequences with Patch n’ Pack, forming a dense batch.
Output token sequence is passed to the downstream multimodal model.
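The 2D RoPE step in the list above can be sketched as follows. This is a generic formulation, assumed rather than taken from the report: half of the head dimension is rotated by the patch's $x$ coordinate and the other half by its $y$ coordinate, with the usual base of 10000. The names `rope_1d` and `rope_2d` are illustrative.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles pos * base^(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def rope_2d(vec, x, y):
    """Apply RoPE with x on the first half of the vector and y on the second half.

    Assumes len(vec) is divisible by 4.
    """
    half = len(vec) // 2
    return rope_1d(vec[:half], x) + rope_1d(vec[half:], y)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
k = [0.5, -1.0, 2.0, 0.0, 1.5, 1.0, -2.0, 3.0]

# Two query/key pairs with the same relative offset (dx=2, dy=3).
score_a = dot(rope_2d(q, 1, 2), rope_2d(k, 3, 5))
score_b = dot(rope_2d(q, 4, 4), rope_2d(k, 6, 7))
```

Because rotations compose, `score_a` and `score_b` agree up to floating-point error: the attention score depends only on the relative $(x,y)$ offset, not on absolute position.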
Only a small, learned subset of expert sub‑networks processes each token, keeping compute low while scaling model capacity.
How does MoE differ from simply widening a feed‑forward layer?
Widening increases the number of parameters that are evaluated for every token, so compute grows linearly with width. MoE keeps per‑token compute constant by activating only a few experts; the extra parameters reside in dormant experts that are never touched for that token.
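A minimal sketch of top-$k$ expert routing makes the contrast concrete. The linear router, softmax gating over the selected experts, and the toy scaling experts are assumptions for illustration; the report's actual router design is not described here.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, router_weights, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs by gate weight."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:top_k]
    gates = softmax([logits[e] for e in chosen])
    output = [0.0] * len(token)
    for gate, e in zip(gates, chosen):
        expert_out = experts[e](token)  # only the chosen experts are evaluated
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, chosen


calls = [0, 0, 0, 0]  # counts how often each toy expert runs


def make_expert(index):
    def expert(token):
        calls[index] += 1
        return [(index + 1) * x for x in token]
    return expert


experts = [make_expert(i) for i in range(4)]
router_weights = [[0.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [2.0, 0.0]]
output, chosen = moe_layer([1.0, 0.0], router_weights, experts)
```

Adding more experts grows the `experts` list (capacity) without changing how many are called per token, which is the property the answer above relies on.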
Sparse Attention for Long Contexts
DeepSeek Sparse Attention replaces quadratic attention with a token‑wise indexer and group‑wise sparse aggregation.
Full attention scales quadratically with sequence length, making 256 K‑token multimodal contexts infeasible.
The indexer scores every past token against the current query and keeps only the highest‑scoring $k$ tokens, turning a dense $L\times L$ problem into a sparse $L\times k$ one.
Head 1 scores: $[0,\,3,\,1,\,4,\,2,\,5]$ for $s=1\ldots6$.
Head 2 scores: $[1,\,0,\,2,\,3,\,0,\,4]$.
Sum across heads: $[1,\,3,\,3,\,7,\,2,\,9]$.
For query $t=6$, the top‑2 scores are $9$ (token 6) and $7$ (token 4); thus $\Omega_6=\{4,6\}$.
The sparse attention for $t=6$ will attend only to tokens 4 and 6 instead of all six candidate tokens.
Top‑$k$ selection collapses the $L$‑by‑$L$ attention matrix to $L$‑by‑$k$, cutting both memory and compute roughly by a factor of $L/k$.
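The worked example above can be reproduced in a few lines. This sketch sums raw per-head scores exactly as the example does; any head weighting or activation used by the real indexer is omitted, and the function name `select_top_k` is illustrative.

```python
def select_top_k(head_scores, k):
    """Sum index scores across heads and keep the k highest-scoring positions.

    `head_scores` holds one list per indexer head, covering positions 1..t.
    Returns the summed scores and the selected positions (1-based, sorted).
    """
    totals = [sum(column) for column in zip(*head_scores)]
    ranked = sorted(range(len(totals)), key=lambda s: totals[s], reverse=True)[:k]
    return totals, sorted(s + 1 for s in ranked)


head_1 = [0, 3, 1, 4, 2, 5]
head_2 = [1, 0, 2, 3, 0, 4]
totals, omega_6 = select_top_k([head_1, head_2], k=2)
# totals  -> [1, 3, 3, 7, 2, 9]
# omega_6 -> [4, 6]
```

The main attention then touches only the positions in `omega_6`, so its cost scales with $k$ rather than with the full sequence length.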
Why not simply use a fixed sliding window instead of a learned top‑$k$ index?
A sliding window forces the model to attend to a predetermined contiguous region, which may miss long‑range dependencies that are semantically important. The learned index scores let the model pick the most relevant tokens anywhere in the past, preserving expressive power while still being sparse.
Attention heads are grouped; each group shares a KV head and attends only to the tokens selected by the Lightning Indexer, preserving the original GQA representation while staying sparse.
How does sharing a KV head within a group differ from independent heads?
Independent heads would each compute separate key/value projections, multiplying the KV projection cost and cache size by the number of heads. Sharing a KV head means all heads in the group reuse the same projected keys/values, so only the query projection is head‑specific, which cuts KV FLOPs and memory.
The model first learns a dense attention pattern, then gradually transfers that knowledge to the sparse indexer, avoiding a cold start where the indexer would have no useful signal.
Why detach the indexer input from the computation graph during training?
Detaching prevents gradients from the sparse KL term from corrupting the dense backbone’s learned representations, ensuring the dense model remains stable while the indexer learns to approximate its behavior.
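The warm-up objective can be sketched as a KL divergence between the dense model's attention distribution and the indexer's score distribution. The exact target construction (summing dense attention over heads, then normalizing) and the function name `indexer_alignment_loss` are assumptions. Plain Python has no autograd, so the detach is only noted in comments: in a real framework the dense target and the indexer's input would be treated as constants.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def indexer_alignment_loss(dense_attn_per_head, indexer_scores):
    """KL(target || indexer) for one query position.

    dense_attn_per_head: per-head dense attention weights over past tokens
                         (treated as a constant target, i.e. detached).
    indexer_scores:      raw indexer scores over the same tokens.
    """
    target = [sum(column) for column in zip(*dense_attn_per_head)]
    norm = sum(target)
    target = [t / norm for t in target]
    predicted = softmax(indexer_scores)
    return sum(p * math.log(p / q) for p, q in zip(target, predicted) if p > 0)


dense = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
perfect_scores = [math.log(0.6), math.log(0.25), math.log(0.15)]
loss_aligned = indexer_alignment_loss(dense, perfect_scores)
loss_uniform = indexer_alignment_loss(dense, [0.0, 0.0, 0.0])
```

`loss_aligned` is zero up to rounding because the indexer already matches the dense distribution, while the uniform indexer incurs a positive loss.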
Pre-Training Curriculum
Progressively scaled training stages give the model stable multimodal abilities before it tackles ultra‑long contexts.
Scaling multimodal models to 256 K tokens confronts two problems: quadratic attention cost and catastrophic forgetting of earlier capabilities. A staged curriculum lets the model acquire basic vision‑language alignment before it is asked to reason over ultra‑long sequences.
The training is broken into four progressive stages, each adding a larger context window and richer tasks so the model can solidify earlier skills before facing the next level of difficulty.
Why not train the full model on 256 K tokens from the start?
Training directly on ultra‑long sequences would explode memory (quadratic attention) and destabilize the optimizer because the model would have to learn both low‑level perception and long‑range reasoning simultaneously. The staged approach isolates these learning problems, keeping gradients tractable and preserving earlier skills.
Stage 0: Projector learns a linear map $W_p$ from ViT features (dimension 64) to LLM embeddings (dimension 128) using 100 image‑caption pairs.
Stage 1: All parameters are unfrozen; the model processes 8‑token interleaved image‑text batches, reducing loss from 2.3 to 0.9.
Stage 2: Context length doubles to 16; OCR tokens (“123”) are appended, and the model learns to attend to numeric tokens, lowering OCR error from 18 % to 5 %.
Stage 3: The same 16‑token window now contains two video‑segment descriptors; the model learns to aggregate evidence across the two timestamps, improving video‑QA accuracy by 3 %.
This toy example illustrates how each stage adds a new dimension (longer window, new task) while preserving the loss improvements achieved in earlier stages.
Initialize ViT and LLM from pretrained checkpoints; freeze them.
Train the Projector on image‑caption data until alignment loss plateaus.
Unfreeze all modules; train on multimodal batches (image‑text, video‑text, OCR) with a 32 K context.
Increase context to 64 K; add task‑specific data streams (STEM, GUI, grounding, counting, code, tool use).
Further increase context to 256 K; mix long‑context and short‑context samples 1:1.
Continuously clean, deduplicate, and quality‑filter data via a joint Hash + CLIP pipeline.
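The staged schedule above can be written down as a small configuration table. This is a sketch under stated assumptions: the stage names follow the pipeline figure caption, `K = 1024` is a guess at what "K" means, the Stage 0 context length is not given in the text and is left as `None`, and keeping all modules trainable in Stages 2 and 3 is assumed rather than stated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

K = 1024  # assumption: "32 K" means 32 * 1024 tokens


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Tuple[str, ...]
    context_tokens: Optional[int]  # None where the text gives no value
    data: Tuple[str, ...]


ALL_MODULES = ("vit", "projector", "llm")

CURRICULUM = (
    Stage("stage0_projector_init", ("projector",), None, ("image-caption",)),
    Stage("stage1_image_text_alignment", ALL_MODULES, 32 * K,
          ("image-text", "video-text", "ocr")),
    Stage("stage2_multi_task_injection", ALL_MODULES, 64 * K,
          ("stem", "gui", "grounding", "counting", "code", "tool-use")),
    Stage("stage3_long_context_extension", ALL_MODULES, 256 * K,
          ("long-context", "short-context")),
)


def context_schedule(stages):
    """Return the context lengths that are actually specified, in stage order."""
    return [s.context_tokens for s in stages if s.context_tokens is not None]


schedule = context_schedule(CURRICULUM)
```

Keeping the schedule as data makes the invariant easy to check: the context window never shrinks from one stage to the next.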
**Figure.** The training pipeline consists of four sequential stages: Stage 0 (Projector Initialization), Stage 1 (Image-Text Alignment), Stage 2 (Multi-Task Injection), and Stage 3 (Long-Context Extension). Each stage specifies the parameter status of the ViT, Projector, and LLM components, along with the sequence length, data scale, and specific data types used for training.
**Table 1.** Video training setup across pre-training stages.
**Figure 3.** An example of scene-wise dense caption. Each video is decomposed into scenes annotated with timestamps, dense captions, and a global overview.
By the end of Stage 3 the model can ingest 256 K tokens, reason over multi‑hour videos, and retain the fine‑grained capabilities introduced in earlier stages, fulfilling the paper’s efficiency and capability goals.
Post-Training and Instruction Tuning
Post-Training fine‑tunes multimodal abilities into stable instruction‑following behavior.
Pre‑training yields powerful perception and alignment, but without a dedicated fine‑tuning stage the model drifts on language and fails to follow instructions reliably.
After the model has learned to see and align modalities, we expose it to massive instruction data so that its responses become stable, controllable, and faithful to user prompts.
Step 1: Shuffle the 8 k tokens while preserving modality tags.
Step 2: Batch them into groups of 2 k tokens, each batch containing at least one token from every modality.
Step 3: Feed each batch through the pretrained backbone, then compute the instruction‑following loss.
Step 4: Update all parameters jointly; the text‑only portion stabilizes language generation, while video and perception tokens steer visual grounding.
Step 5: After one epoch, the loss on the long‑context snippets drops 15 % relative to a text‑only baseline, showing that the balanced mix preserves context handling.
Even a modest amount of non‑text data forces the model to keep its multimodal pathways alive, preventing the language head from dominating during fine‑tuning.
How does SFT differ from the usual language‑model fine‑tuning that only uses text prompts?
Standard LM fine‑tuning updates a purely textual decoder, leaving the visual encoder untouched. SFT updates the entire multimodal stack on a corpus that deliberately mixes text, video, perception, and long‑context samples, so the model learns to coordinate vision and language while still improving pure‑text instruction following.
Instead of dumping more video frames, we assemble a curated set of tasks that jointly train perception, temporal reasoning, cross‑modal inference, agentic planning, and very long context handling.
Step 1: For the video QA sample, the model receives a question “When does the car turn?” and three temporal clue intervals (e.g., [2‑4 s], [5‑7 s], [8‑10 s]).
Step 2: The model selects the most plausible interval, then the query‑level check verifies that the interval length matches the expected duration.
Step 3: The response‑level check runs a lightweight verifier that compares the predicted answer against a cached ground‑truth snippet.
Step 4: The process‑level check ensures the entire pipeline (question → interval → answer) is internally consistent before the loss is back‑propagated.
Step 5: The perception sample presents a cropped image and asks “What object is centered?”; the model must attend to the visual patch and output the correct label.
Step 6: After one training epoch, accuracy on the video QA rises from 62 % to 71 % while the perception task remains stable, demonstrating that the mixture improves temporal reasoning without harming visual grounding.
The three‑level verification (query, response, process) forces the model to treat temporal evidence as a first‑class object, which translates into more reliable video understanding at test time.
Why isn’t the mixture simply “more video data” if the goal is better video understanding?
Adding raw video frames would increase visual exposure but would not teach the model to reason about temporal evidence or to coordinate with language. The mixture injects structured QA, clue intervals, and cross‑modal checks that explicitly train the model to locate, verify, and reason over video content, which plain video pre‑training does not guarantee.
Specialized Reinforcement Learning
Specialized RL trains domain experts from a shared checkpoint using tailored rewards.
Specialized RL refines a single base policy into multiple domain experts. Each expert receives a reward shaped for its task, allowing the system to acquire fine‑grained capabilities without training separate end models.
The expert learns to place boxes precisely on target objects while avoiding redundant predictions.
Duplicate removal discards B because it overlaps A with IoU≈0.98, leaving A and C.
Hungarian matching pairs A↔G1 (IoU=1.0) and C↔G2 (IoU=1.0).
Min IoU = 1.0 (both targets meet the threshold); mean IoU = (1.0 + 1.0)/2 = 1.0.
Duplicate‑box penalty = 0 (no duplicates remain), so total reward = 1.0.
The example shows how a single stray duplicate would have lowered the mean IoU and incurred a penalty, steering the policy toward compact, non‑redundant outputs.
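A compact sketch of such a grounding reward follows. Several pieces are assumptions: the box coordinates, the 0.9 deduplication threshold, the 0.1 penalty weight, the rule that the penalty applies to surplus boxes left after deduplication, and the final combination `mean IoU - penalty`. Exhaustive search over assignments stands in for Hungarian matching, which is fine only for a handful of boxes.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dedupe(boxes, threshold=0.9):
    """Drop any box that overlaps an earlier kept box at or above the threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, other) < threshold for other in kept):
            kept.append(box)
    return kept


def match_ious(preds, targets):
    """Per-target IoU under the best one-to-one assignment (unmatched targets get 0)."""
    if not targets:
        return []
    slots = list(preds) + [None] * max(0, len(targets) - len(preds))
    best = None
    for perm in permutations(range(len(slots)), len(targets)):
        ious = [0.0 if slots[p] is None else iou(slots[p], t)
                for p, t in zip(perm, targets)]
        if best is None or sum(ious) > sum(best):
            best = ious
    return best


def grounding_reward(preds, targets, dup_threshold=0.9, penalty=0.1):
    kept = dedupe(preds, dup_threshold)
    ious = match_ious(kept, targets)
    mean_iou = sum(ious) / len(ious) if ious else 0.0
    min_iou = min(ious) if ious else 0.0
    surplus = max(0, len(kept) - len(targets))  # assumed penalty rule
    return {"min_iou": min_iou, "mean_iou": mean_iou,
            "reward": mean_iou - penalty * surplus}


A, B, C = (0, 0, 10, 10), (0, 0, 10, 9.8), (20, 20, 30, 30)
G1, G2 = (0, 0, 10, 10), (20, 20, 30, 30)
result = grounding_reward([A, B, C], [G1, G2])
```

With these toy boxes, `B` overlaps `A` at IoU 0.98 and is dropped, `A` and `C` match `G1` and `G2` exactly, and the reward comes out at 1.0 as in the worked example.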
How does this grounding reward differ from the standard object‑detection loss (e.g., cross‑entropy over class and smooth‑L1 box regression)?
Instead of penalizing each box independently, the reward enforces a global one‑to‑one matching and explicitly rewards the worst‑case IoU, which prevents the model from ignoring hard targets. The duplicate‑box penalty also discourages the “many‑boxes” cheat that typical detection losses tolerate.
The expert is judged by a learned model that checks whether a generated description satisfies the required spatial relation and adheres to a prescribed answer format.
The model answers “left of the blue sphere”.
The generative judge parses the answer, compares the spatial relation to the ground‑truth (red cube is indeed left of the sphere), and returns a correctness score of 1.
The format reward adds 0.2 because the answer follows the required answer format.
Total reward = 1 + 0.2 = 1.2.
The judge’s discrete score captures relational correctness, while the format term nudges the model toward consistent answer phrasing.
Why not use a simple distance‑threshold reward (e.g., negative Euclidean distance) for spatial tasks?
A distance metric cannot express relational concepts like “behind” or “inside” and ignores scene constraints such as occlusion. The model‑judge reward encodes these higher‑level relations and can penalize implausible answers even if the raw distance is small.
The expert receives a reward only when its answer is mathematically equivalent to the reference, regardless of superficial syntactic differences.
The prediction “4x” is parsed into the symbolic form $4x$.
The reference answer “4·x” is also parsed into $4x$.
Canonical forms match, so the reward = 1 (positive).
This shows that superficial notation differences (explicit multiplication sign) do not affect the reward as long as the underlying mathematics is identical.
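One cheap way to approximate this kind of check is sketched below. It is not true symbolic canonicalization: it swaps in randomized numeric testing, evaluating both expressions at random values of a single variable `x`. The normalization rules, the restriction to `x` and basic arithmetic, and the function name `equivalence_reward` are all assumptions, and `eval` is used on a whitelisted character set purely for illustration.

```python
import random
import re

_ALLOWED = re.compile(r"[0-9x+\-*/().]*")


def _normalize(expr):
    """Rewrite common notation into Python syntax (toy rules, single variable x)."""
    expr = expr.replace("·", "*").replace("^", "**").replace(" ", "")
    # Insert implicit multiplication: "4x" -> "4*x", "2(x+1)" -> "2*(x+1)".
    return re.sub(r"(?<=[0-9x)])(?=[x(])", "*", expr)


def _evaluate(expr, x):
    normalized = _normalize(expr)
    if not _ALLOWED.fullmatch(normalized):
        raise ValueError(f"unsupported expression: {expr!r}")
    return eval(normalized, {"__builtins__": {}}, {"x": x})


def equivalence_reward(prediction, reference, trials=20, tol=1e-9, seed=0):
    """Return 1 if both expressions agree at every sampled point, else 0."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-10.0, 10.0)
        try:
            a = _evaluate(prediction, x)
            b = _evaluate(reference, x)
            if abs(a - b) > tol * max(1.0, abs(a), abs(b)):
                return 0
        except Exception:
            return 0  # unparsable or undefined answers earn no reward
    return 1


reward_same = equivalence_reward("4x", "4·x")
reward_diff = equivalence_reward("4x", "4+x")
```

A production verifier would use a computer algebra system instead, but the reward shape is the same: notation differences are ignored and only mathematical agreement is paid.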
How does a symbolic‑equivalence reward differ from a standard RL reward based on token‑level BLEU scores?
BLEU measures surface n‑gram overlap and can give credit to mathematically incorrect but lexically similar strings. The symbolic‑equivalence reward ignores wording entirely and only rewards true mathematical identity, preventing the model from exploiting shallow language patterns.
Cross-Modal Distillation
Consolidate heterogeneous RL-trained capabilities using multi-teacher on-policy distillation with dynamic routing and overlap-focused feedback.
Domain-specific post-training often creates conflicting response styles: reasoning-heavy RL might shorten responses, while agentic training might force rigid tool-call formatting. To consolidate these heterogeneous capabilities without style collapse, Keye-VL-2.0 uses Cross-Modal Multi-Teacher On-Policy Distillation (MOPD).
Instead of forcing all domains into one style, MOPD routes each sample to the most relevant RL-trained teacher and distills its feedback only where the student and teacher agree on the plausible next-token distribution.
Identify the overlap set $\Omega = \{ \text{"42", "10"} \}$.
Calculate the student's normalized probability $\bar{\pi}_\theta$ over $\Omega$ by re-weighting its original probabilities for "42" and "10".
Compute the advantage $A_{i,t}$ as the weighted difference in log-probabilities between teacher and student for these two tokens.
By ignoring "43" (teacher-only) and "5" (student-only), the model avoids learning from tokens where the teacher's preference is too far from the student's current policy, keeping the distillation stable.
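The three steps above can be sketched for a single token position. The probabilities below are invented toy values, and two modeling choices are assumptions: the overlap set is taken as the intersection of each side's top-$k$ tokens, and the advantage is the student-weighted sum of log-probability gaps over that set. The report's exact formula may differ.

```python
import math


def top_k_tokens(probs, k):
    """Return the k most probable tokens of a {token: probability} mapping."""
    return set(sorted(probs, key=probs.get, reverse=True)[:k])


def overlap_advantage(student, teacher, k=3):
    """Compute the overlap set, the renormalized student probabilities, and the advantage."""
    omega = top_k_tokens(student, k) & top_k_tokens(teacher, k)
    if not omega:
        return omega, {}, 0.0
    mass = sum(student[v] for v in omega)
    student_bar = {v: student[v] / mass for v in omega}
    advantage = sum(
        student_bar[v] * (math.log(teacher[v]) - math.log(student[v]))
        for v in omega
    )
    return omega, student_bar, advantage


# Toy next-token distributions (hypothetical values).
student = {"42": 0.5, "10": 0.3, "5": 0.2}
teacher = {"42": 0.6, "43": 0.3, "10": 0.1}
omega, student_bar, advantage = overlap_advantage(student, teacher)
```

Here "43" and "5" fall outside `omega`, so neither contributes to the signal, and the student's mass on the shared tokens is renormalized to 0.625 and 0.375.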
Training and Inference Infrastructure
Decoupling I/O from training eliminates bottlenecks and unlocks efficient long‑video processing.
Training on video‑heavy batches stalls when the data pipeline spends most of its time reading and decoding raw frames. The I/O path becomes the dominant latency source, leaving GPUs under‑utilized despite abundant compute capacity.
ExtraIO runs a separate, horizontally scalable service that streams decoded frames to the trainer asynchronously, so compute never waits for disk or decoder.
Worker 1 decodes S₁ (3 frames) → buffer slots 0‑2; Worker 2 decodes S₂ (2 frames) → slots 3‑4; Worker 3 decodes S₃ (4 frames) → slots 5‑8 (overflow, so slot 8 waits).
Training step 1 on GPU A pulls the first 3 frames (S₁) from the buffer, leaving slots 0‑2 empty; the buffer now holds slots 3‑7.
Worker 3 finishes decoding S₃ by writing its last frame (slot 8) into the freed space; no stall occurs, and GPU B consumes slots 3‑5 (S₂ plus the first frame of S₃).
When GPU A finishes its forward pass, it requests the next batch; the buffer instantly provides S₃’s remaining frames (slots 6‑8) without waiting for disk I/O.
By decoupling decoding from the compute loop, ExtraIO smooths out the irregular frame‑count distribution, ensuring both GPU groups stay busy even when individual samples vary widely in length.
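The decoupling idea can be illustrated with a bounded producer/consumer buffer. This is a single-process sketch using threads, not the ExtraIO service itself: the worker function, the tuple standing in for a decoded frame, and the buffer capacity of 8 are assumptions.

```python
import queue
import threading


def decode_worker(sample_id, n_frames, buffer):
    """Simulate decoding: push frames one by one, blocking while the buffer is full."""
    for frame_index in range(n_frames):
        buffer.put((sample_id, frame_index))


def run_pipeline(samples, capacity=8):
    """Decode all samples in background threads while the caller consumes frames."""
    buffer = queue.Queue(maxsize=capacity)
    workers = [
        threading.Thread(target=decode_worker, args=(sample_id, n_frames, buffer))
        for sample_id, n_frames in samples.items()
    ]
    for worker in workers:
        worker.start()
    total_frames = sum(samples.values())
    # The "trainer" pulls frames as they become ready; it never decodes itself.
    consumed = [buffer.get() for _ in range(total_frames)]
    for worker in workers:
        worker.join()
    return consumed


consumed = run_pipeline({"S1": 3, "S2": 2, "S3": 4})
```

The interleaving across samples varies from run to run, but every frame is delivered exactly once and frames within a sample stay in order, because each worker writes sequentially into a FIFO queue.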
ViT–LM heterogeneous parallelism further reduces idle time by letting the vision encoder and language model shard independently, while two‑level load balancing equalizes compute across visual‑token and sample dimensions.
DSA optimizations for variable‑length sequences prune the $T\times T$ score matrix to $T\times\text{max\_seq}$, and short‑sequence shortcuts skip unnecessary top‑k scans, yielding up to 1.5× end‑to‑end speedup on mixed‑length batches.
**Figure 4.** Inference cost of Keye-VL-2.0-30B-A3B. DSA-specific prefill and decode optimizations reduce the cost of ultra-long video inference relative to dense attention under the same H800 pricing assumption.
How does ExtraIO differ from a conventional data loader that simply reads frames on the fly?
A standard loader interleaves decoding with the training step, so a slow video stalls the entire batch. ExtraIO isolates decoding in independent workers, buffers results, and lets the trainer pull ready frames at any pace, eliminating the stall and allowing the I/O service to scale horizontally.
Video Understanding Evaluation
Keye‑VL‑2.0’s video understanding excels across comprehensive, grounding, and knowledge benchmarks.
Keye‑VL‑2.0 combines Mixture‑of‑Experts with DeepSeek Sparse Attention to process 256K‑token video sequences without quadratic cost.
Keye‑VL‑2.0 sets a new state‑of‑the‑art accuracy of 74.1 % on LongVideoBench.
The table shows 74.1 % for Keye‑VL‑2.0 versus 61.6 % for Qwen3.5‑35B‑A3B and 53.2 % for InternVL3.5‑241B‑A28B.
On Video‑MME‑v2, Keye‑VL‑2.0 attains 35.3 % (64 frames) and 42.4 % (512 frames) accuracy, comparable to dense‑frame baselines. It also remains competitive on MLVU (82.8 %) and Video‑MME without subtitles (78.3 %).
**Figure 5.** Overall evaluation summary of Keye-VL-2.0-30B-A3B. The figure summarizes representative results across video understanding, coding, agentic tool use, mathematical and scientific reasoning, instruction following, and general vision-language benchmarks. Orange scores mark leading results in each row, and “–” indicates unavailable or not directly comparable scores. Higher is better unless otherwise specified by the corresponding benchmark; detailed benchmark descriptions and citations are provided in the subsections below.
Code Agent Evaluation
Keye-VL-2.0-30B-A3B leads the compared baselines on OJBench and LiveCodeBench v6 and stays close to the best competitor on SWE‑bench Verified.
Keye-VL-2.0-30B-A3B scores 71.5 on OJBench, beating the nearest competitor by 1.3 points.
Table 3 shows OJBench scores 71.5 (Keye‑VL) vs 70.2 (Qwen3.5‑35B‑A3B).
On LiveCodeBench v6 the model reaches 64.2, surpassing Qwen3.5‑35B‑A3B (62.8). SWE‑bench Verified yields 62.0, close to the best competitor’s 63.5. An additional benchmark records 51.5 for Keye‑VL versus 58.7 (Qwen3.5) and 55.5 (InternVL3.5).
Case Studies: Spatial and Narrative
Qualitative examples illustrate Keye‑VL‑2.0 handling logical, spatial, and anatomical tasks.
Keye‑VL‑2.0 achieves the top scores on $\tau^2$‑Bench (82.6) and VitaBench (33.1), surpassing all compared models.
Table 4 shows Keye‑VL‑2.0 30B‑A3B scoring 82.6 on $\tau^2$‑Bench and 33.1 on VitaBench, higher than Qwen3.5 35B‑A3B and InternVL3.5 241B‑A28B.
While Keye‑VL‑2.0 leads on two benchmarks, it trails Qwen3.5 35B‑A3B on BFCL‑V4, indicating room for improvement on function calling.
**Figure 7.** Image case for spatial layout understanding. Given a labeled top-down indoor scene, Keye-VL-2.0 identifies object orientations, egocentric left-right relations, furniture positions, and the direction needed to move an object.
**Figure.** A sequence of seven scenes illustrating a narrative about archaeological discovery and the history of Chinese script. Scene 1 shows a man in a field; Scene 2 shows a highway; Scene 3 shows an elderly man at a desk; Scene 4 shows hands holding an inscribed artifact; Scene 5 shows a historical black-and-white photograph of a rural site; Scene 6 shows two men discussing documents in an office; Scene 7 shows ancient characters for "rain" and "big" alongside their modern counterparts.