VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao

VideoKR is a large-scale, expert-curated training corpus designed to bridge the gap between surface-level video perception and knowledge-intensive reasoning.

How can we construct a large-scale, high-quality training corpus to improve knowledge- and reasoning-intensive video understanding?

Current video understanding models struggle with multi-step inference and domain-specific reasoning because existing training datasets focus on everyday activities and simple perceptual tasks. The authors introduce VideoKR, a corpus of 315K expert-domain video examples generated through a skill-oriented, human-in-the-loop pipeline that enforces rigorous quality control and chain-of-thought supervision. Models post-trained on VideoKR outperform prior approaches on knowledge-intensive benchmarks while maintaining competitive performance on general video reasoning tasks.

Paper Primer

The core mechanism is a skill-oriented generation framework that decomposes video understanding into three hierarchical capabilities: basic reasoning, knowledge-enhanced perception, and knowledge-intensive multi-hop inference. This pipeline acts like a curriculum designer: it systematically samples domain-specific knowledge points, retrieves relevant real-world videos, and generates challenging QA pairs that require both visual evidence and external domain expertise to solve.

VideoKR significantly boosts knowledge-intensive reasoning performance in standard-sized models.

Qwen2.5-VL-7B and Qwen3-VL-8B were post-trained on VideoKR using a standard SFT-GRPO pipeline. Relative to the base models, knowledge-intensive average accuracy increased by 4.7 points for Qwen2.5-VL-7B and by 3.0 points for Qwen3-VL-8B.

VideoKR-Eval provides a more robust benchmark for genuine video reasoning.

Multi-model single-frame probing, combined with expert re-annotation of the filtered videos, eliminates the "single-frame answerability" bias found in prior benchmarks like MMVU and VideoMMMU.

Why does this paper focus on data design rather than new model architectures?

The authors observe that existing corpora are saturated for current frontier models and lack the depth required for expert-level reasoning. By isolating data as the primary bottleneck, they demonstrate that high-quality, skill-stratified data is sufficient to drive performance gains without needing complex reward engineering.

What is the scope of the VideoKR dataset?

It covers 82 professional subjects across four major disciplines (Natural Sciences, Healthcare, Humanities/Social Sciences, and Engineering) using 145K CC-licensed videos, ensuring legal reusability and clear provenance.

The authors emphasize that "single-frame answerability"—where models solve video tasks by looking at a single static image—is a major flaw in current benchmarks. VideoKR-Eval explicitly filters out these "shortcut" examples to ensure models are actually reasoning over temporal video content.

Researchers should prioritize skill-oriented, expert-curated data over simply scaling up existing, perception-heavy video datasets to achieve deeper reasoning capabilities.

Introduction to VideoKR

We introduce VideoKR, a large-scale corpus to shift video understanding from perception to domain-specific reasoning.

Modern video models excel at surface‑level perception but falter when tasks demand domain knowledge and multi‑hop inference. Existing large‑scale video corpora are biased toward everyday actions and provide little support for knowledge‑intensive reasoning, leading to brittle performance on professional‑level tasks. To close this gap we propose a new data‑centric approach that emphasizes domain‑specific knowledge and stepwise reasoning.

A curated collection of 145 K CC‑licensed videos spanning 82 professional subjects, paired with 315 K QA examples that target progressively deeper reasoning skills.

Tasks that need the model to combine visual evidence with external domain knowledge and perform multi‑step logical deductions.

**Figure 1.** An overview of the VideoKR training corpus. All videos are newly collected and CC licensed, and span a wide range of professional domains. We develop a skill oriented QA synthesis pipeline in which every example is grounded in one of three core skills essential for advanced video reasoning, and examples in the CoT subset are further paired with a high quality reasoning trace.

The shift from general perception to domain‑specific reasoning is the key driver of progress in video understanding.

Constructing the VideoKR Corpus

We build a large, high‑quality video‑QA corpus via a semi‑automated, expert‑audited pipeline.

Building a professional‑grade video‑QA corpus at scale is impossible to do by hand, yet naïve model‑generated data suffer from systematic artifacts. We therefore design a quality‑controlled, semi‑automated pipeline where every model‑produced artifact is inspected and validated by domain experts.

We first assemble a structured repository of domain concepts so that every downstream video can be anchored to a concrete knowledge point.

How does this bank differ from using an off‑the‑shelf knowledge graph?

Public graphs are generic and lack the fine‑grained lecture‑level breakdown we need for video reasoning; our bank is curated per‑discipline, per‑course, and verified by domain experts, guaranteeing relevance to the visual content we later retrieve.

Instead of searching videos with raw textbook terms, we first ask LLMs to invent realistic scenarios that embody each knowledge point, then use those scenarios as search queries.

Why not just search the knowledge‑point term directly?

Direct term searches return lecture recordings that explain the concept but rarely show it in action; scenario‑driven queries surface real‑world footage where the principle is exercised, which is essential for knowledge‑intensive reasoning.

**Figure 2.** (Left) Overview of data construction pipeline. (Right) Statistics of VideoKR-SFT-201K and VideoKR-RL-114K training corpus.

We turn each video into multiple QA pairs that explicitly test perception, knowledge‑enhanced perception, and knowledge‑intensive reasoning, while forcing the model to produce step‑by‑step rationales that are later verified.

Step 1: Sample three frames (t=0 s, t=2 s, t=4 s) and feed them plus the knowledge point to the generation model.

Step 2: Model outputs a multi‑choice question: “What temperature does the water reach when it starts to bubble?” with four options.

Step 3: Model also generates a CoT rationale linking the visual cue (bubbles appearing at frame t=4 s) to the known boiling point (100 °C).

Step 4: Self‑consistency check re‑prompts the model with the same frames and question; it reproduces the same answer and rationale.

Step 5: Two vision‑language models answer the question using only the three sampled frames; both answer incorrectly, confirming visual dependence.

Step 6: An independent verifier checks that each reasoning step cites an observable cue (e.g., “bubbles appear”) and a domain fact (boiling point), and accepts the example.

The example demonstrates how the three filters together guarantee that the final QA truly requires both visual evidence and domain knowledge.
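The acceptance logic of the walkthrough above can be sketched as a single predicate. This is a minimal illustration, not the authors' implementation: the `Candidate` type and the three callables (`regenerate`, `probe_models`, `step_is_grounded`) are hypothetical stand-ins for model calls, and self-consistency is simplified to comparing answers only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """A generated QA example (hypothetical container for illustration)."""
    question: str
    options: List[str]
    answer: str            # answer proposed by the generation model
    rationale: List[str]   # one string per reasoning step


def passes_filters(
    cand: Candidate,
    regenerate: Callable[[Candidate], str],
    probe_models: List[Callable[[Candidate], str]],
    step_is_grounded: Callable[[str], bool],
) -> bool:
    # Filter 1: self-consistency. Re-prompting must reproduce the answer.
    if regenerate(cand) != cand.answer:
        return False
    # Filter 2: visual dependence. Every probe model, given restricted
    # input, must fail to reach the answer.
    if any(probe(cand) == cand.answer for probe in probe_models):
        return False
    # Filter 3: step verification. Each reasoning step must be grounded
    # in an observable cue and a domain fact.
    return all(step_is_grounded(step) for step in cand.rationale)
```

An example is kept only when all three checks pass, so a failure in any one filter is enough to reject it.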

How is this pipeline different from simply prompting an LLM to write QA pairs for a video?

Plain prompting yields uncontrolled style and often ignores the visual stream; our pipeline enforces (i) skill‑aware conditioning, (ii) frame‑level grounding, and (iii) three orthogonal verification steps, eliminating hallucinated or text‑only shortcuts.

**Figure 4.** A VideoKR-SFT-201K example from the natural science domain. The reasoning process is presented in a concise and abbreviated form to improve readability.

**Figure 6.** A VideoKR-SFT-201K example from the engineering domain. The reasoning process is presented in a concise and abbreviated form to improve readability.

Existing Video Benchmarks

We position VideoKR among prior video datasets and post‑training methods.

Recent benchmarks have broadened video understanding to cover perception, spatiotemporal reasoning, and cross‑modal inference. A newer wave of knowledge‑intensive evaluations demands domain‑specific reasoning beyond surface‑level perception.

A multimodal evaluation suite that measures perceptual acuity, temporal reasoning, and cross‑modal alignment on short video clips.

A benchmark that evaluates multimodal video understanding across diverse domains, emphasizing spatiotemporal coherence.

A vision‑speech‑interaction benchmark that tests models on audio‑visual grounding and cross‑modal retrieval.

A large‑scale video‑question answering suite that probes high‑level reasoning, including commonsense and causal inference.

MMVU evaluates models on videos from specialized domains, demanding that they apply domain‑specific knowledge to answer questions.

VideoMMMU targets expert‑level understanding of lecture‑style videos, requiring models to answer subject‑specific questions with high precision.

SciVideoBench assesses advanced scientific reasoning on videos that illustrate experiments, equations, and technical demonstrations.

A multilingual video‑language understanding benchmark that adapts the MMLU test suite to video contexts.

**Table 1.** Comparison of VideoKR with prior post-training corpora for video understanding. %Video denotes the fraction of video understanding examples, CC indicates whether all videos are Creative Commons (CC) licensed. †: data has not been open-sourced.

The VideoKR-Eval Benchmark

VideoKR-Eval isolates continuous video reasoning with a curated 2,000‑example benchmark.

Multi‑model single‑frame filtering leaves only examples that truly require continuous video reasoning.

Three frontier models (Qwen3‑VL‑235B‑A22B, Claude‑4.5‑Sonnet, GPT‑5.2) agree on 1,254 original QA pairs, which are kept for the benchmark.
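A rough sketch of the unanimity rule follows. It assumes that a model "agrees" an example needs continuous video when it fails to answer correctly from a single frame. That reading, the `single_frame_correct` field, and the model keys are illustrative assumptions rather than details from the paper.

```python
from typing import Dict, List


def requires_continuous_video(single_frame_correct: Dict[str, bool]) -> bool:
    """True only if no probing model solved the example from one frame."""
    return not any(single_frame_correct.values())


def filter_benchmark(examples: List[dict]) -> List[dict]:
    """Keep examples that every probing model failed under single-frame input."""
    return [
        ex for ex in examples
        if requires_continuous_video(ex["single_frame_correct"])
    ]
```

Requiring all probing models to fail is the strictest choice: a single model that succeeds from one frame is enough to drop the example.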

**Table.** Performance comparison of different models across various benchmarks, including VidMMMU, MMVU, SciVidBench, and VideoKR-Eval.

Existing benchmarks let models cheat with a single frame, so they miss continuous video reasoning.

Experimental Setup

We describe the controlled post‑training pipeline and the standardized evaluation protocol.

Existing post‑training research often attributes gains to ever‑more sophisticated reinforcement‑learning tricks, but we suspect the real limiter is how the training data are constructed for knowledge‑intensive video reasoning.

Think of the model as a raw stone: first we give it a coarse shape with supervised fine‑tuning, then we polish the edges with reward‑driven optimization (GRPO), keeping the overall process simple so any improvement can be traced back to the data.

How does this GRPO post‑training differ from a typical reinforcement‑learning fine‑tuning that also uses a reward signal?

Standard RL fine‑tuning often interleaves policy updates with the original pretraining objective, which can obscure whether improvements come from the reward or from lingering pretraining dynamics. In our pipeline the model first completes a full supervised pass (SFT), fully committing to the data distribution, and only then receives a single‑epoch GRPO update that directly optimizes the reward on the same data. This clean separation isolates the effect of the reward‑driven step.

Step 1 (SFT): the model sees 2 short video clips, each represented by 8 tokens, and learns to predict the associated QA pairs for one pass.

Step 2 (GRPO): after SFT, we compute ROUGE scores for the generated answers; suppose the first clip gets 0.6 and the second 0.4. The GRPO reward is R = 0.1·0.6 + 0.9·0.4 = 0.42 (using the reward formula from the appendix).

Step 3 (GRPO update): the policy gradient scales the model’s logits by the reward 0.42, performing a single gradient step with learning rate 0.001.

Step 4 (Zero‑RL variant): for the second base model we skip Step 1 and apply the same GRPO update directly on the raw pretrained weights.

The toy run shows that the only difference between the two variants is whether the model first adapts to the data via SFT; any performance gap can therefore be blamed on the data‑driven fine‑tuning stage.

For evaluation we adopt seven benchmarks split into two groups: general video reasoning (Video‑MME, MVBench, LongVideoBench) and knowledge‑intensive video reasoning (VideoMMMU, MMVU, SciVideoBench, VideoKR‑Eval). To eliminate reproducibility issues we use each model’s official prompt when available, otherwise the default LMMs‑Eval templates, run three independent samples per model, and report the mean scores.
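The reporting protocol can be sketched as below. The benchmark grouping follows the text; taking an unweighted mean over benchmarks for the category average is an assumption, and the function names are illustrative.

```python
from statistics import mean
from typing import Dict, List

GENERAL = ["Video-MME", "MVBench", "LongVideoBench"]
KNOWLEDGE = ["VideoMMMU", "MMVU", "SciVideoBench", "VideoKR-Eval"]


def mean_per_benchmark(runs: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the independent runs of each benchmark."""
    return {name: mean(scores) for name, scores in runs.items()}


def category_average(means: Dict[str, float], benchmarks: List[str]) -> float:
    """Unweighted average over the benchmarks of one category (assumed)."""
    return mean(means[name] for name in benchmarks)
```

Here `runs` maps each benchmark name to its three per-run accuracies, and `category_average` is called once per group.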

Main Results and Analysis

Post‑training on VideoKR lifts knowledge‑intensive performance across models.

Current video models lack domain‑specific knowledge and multi‑step reasoning; VideoKR adds a large, skill‑oriented corpus and a rigorous benchmark to address this gap.

Post‑training on VideoKR raises the knowledge‑intensive average of Qwen2.5‑VL‑7B from 41.9 % to 46.6 % (+4.7) and of Qwen3‑VL‑8B from 48.5 % to 51.5 % (+3.0).

See Table 3.

Increasing the number of input frames at inference improves both general and knowledge‑intensive reasoning; the post‑trained Qwen2.5‑VL‑7B reaches 60.1 % at 16 frames and 65.5 % at 128 frames on general benchmarks, while its knowledge‑intensive score rises from 44.2 % to 46.6 %.

**Table 3.** Benchmark results across general and knowledge-intensive video reasoning. Models are grouped into (i) Other Models and (ii) methods built on Qwen2.5-VL-7B-Instruct or Qwen3-VL-8B-Instruct (with the indicated input Frames). Within each group for (ii), the best score is bold and the second-best is underlined.

**Figure 3.** Inference-time frame scaling results on general and knowledge-intensive video reasoning benchmarks. The figure shows category-wise average accuracies for Qwen2.5-VL-7B-Instruct and its VideoKR post-trained variant (SFT+RL) under different input frame budgets. Appendix D.1 provides the full per-benchmark results for post-trained Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct models.

**Table 4.** Ablation studies on post-training data. All experiments use Qwen2.5-VL-7B-Instruct as the base model, with 128 input frames. The complete results are provided in Appendix D.2.

**Table 5.** Accuracy of Qwen2.5/3-VL models on 3,000 randomly sampled QA examples from various post-training corpora.

Post-Training and RL Details

Details of the post‑training reinforcement learning setup and benchmark prompts.

Group Relative Policy Optimization (GRPO) combines a format reward $R_f$ and an accuracy reward $R_a$ with fixed weights to guide post‑training reinforcement learning.

How does GRPO differ from the standard PPO algorithm used in most RL‑fine‑tuning pipelines?

GRPO replaces PPO’s learned value‑function baseline with a group‑relative baseline: several responses are sampled for the same prompt, and each response’s reward is normalized against the statistics of its own group. In this setup the scalar reward mixes two terms (format and accuracy), so the update explicitly rewards output compliance alongside correctness. Dropping the separate critic keeps the pipeline simple.
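The group‑relative baseline can be illustrated with a short sketch. It assumes the commonly used formulation in which each reward is standardized by the mean and standard deviation of its group. The use of the population standard deviation and the `eps` constant are choices made here for illustration, and implementations vary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that beat their group's average receive a positive advantage and the rest a negative one. If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.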

Suppose a response has a perfect format ($R_f = 1$) and an accuracy reward of $R_a = 0.6$. Compute the weighted format contribution: $0.1 \times 1 = 0.1$.

Compute the weighted accuracy contribution: $0.9 \times 0.6 = 0.54$.

Sum the contributions: $R = 0.1 + 0.54 = 0.64$.

This example shows that even a perfect format cannot compensate for low accuracy; the overall reward is dominated by the accuracy term because of its larger weight.
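The arithmetic above corresponds to a weighted sum, sketched here with the weights 0.1 and 0.9 taken from the worked example. The function and constant names are illustrative.

```python
W_FORMAT = 0.1    # weight on the format reward R_f (from the worked example)
W_ACCURACY = 0.9  # weight on the accuracy reward R_a (from the worked example)


def combined_reward(r_format: float, r_accuracy: float) -> float:
    """Weighted sum of the format and accuracy rewards."""
    return W_FORMAT * r_format + W_ACCURACY * r_accuracy
```

With these weights, a well-formatted but wrong answer earns at most 0.1, while a correct answer in the wrong format can still earn up to 0.9.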

The post‑training evaluation suite covers a wide range of benchmarks: VideoAuto‑R1, Video‑MME (multiple‑choice), generic “other” benchmarks, and the reasoning‑intensive Video‑R1/VideoRFT prompts. Each prompt follows a consistent template—question, optional answer choices, and a required tag—while reasoning‑heavy tasks explicitly ask the model to think step‑by‑step before answering.

**Figure.** A film strip of eight numbered frames depicting historical scenes related to the women's suffrage movement, alongside two Q&A boxes that analyze the video content.

Additional Ablation Studies

VideoKR adds domain‑specific knowledge and multi‑step reasoning via a skill‑oriented corpus and benchmark.

We first revisit the premise that VideoKR equips video models with the domain knowledge and multi‑step reasoning needed for professional tasks, then examine how frame count and training data affect performance.

Skill‑Oriented Data Composition drives the bulk of the post‑training improvement.

Full composition (VR + KV + KVR) attains 60.6 % average on general reasoning, whereas the baseline model scores 65.1 %.

Direct‑output supervision outperforms chain‑of‑thought (CoT) prompting.

Direct output yields 65.6 % average on general reasoning, whereas CoT drops to 60.6 %.

VideoKR‑SFT‑201K (our SFT corpus) surpasses alternative SFT corpora.

Our SFT variant reaches 63.2 % average on general reasoning; the next best (Video‑R1‑CoT‑165k) scores 62.9 %.

VideoKR‑RL‑114K (our RL corpus) achieves the highest knowledge‑intensive score.

Our RL variant records 43.0 % average, beating the runner‑up (VideoAuto‑R1‑83K) at 42.4 %.

The case studies illustrate how the model’s reasoning aligns with the intended risk‑mitigation strategies in supply‑chain videos.

Data Construction Details

Details of the corpus construction pipeline, annotators, and quality controls.

We manually reviewed undergraduate curricula from leading universities worldwide and identified 82 representative subjects spanning Natural Sciences, Engineering, Healthcare, and Humanities & Social Sciences. These subjects form the top‑level index of a four‑layer knowledge base (subject → course → lecture → knowledge point), ensuring broad cross‑domain coverage and balanced sampling.
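One way to picture the four‑layer hierarchy is as nested records, sketched below. The class names, fields, and traversal helper are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Lecture:
    title: str
    knowledge_points: List[str] = field(default_factory=list)


@dataclass
class Course:
    title: str
    lectures: List[Lecture] = field(default_factory=list)


@dataclass
class Subject:
    name: str
    discipline: str  # one of the four major disciplines
    courses: List[Course] = field(default_factory=list)


def iter_knowledge_points(subjects: List[Subject]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (subject, course, lecture, knowledge point) paths."""
    for subject in subjects:
        for course in subject.courses:
            for lecture in course.lectures:
                for point in lecture.knowledge_points:
                    yield (subject.name, course.title, lecture.title, point)
```

Flattening the tree into full paths makes it straightforward to sample knowledge points evenly across subjects while keeping each point traceable to its course and lecture.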

The annotator pool consists of 34 individuals spanning PhD and master students from the four major disciplines. Each annotator contributed to at least one of the five pipeline tasks—Knowledge Bank construction, Seed Example curation, Model Validation, Manual Quality Assessment, and Evaluation Benchmark construction.

Sample entries from the VideoKR corpus illustrate the diversity of reasoning required. For example, a chemistry video yields the Q&A pair “What value is shown on the display at around 01:38?” → “–6.419”, and a physics video asks “Calculate the $K_{sp}$ of Mg(OH)₂ using the observed pH.” → “$1.0 \times 10^{-12}$”. These examples combine visual observation, domain knowledge, and quantitative calculation.

To avoid model‑specific bias we performed a human‑validated selection of foundation models for each pipeline stage. Seven frontier models were evaluated on 100 representative inputs; a model qualified for a stage only if its total error rate (hard format failures + soft content failures) was ≤ 3 %. The resulting eligibility matrix is shown in Table 8.
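The eligibility rule reduces to a threshold on the combined failure count, sketched below. The 100 inputs and the 3 % threshold come from the text; the function name, the model keys, and the stage name in the usage example are hypothetical.

```python
from typing import Dict, Tuple


def is_eligible(hard_failures: int, soft_failures: int,
                n_inputs: int = 100, max_error_pct: int = 3) -> bool:
    """A model qualifies if its total error rate is at most max_error_pct percent."""
    total = hard_failures + soft_failures
    # Integer comparison avoids floating-point edge cases at the threshold.
    return total * 100 <= max_error_pct * n_inputs


def eligibility_matrix(
    failures: Dict[str, Dict[str, Tuple[int, int]]]
) -> Dict[str, Dict[str, bool]]:
    """Map model -> stage -> eligibility from (hard, soft) failure counts."""
    return {
        model: {stage: is_eligible(h, s) for stage, (h, s) in stages.items()}
        for model, stages in failures.items()
    }
```

For instance, `eligibility_matrix({"model_a": {"qa_generation": (1, 2)}})` marks `model_a` as eligible for that stage, because 3 failures out of 100 inputs sits exactly at the threshold.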

Near‑duplicate video filtering mitigates data contamination. Both benchmark and corpus videos are sampled at 1 fps, and a 64‑bit perceptual hash is computed for each frame. Overlapping 20‑second windows (1‑second stride) are indexed; for each benchmark video we retrieve the top‑10 candidate windows and flag an overlap when ≥ 70 % of frames have Hamming distance ≤ 30, after which the matching training video is removed.
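The overlap test can be sketched as follows, assuming the per‑frame 64‑bit hashes are already available as integers. The sketch compares windows exhaustively and frame by frame. The index lookup and top‑10 candidate retrieval described above are omitted, and the frame alignment and function names are assumptions.

```python
from typing import Iterator, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def windows(hashes: List[int], size: int = 20, stride: int = 1) -> Iterator[List[int]]:
    """Overlapping windows; at 1 fps, 20 frames span 20 seconds."""
    last_start = max(len(hashes) - size, 0)
    for start in range(0, last_start + 1, stride):
        yield hashes[start:start + size]


def windows_overlap(w1: List[int], w2: List[int],
                    max_dist: int = 30, min_frac: float = 0.7) -> bool:
    """Flag a pair when enough aligned frames are within the Hamming bound."""
    n = min(len(w1), len(w2))
    if n == 0:
        return False
    close = sum(hamming(a, b) <= max_dist for a, b in zip(w1, w2))
    return close / n >= min_frac


def is_contaminated(bench_hashes: List[int], train_hashes: List[int]) -> bool:
    """Brute-force check of every benchmark window against every training window."""
    return any(
        windows_overlap(bw, tw)
        for bw in windows(bench_hashes)
        for tw in windows(train_hashes)
    )
```

The exhaustive comparison is quadratic in the number of windows, so it only suits small inputs. At corpus scale, a candidate-retrieval index like the one described above would be needed.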

**Table.** Categorization of academic disciplines into four major fields: Natural Sciences, Engineering, Healthcare, and Humanities & Social Sciences.

**Table 7.** Biographies of 34 annotators involved in the VideoKR construction pipeline. The table details their participation in: Know. Bank (Domain Knowledge Bank Construction), Seed Ex. (Seed Example Curation), Model Val. (Human-Validated Model Selection), Quality (Manual Quality Assessment), and Eval Bench. (VideoKR-Eval Construction).

**Figure.** A multi-panel educational infographic illustrating a noise mapping exercise. The top-left panel displays a film strip of eight video frames showing the setup, sound measuring app, and noise mapping process. The top-right panel presents a question and answer regarding the highest noise reading and its corresponding color category. The bottom-left panel addresses the physical connection of external speakers to a phone. The bottom-right panel provides a mathematical calculation based on sound pressure level (SPL) readings.

**Table.** Pipeline stage capabilities across different models.

Benchmark Construction Details

Appendix B details the composition of the VideoKR‑Eval benchmark.

We build VideoKR‑Eval from three source benchmarks (MMVU, VideoMMMU, SciVideoBench). After multi‑model single‑frame probing we keep only examples that all three models deem to need continuous video understanding (1,254 retained originals). Experts then re‑annotate 746 new QA items for the filtered videos, yielding a final set of 2,000 high‑quality examples.

**Table 9.** Detailed statistics for the VideoKR-Eval benchmark construction. We retain original examples that are judged to require continuous video understanding by all three single-frame probing models, and add expert-reannotated examples for the filtered videos.
