Scaling Reinforcement Learning with LLMs

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang

Kimi k1.5 scales reasoning performance by training LLMs with reinforcement learning over long-context windows.

How can we scale reinforcement learning (RL) to improve LLM reasoning capabilities beyond the limits of static pretraining data?

Language models are typically limited by the quality and quantity of static pretraining data, creating a bottleneck for continued intelligence scaling. The authors treat reasoning as a reinforcement learning (RL) problem, using long-context windows to allow the model to perform implicit planning, reflection, and correction through auto-regressive token generation. This approach achieves state-of-the-art reasoning performance, matching OpenAI’s o1 on benchmarks like AIME and MATH 500 while outperforming existing short-CoT models by up to 550%.

Paper Primer

The core mechanism is a simplistic RL framework that avoids complex components like Monte Carlo tree search or explicit value functions. Instead, the model learns to generate long chains-of-thought (CoT) by optimizing a policy gradient objective where the reward is determined by the correctness of the final answer.

To manage the computational cost of long sequences, the system uses "Partial Rollouts," which break long reasoning trajectories into segments across training iterations. This allows the model to reuse previous segments from a replay buffer rather than regenerating entire sequences from scratch.

The long-CoT model achieves state-of-the-art reasoning performance across multiple modalities.

The model scores 77.5 on AIME, 96.2 on MATH 500, and 74.9 on MathVista.

Long-to-short distillation significantly improves the efficiency of short-CoT models.

The distilled short-CoT model achieves 60.8 on AIME and 94.6 on MATH 500, outperforming GPT-4o and Claude 3.5 Sonnet by up to 550% on specific benchmarks.

Why does the paper avoid using value functions, which are standard in most RL applications?

The authors hypothesize that value functions are unsuitable for long-CoT because they penalize exploratory reasoning steps that appear incorrect in the short term but lead to the correct final answer. By using only the final answer as a reward, the model learns to recover from errors through trial-and-error.

What is the primary role of the 128k context window in this RL framework?

The increased context length acts as an implicit computational budget for planning. It allows the model to store and reference its own previous reasoning steps, enabling it to perform error identification and backtracking without needing an external search tree.

Scaling RL via long-context windows provides a viable alternative to complex search-based reasoning architectures, enabling models to learn sophisticated planning behaviors directly from final-answer rewards.

Abstract and Overview

We expose the data‑scarcity gap and argue that scaling RL with long contexts can close it.

Next‑token pretraining is bounded by the amount of available data, creating a data‑scarcity bottleneck for further scaling. Scaling reinforcement learning (RL) opens a new axis, letting LLMs improve by learning from reward signals rather than raw text. Existing RL‑based LLMs have yet to match the performance of standard pretraining, leaving a clear gap.

Long‑CoT generates a detailed chain of reasoning that occupies a long context, allowing the model to chain many intermediate steps before producing a final answer.

Short‑CoT compresses the reasoning chain into a brief context, forcing the model to capture the essential logical steps within a limited token budget.

For a toy sequence of $N=4$ tokens, naïve attention computes all pairwise dot products: $4\times4=16$ scalar products.

Storing the resulting $4\times4$ matrix consumes $64$ bytes of memory, assuming $4$ bytes per entry.

If we double the sequence to $N=8$, the matrix grows to $8^2=64$ entries, requiring $256$ bytes—four times the memory.

Memory grows quadratically with sequence length, so naïve attention quickly becomes infeasible for long contexts, motivating the need for efficient RL‑driven scaling strategies.

Scaling Intelligence via RL

We frame the data‑scarcity bottleneck and propose RL‑driven scaling as a new solution.

Scaling reinforcement learning (RL) lets large language models improve reasoning by optimizing against automated rewards, thereby bypassing the data‑scarcity bottleneck of next‑token pretraining. This shift creates a new axis for continued scaling beyond the limits of static datasets.

When models are trained only to predict the next token, they quickly run out of high‑quality data, capping further gains despite larger parameter counts.

Our new model, Kimi k1.5, applies this RL paradigm to a multimodal LLM. By training with rewards rather than a static next‑token objective, we probe whether RL can serve as a fresh scaling lever.

We first extend the context window to $128\text{k}$ tokens and reuse large chunks of previous trajectories (partial rollouts) to avoid regenerating the entire rollout each step. This makes long‑context RL tractable and reveals context length as a key scaling dimension.

For $L=8$, the attention map contains $8 \times 8 = 64$ entries → $64 \times 4\text{ B} = 256\text{ B}$ at 4 bytes per entry.

For $L=128\text{k}$, the map contains $128{,}000^2 \approx 1.64 \times 10^{10}$ entries → $1.64 \times 10^{10} \times 4\text{ B} \approx 65\text{ GB}$.

Memory grows quadratically with context length, so a naïvely materialized attention map becomes infeasible at $128\text{k}$. Generation cost also grows with trajectory length, which is the part that partial rollouts address.
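The arithmetic above can be reproduced with a short sketch. It assumes one dense attention map with 4 bytes per entry, as in the worked numbers; the function name `attention_map_bytes` is ours and purely illustrative.

```python
def attention_map_bytes(seq_len: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed to store one dense seq_len x seq_len attention map."""
    return seq_len * seq_len * bytes_per_entry


for n in (8, 128_000):
    size = attention_map_bytes(n)
    print(f"L={n:>7,}: {size:,} bytes ({size / 1e9:.2f} GB)")
```

Doubling `seq_len` quadruples the result, which is the quadratic growth described above.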

To make RL stable at these lengths we derive a long‑CoT formulation and employ online mirror descent, a variant of policy optimization that handles high‑dimensional action spaces. Additional tricks—effective sampling, a length penalty, and a curated data recipe—further improve robustness.

The resulting framework is deliberately simple: long‑context scaling plus the improved optimizer yields planning, reflection, and correction behaviors without needing Monte‑Carlo tree search, explicit value functions, or separate reward models.

Because the model is trained jointly on text and vision data, it can reason across modalities, enabling tasks that require both linguistic and visual understanding.

We also devise long‑to‑short techniques: applying a length penalty to long‑CoT activations and merging models to transfer the benefits of long contexts into compact short‑CoT variants.

Empirically, the long-CoT version matches state-of-the-art reasoning scores (77.5 AIME, 96.2 MATH 500, 94th percentile on Codeforces, 74.9 MathVista), while the short-CoT variant outperforms existing baselines (e.g., 60.8 AIME, up to +550% over GPT-4o on certain benchmarks).

The shift from static pretraining to RL‑driven scaling unlocks new performance gains without relying on ever‑larger data collections.

Benchmark and Prompt Curation

RL training depends on a prompt set that is diverse, difficulty‑balanced, and reliably verifiable.

The RL prompt set is the primary driver of robust reasoning; its composition directly shapes learning dynamics and guards against reward hacking.

The RL Training Framework

We frame CoT generation as a reinforcement‑learning policy and optimize it with a KL‑regularized objective.

Mapping a problem $x$ to a correct answer $y$ is non‑trivial, especially for complex reasoning tasks. Prior work uses a Chain‑of‑Thought (CoT) sequence $z$ to bridge $x$ and $y$, but generating high‑quality $z$ still requires careful search. Reinforcement Learning (RL) offers a way to treat the whole CoT generation as a policy that can be optimized against a reward signal.

We view the policy model $\pi_\theta$ as a generator of CoT steps and the final answer, and we directly maximize the expected reward while keeping updates close to the previous policy.

Problem $x_1$: policy samples $z^{(1)}$ and answer $y^{(1)}$; the reward model yields $r(x_1, y^{(1)}, y_1^*) = 1$.

Problem $x_2$: policy samples $z^{(2)}$ and answer $y^{(2)}$; the reward model yields $r(x_2, y^{(2)}, y_2^*) = 0$.

Compute the average reward: $\frac{1+0}{2}=0.5$.

Assume the KL divergence between $\pi_{\theta}$ and the reference $\pi_{\theta_i}$ is $0.2$ and $\tau=0.1$.

Regularized objective value: $0.5 - 0.1 \times 0.2 = 0.48$.

The regularizer prevents the policy from over‑fitting to the few successful samples by penalizing large deviations from the previous policy.
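As a minimal sketch, the toy calculation above can be written as a scalar function. It assumes the objective is the average reward minus $\tau$ times the KL term, and it takes the KL value as a given number rather than computing it from two policies; `regularized_objective` is a hypothetical helper name.

```python
def regularized_objective(rewards, kl_divergence, tau):
    """Average reward minus tau times the KL term (toy, scalar version)."""
    mean_reward = sum(rewards) / len(rewards)
    return mean_reward - tau * kl_divergence


# Two problems with rewards 1 and 0, KL = 0.2, tau = 0.1 (the values used above).
value = regularized_objective(rewards=[1.0, 0.0], kl_divergence=0.2, tau=0.1)
print(round(value, 4))  # 0.48
```

A larger `tau` lowers the objective more for the same KL, so the optimizer is pushed to stay closer to the previous policy.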

How does this RL policy optimization differ from classic REINFORCE?

REINFORCE optimizes the expected reward directly with a stochastic gradient estimator but typically lacks any explicit constraint on policy change. Here we add a KL‑regularization term that keeps each update close to the previous policy, turning the update into an online mirror‑descent step and reducing variance.

Policy Optimization and Regularization

Core off‑policy regularized policy‑gradient with length control.

Standard on‑policy policy‑gradient suffers from high variance and limited exploration when the model must generate long chains‑of‑thought. The authors therefore replace on‑policy sampling with an off‑policy reference policy and add an ℓ₂ regularizer, yielding a more stable learning signal.

Instead of updating the policy with samples it just produced, we keep a stale “reference” policy $\pi_{\theta_i}$ to generate candidates, then adjust the current policy $\pi_\theta$ toward those candidates while penalizing large deviations from the reference.

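Compute the unnormalized weights with $\tau=0.5$ and rewards $r_1=1$, $r_2=0$: $w_1=\exp(1/0.5)=e^2\approx7.39$, $w_2=\exp(0/0.5)=1$.

Estimate $\tau\log Z$ from the $k=2$ samples. Because the samples are already drawn from $\pi_{\theta_i}$, no extra probability weights are needed: $\tau\log Z\approx\tau\log\left[\frac{1}{k}\sum_j w_j\right]=0.5\log\left[\frac{7.39+1}{2}\right]\approx0.5\log[4.195]\approx0.5\cdot1.434\approx0.717$.

Mean reward $\bar r = (1 + 0)/2 = 0.5$, used as a practical stand-in for $\tau\log Z$.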

Gradient contribution for sample $(y_1,z_1)$: $\nabla_\theta\log\pi_\theta(y_1,z_1|x)\cdot(1-0.5)-\frac{\tau}{2}\nabla_\theta\left(\log\frac{\pi_\theta(y_1,z_1|x)}{\pi_{\theta_i}(y_1,z_1|x)}\right)^2$.

Gradient contribution for sample $(y_2,z_2)$: $\nabla_\theta\log\pi_\theta(y_2,z_2|x)\cdot(0-0.5)-\frac{\tau}{2}\nabla_\theta\left(\log\frac{\pi_\theta(y_2,z_2|x)}{\pi_{\theta_i}(y_2,z_2|x)}\right)^2$.

The mean-reward baseline $\bar r = 0.5$ is subtracted from each reward, leaving only the advantage $(r - \bar r)$ to drive updates while the regularization term keeps the policy close to its previous version.
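The per-sample update can be illustrated with a small sketch. It assumes the regularizer is the squared log-ratio scaled by $\tau/2$ (our reading of the ℓ₂ regularizer mentioned above), so that each sample's gradient is $\nabla_\theta\log\pi_\theta$ multiplied by a scalar coefficient. The log-ratio values below are made up for illustration, and `gradient_coefficients` is a hypothetical helper name.

```python
def gradient_coefficients(rewards, log_ratios, tau):
    """Scalar that multiplies grad log pi_theta(y_j, z_j | x) for each sample.

    Assumes the per-sample objective
        log_pi * (r - mean_r) - (tau / 2) * log_ratio ** 2
    where log_ratio = log pi_theta - log pi_ref and the reference is frozen,
    so d/d(log_pi) gives (r - mean_r) - tau * log_ratio.
    """
    baseline = sum(rewards) / len(rewards)  # empirical mean reward
    return [(r - baseline) - tau * lr for r, lr in zip(rewards, log_ratios)]


# Rewards 1 and 0 as in the example; log-ratios are illustrative placeholders.
coeffs = gradient_coefficients(rewards=[1.0, 0.0], log_ratios=[0.1, -0.2], tau=0.5)
print([round(c, 3) for c in coeffs])
```

A sample with above-average reward gets a positive coefficient, and the `tau * log_ratio` term shrinks the push once the current policy has already moved away from the reference.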

How does this off‑policy loss differ from a standard on‑policy REINFORCE loss?

Standard REINFORCE samples actions from the current policy and, when it uses a baseline at all, often relies on a learned value estimate. Here we sample from a frozen reference policy, use the empirical mean reward as the baseline instead of a value network, and add an explicit regularization penalty that directly ties the update magnitude to the distance from the reference policy.

To keep generated reasoning chains from exploding in size, the trainer adds a small reward that favors shorter correct answers and penalizes long incorrect ones.

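Compute $\lambda$ for each response as $\lambda = 0.5 - \frac{\text{len} - \text{min\_len}}{\text{max\_len} - \text{min\_len}}$, here for three responses of lengths 12, 18, and 24: $\lambda_1 = 0.5 - (12-12)/(24-12) = 0.5$; $\lambda_2 = 0.5 - (18-12)/12 = 0.0$; $\lambda_3 = 0.5 - (24-12)/12 = -0.5$.

Apply length reward: $\text{len\_reward}_1 = 0.5$ (correct, so $\lambda_1$ is used as is); $\text{len\_reward}_2 = 0.0$ (correct but not among the shorter responses); $\text{len\_reward}_3 = \min(0, -0.5) = -0.5$ (incorrect and longest).

Final reward, adding the length reward to the original binary reward: $[1+0.5,\ 1+0.0,\ 0-0.5] = [1.5,\ 1.0,\ -0.5]$.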

The scheme preserves the advantage of correct answers while actively discouraging wasteful verbosity, especially when the answer is wrong.
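The worked example can be sketched in a few lines. The sketch assumes the rule used above: $\lambda = 0.5 - (\text{len} - \text{min\_len})/(\text{max\_len} - \text{min\_len})$, with $\lambda$ applied to correct responses and $\min(0, \lambda)$ to incorrect ones. Returning zero when all lengths are equal is our assumption to avoid dividing by zero, and `length_rewards` is a hypothetical name.

```python
def length_rewards(lengths, correct):
    """Length reward per response: lambda if correct, min(0, lambda) if incorrect."""
    min_len, max_len = min(lengths), max(lengths)
    if max_len == min_len:
        # Assumption: no length signal when every response has the same length.
        return [0.0] * len(lengths)
    rewards = []
    for length, is_correct in zip(lengths, correct):
        lam = 0.5 - (length - min_len) / (max_len - min_len)
        rewards.append(lam if is_correct else min(0.0, lam))
    return rewards


lengths = [12, 18, 24]
correct = [True, True, False]
base = [1.0 if ok else 0.0 for ok in correct]
shaped = [b + lr for b, lr in zip(base, length_rewards(lengths, correct))]
print(shaped)  # [1.5, 1.0, -0.5]
```

Note that a short incorrect response gets `min(0, lambda) = 0`, so brevity alone is never rewarded when the answer is wrong.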

Why not simply truncate responses to a fixed length instead of using a length penalty?

Hard truncation discards potentially useful reasoning steps and can cut off the final answer, so the response can no longer be scored reliably. The length penalty instead shapes the reward so that brevity is favored while the model can still keep essential intermediate steps when they improve the final answer.

Beyond reward shaping, the authors improve data efficiency by steering which problems are presented to the learner.

Start with easy problems to bootstrap competence, then gradually expose harder examples; simultaneously focus on problems the model still fails on.

How does prioritized sampling avoid over‑focusing on a few hard problems?

Success rates are updated continuously; as the model improves on a previously hard problem, its success rate $s_i$ rises, and because problems are sampled in proportion to $1 - s_i$, its sampling probability falls. This dynamic re-balancing prevents the curriculum from getting stuck on outliers.
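A minimal sketch of prioritized sampling, assuming each problem $i$ is drawn with probability proportional to $1 - s_i$, where $s_i$ is its tracked success rate. The function name, the uniform fallback when every problem is solved, and the example rates are our assumptions.

```python
import random


def sampling_probabilities(success_rates):
    """Probability of drawing each problem, proportional to 1 - success rate."""
    weights = [1.0 - s for s in success_rates]
    total = sum(weights)
    if total == 0:
        # Assumption: fall back to uniform sampling once everything is solved.
        return [1.0 / len(success_rates)] * len(success_rates)
    return [w / total for w in weights]


success_rates = [0.9, 0.5, 0.1]  # easy, medium, hard (illustrative values)
probs = sampling_probabilities(success_rates)
print([round(p, 3) for p in probs])

# Draw a small batch of problem indices with a seeded generator.
rng = random.Random(0)
batch = rng.choices(range(len(success_rates)), weights=probs, k=5)
```

As a problem's success rate approaches 1, its weight approaches 0, so the sampler drifts away from it without any manual rescheduling.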

**Figure 10.** Comparison with using ReST for policy optimization.

Automated Reward Generation

We generate automated test‑case rewards to train the model without hand‑crafted data.

Coding problems often lack test cases, so we need an automated way to generate them as RL rewards. This eliminates the manual effort of writing judges for each problem.

Instead of hand‑crafting test cases, we let a language model synthesize them and validate against multiple ground‑truth solutions, turning the validation outcome into a reward signal.

How does Automated Reward Generation differ from traditional reward shaping in RL?

Traditional shaping adds a handcrafted scalar to the environment reward, whereas Automated Reward Generation creates concrete test cases and uses the binary pass/fail outcome of those cases as the reward, removing the need for manual design.

Provide the problem description and the CYaRon generator interface.

Generate 50 candidate test cases using the base Kimi k1.5 model.

Randomly sample 10 ground‑truth submissions for each candidate.

Execute each test case on the 10 submissions.

Retain a test case if at least 7 submissions produce matching results.

Accept the problem into the training set if ≥9 of the 10 sampled submissions pass all retained test cases.

Step 1: Candidate test cases T₁…T₅ are produced.

Step 2: For each Tᵢ we run submissions S₁, S₂, S₃, obtaining outputs O₁ᵢ, O₂ᵢ, O₃ᵢ.

Step 3: T₁ yields outputs (5, 5, 5) → 3/3 agree → keep T₁.

Step 4: T₂ yields (7, 8, 7) → 2/3 agree → keep T₂.

Step 5: T₃ yields (2, 3, 4) → 1/3 agree → discard T₃.

Step 6: After filtering, we retain T₁, T₂, T₄, T₅ (4 test cases).

Step 7: The problem is added to the training set because 2 of the 3 sampled submissions pass all retained test cases (≥2/3 threshold).

Even with a tiny toy instance, the majority‑vote filter quickly discards noisy candidates while preserving high‑quality test cases.
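The filtering steps can be sketched as follows. This is a simplified illustration: it assumes a submission "passes" a retained test case when its output equals the majority output, and it uses the toy thresholds from the example (2 of 3) rather than the 7-of-10 and 9-of-10 thresholds of the full procedure. Function names are hypothetical.

```python
from collections import Counter


def filter_test_cases(outputs_by_test, min_agree):
    """Keep test cases where at least min_agree submissions give the same output."""
    kept = {}
    for name, outputs in outputs_by_test.items():
        majority_output, votes = Counter(outputs).most_common(1)[0]
        if votes >= min_agree:
            kept[name] = majority_output
    return kept


def accept_problem(outputs_by_test, kept, min_passing):
    """Accept if enough submissions match the majority on every retained test."""
    if not kept:
        return False
    n_submissions = len(next(iter(outputs_by_test.values())))
    passing = sum(
        all(outputs_by_test[name][s] == expected for name, expected in kept.items())
        for s in range(n_submissions)
    )
    return passing >= min_passing


# Outputs of submissions S1..S3 on the first three candidates from the example.
outputs = {"T1": (5, 5, 5), "T2": (7, 8, 7), "T3": (2, 3, 4)}
kept = filter_test_cases(outputs, min_agree=2)
print(sorted(kept))                                   # ['T1', 'T2']
print(accept_problem(outputs, kept, min_passing=2))   # True
```

Here S2 disagrees on T2, so only two submissions pass every retained case, which still meets the toy 2-of-3 threshold.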

From a random sample of 1,000 online contest problems, 614 do not require a special judge. We built 463 generators that each produced ≥40 valid test cases, yielding 323 problems that entered the training set.

For math problems, the Chain-of-Thought reward model is far more accurate than the classic reward model, so we adopt it as the feedback signal during RL training, ensuring more reliable correctness judgments.

Vision data for RL comes from three sources: real‑world images with scientific questions, synthetically generated scenes for spatial reasoning, and text‑rendered screenshots that let the model handle text‑heavy visuals consistently.

Context Compression via Long2Short

Key training variations and their impact on token efficiency.

Instead of having the model produce a long chain-of-thought (Long-CoT) at test time, long2short methods transfer the reasoning behavior of the long-CoT model into a short-CoT model, preserving the essential “thinking steps” while drastically reducing token usage.

How is Long2Short Context Compression different from simply truncating a long response?

Truncation discards information arbitrarily, often removing essential reasoning steps. Long2Short compression, by contrast, selects the shortest *correct* answer from a set of generated traces, preserving the logical conclusion while discarding only redundant wording.

Model merging combines a Long‑CoT checkpoint with a Short‑CoT checkpoint by averaging their weights. Removing this merging step reduces token efficiency, as the resulting model lacks the compact reasoning distilled from the long‑CoT teacher.
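Weight averaging can be sketched with plain dictionaries standing in for checkpoints. This assumes both checkpoints share parameter names and shapes; the `alpha` interpolation parameter and the flat-list representation are ours, chosen only to keep the example self-contained.

```python
def merge_weights(long_cot, short_cot, alpha=0.5):
    """Interpolate two checkpoints; alpha=0.5 is a plain average."""
    if long_cot.keys() != short_cot.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in long_cot:
        a, b = long_cot[name], short_cot[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


# Toy "checkpoints" with a single two-element parameter.
long_ckpt = {"w": [1.0, 3.0]}
short_ckpt = {"w": [3.0, 5.0]}
print(merge_weights(long_ckpt, short_ckpt))  # {'w': [2.0, 4.0]}
```

No training step is involved: the merged model is obtained purely from the two existing sets of weights.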

Shortest rejection sampling draws multiple responses for each query and keeps the shortest correct one for supervised fine‑tuning. If this sampling is omitted, the short‑CoT model is trained on longer, noisier answers, leading to higher token consumption at inference.

DPO forms preference pairs by treating the shortest correct response as positive and longer (or incorrect) responses as negative. Without DPO, the model receives no explicit signal to prefer brevity, so token usage rises while accuracy remains comparable.
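Both data-selection steps can be sketched together. The sketch assumes responses arrive as `(text, is_correct)` pairs and approximates token length by whitespace splitting; a real pipeline would use the model's tokenizer. Function names are hypothetical.

```python
def token_count(text):
    """Crude length proxy: number of whitespace-separated tokens."""
    return len(text.split())


def shortest_correct(responses):
    """Shortest rejection sampling: return the shortest correct response, if any."""
    correct = [text for text, ok in responses if ok]
    return min(correct, key=token_count) if correct else None


def build_preference_pairs(responses):
    """Pair the shortest correct response (chosen) with longer or incorrect ones."""
    chosen = shortest_correct(responses)
    if chosen is None:
        return []
    return [
        (chosen, text)
        for text, ok in responses
        if text != chosen and (not ok or token_count(text) > token_count(chosen))
    ]


responses = [
    ("x = 4", True),
    ("first add 2 and 2 so x = 4", True),
    ("x = 5", False),
]
print(shortest_correct(responses))        # x = 4
print(build_preference_pairs(responses))
```

The first function yields the supervised fine-tuning target, and the second yields `(chosen, rejected)` pairs in the shape a DPO trainer typically expects.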

Long2short RL adds a length penalty and caps the maximum rollout length after the standard RL phase. Eliminating this second RL stage leaves the model free to generate arbitrarily long answers, sacrificing the token‑efficiency gains demonstrated in the ablations.

The pretraining pipeline proceeds in three stages: vision‑language pretraining, a cooldown phase that consolidates multimodal skills, and a long‑context activation that extends sequence length to 131 k tokens. Each stage contributes distinct capabilities, and skipping any stage degrades either visual grounding or long‑context reasoning.

Vanilla supervised finetuning builds a 1 M‑example text corpus (500 k QA, 200 k coding, 200 k math/science, 5 k creative, 20 k long‑context) and a matching 1 M text‑vision set. Removing any of these sub‑datasets reduces coverage of the target domains, leading to poorer performance on the corresponding tasks.

The RL infrastructure consists of rollout workers that generate trajectories, a central master that coordinates them, reward models that score the trajectories, a replay buffer that stores trajectories and unfinished segments, and separate policy and reference models for updates. Disabling partial rollouts, which cap the tokens generated per iteration and resume unfinished trajectories later, would increase compute cost without improving final performance.

**Figure 7.** Long2Short Performance. All the k1.5 series demonstrate better token efficiency compared to other models.

Large-Scale RL Infrastructure

Scaling RL for LLMs needs a system that keeps every GPU busy while handling long reasoning traces.

Full‑trajectory rollouts of long chain‑of‑thought (CoT) prompts can dominate compute, leaving many GPUs idle while a single rollout monopolises the system.

The system decouples generation (rollout workers) from learning (trainer workers) via a central master and a replay buffer, so that any worker can keep a GPU busy regardless of how long a single trajectory is.

How does this architecture differ from a naïve on‑policy RL loop that rolls out every episode end‑to‑end?

In a naïve loop the policy would wait for a full trajectory before any gradient is computed, causing long‑CoT episodes to block all GPUs. Here the replay buffer lets trainers consume whatever partial segments are ready, while the master keeps the pipeline flowing.

Think of a long book as a series of bookmarked pages: the system reads a few pages, stores the bookmark, and resumes later, so the reader never has to hold the whole book in memory.

Iteration 1: with a per-iteration budget of five tokens, a rollout worker emits “t₁ t₂ t₃ t₄ t₅”; the trajectory is unfinished, so this segment is saved in the replay buffer.

Iteration 2: a rollout worker (possibly a different one) fetches the saved segment and continues the generation, producing “t₆ t₇ t₈”.

Trainer workers can meanwhile consume finished trajectories from the buffer while the unfinished 5-token segment waits to be resumed.

When the next rollout request arrives, the buffer supplies the saved 5-token segment as the prefix, avoiding any re-generation of “t₁ … t₅”.

Partial rollouts turn a monolithic 8-token generation into two cheaper segments, keeping all GPUs busy and limiting the generation done in any one iteration to roughly a fraction $B / \text{total tokens}$ of the full trajectory, where $B$ is the per-iteration token budget.
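The 8-token example can be simulated with a small sketch. It assumes a fixed per-iteration token budget and a replay buffer that holds unfinished prefixes; the "policy" is replaced by slicing a known token list, so this illustrates only the bookkeeping, not real generation. All names are hypothetical.

```python
from collections import deque


def run_partial_rollouts(full_tokens, budget):
    """Produce full_tokens in segments of at most `budget` tokens per iteration."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    replay_buffer = deque([[]])  # unfinished prefixes; start from an empty one
    segments = []
    finished = None
    while replay_buffer:
        prefix = replay_buffer.popleft()
        # Stand-in for the policy: continue right after the saved prefix.
        new_segment = full_tokens[len(prefix):len(prefix) + budget]
        segments.append(new_segment)
        trajectory = prefix + new_segment
        if len(trajectory) < len(full_tokens):
            replay_buffer.append(trajectory)  # unfinished: resume next iteration
        else:
            finished = trajectory  # complete: ready to be scored
    return segments, finished


tokens = [f"t{i}" for i in range(1, 9)]
segments, trajectory = run_partial_rollouts(tokens, budget=5)
print(segments)    # [['t1', ..., 't5'], ['t6', 't7', 't8']]
print(trajectory)  # the stitched 8-token trajectory
```

Each loop pass generates at most `budget` new tokens, and the saved prefix is reused rather than regenerated, which is the saving described above.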

Why can’t we simply truncate long responses instead of using partial rollouts?

Truncation discards the tail of the answer, which in RL is the part that the reward model evaluates for correctness. Partial rollouts preserve the full answer by stitching together segments, so the model still receives feedback on the entire reasoning chain.

**Figure 3.** Large Scale Reinforcement Learning Training System for LLM

**Figure 4.** Hybrid Deployment Framework

Performance and Scaling Results

Scaling RL improves reasoning while keeping token use efficient.

The long2short RL algorithm achieves the highest token efficiency, reaching a Pass@1 of 88.2 on MATH-500 while using the same token budget as other short models.

The k1.5-shortest model attains 88.2 Pass@1 on MATH-500 with token usage comparable to other short variants.

The Kimi k1.5 series groups several variants (long-CoT, short-CoT, shortest) that share a common backbone but differ in training regime and token budget, enabling systematic study of scaling and token efficiency.

How does the Kimi k1.5 series differ from other LLM families like GPT‑4 or LLaMA?

Unlike those families, the Kimi k1.5 series is built from a single base model that is repeatedly fine‑tuned with distinct objectives (RL, DPO, merging) to produce a family of size‑matched variants, allowing direct comparison of token efficiency and reasoning performance under controlled conditions.

**Figure.** Performance comparison of Kimi k1.5 long-CoT against various models across Math, Code, and Vision benchmarks.

**Figure 2.** Kimi k1.5 short-CoT results.

**Table 3.** Performance of Kimi k1.5 short-CoT and flagship open-source and proprietary models. VLM model performance was obtained from the OpenCompass benchmark platform (https://opencompass.org.cn/).

**Figure 6.** Model Performance Increases with Response Length

**Figure 8.** Model Performance vs Response Length of Different Model Sizes

Training a mid‑sized model up to a 128 k context length yields steadily longer responses and higher accuracy; harder benchmarks exhibit steeper length growth, indicating the model learns to elaborate more on complex problems.

The long2short RL algorithm outperforms DPO, shortest‑rejection sampling, and model‑merge baselines in token efficiency, achieving the best Pass@1 scores while keeping token budgets comparable.

Ablations show (1) extending context length can let a smaller model match a larger one’s performance, (2) incorporating negative gradients improves sample complexity over ReST, and (3) curriculum sampling beats uniform sampling by progressively focusing on harder questions.

Summary of Contributions

We summarize our key contributions on long‑context RL training and policy optimization.

We introduce k1.5, a multimodal LLM trained with reinforcement learning, and highlight that scaling context length is the primary driver of continued performance gains. Our system incorporates optimized learning algorithms and infrastructure tricks such as partial rollouts to make long‑context RL training practical.

We combine several techniques to strengthen policy optimization, notably formulating long‑CoT RL with LLMs and deriving a variant of online mirror descent for robust updates. We also explore sampling strategies, apply a length penalty, and refine the data recipe to boost RL performance. These improvements enable strong results without resorting to more complex methods such as Monte Carlo tree search or explicit value functions.

We observe that long2short methods markedly improve short‑CoT models and can be iteratively combined with long‑CoT RL to increase token efficiency within a fixed context budget. Future work should address better credit assignment and reduce overthinking while preserving the model’s exploration capabilities.

Data Annotation and Strategy

We disclose the data pipeline that underpins our multimodal pretraining.

Data Annotation lists the contributors in alphabetical order by first name; an asterisk marks authors who have left the team.

Reinforcement‑learning (RL) efficiency depends on the quality of the underlying base model, and recent frontier models demonstrate that superior pretraining data is essential for high performance. Many open‑source projects hide their data pipelines, which hampers community reproducibility. We therefore provide a full disclosure of our multimodal pretraining data recipe.

Our pretraining corpus spans five domains—English, Chinese, Code, Mathematics & Reasoning, and Knowledge—each subjected to domain‑specific filtering and validation to guarantee high‑quality inputs.

For English and Chinese text we apply a multi‑dimensional quality framework: rule‑based heuristics remove duplicates, machine‑translated text, and spam; FastText classifiers assess linguistic coherence; embedding similarity scores prune near‑duplicates while preserving semantic variety; and LLMs score documents for coherence, informativeness, and educational value. The final document score drives dynamic sampling, up‑sampling high‑quality texts and down‑sampling lower‑quality ones during training.

Code data are processed in two stages: pure code files undergo rule‑based cleaning and language‑imbalance correction—down‑sampling markup languages (JSON, YAML, YACC) and up‑sampling 32 major programming languages such as Python, C, C++, Java, and Go. Text‑code interleaved samples are retrieved with an embedding‑based method to retain diversity and quality.

Mathematics & Reasoning data require specialized OCR pipelines; we first apply a FastText filter to discard obvious noise, then a fine‑tuned language model performs a second pass, yielding a high‑precision mathematical dataset.

Knowledge documents—exercises, textbooks, and research papers—are annotated with OCR quality metrics, educational‑value scores, and document‑type labels. We filter out low‑quality OCR outputs, prioritize high‑pedagogical relevance, and then apply a sampling strategy that up‑samples the most valuable subsets while preserving a balanced representation of other types.
