Trust-Region Behavior Blending for on-Policy Distillation

Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov

Trust-Region behavior Blending (TRB) improves on-policy distillation by guiding early student rollouts toward teacher behavior within a strict KL constraint.

How can we improve early-stage on-policy distillation by using a trust-region constraint to blend student and teacher policies during rollout generation?

On-policy distillation (OPD) suffers from poor early-training rollouts, where weak students generate low-quality prefixes that provide unreliable teacher supervision. Trust-Region behavior Blending (TRB) addresses this by replacing the student's early sampling policy with a teacher-guided behavior policy that is explicitly constrained to remain within a local KL trust region of the student. This warmup method anneals the KL budget to zero, ensuring the model returns to pure student rollouts after the initial training phase. Across math-reasoning benchmarks, TRB consistently outperforms vanilla OPD and alternative teacher-injection baselines.

Paper Primer

TRB constructs a behavior policy $\mu^*$ at each prefix by solving a constrained optimization problem: it minimizes the KL divergence to the teacher while keeping the KL divergence to the student below a budget $\epsilon$. This results in a closed-form interpolation between the student and teacher policies, where the interpolation coefficient is determined by the largest feasible value that satisfies the trust-region constraint.

TRB achieves the strongest average performance across multiple math-reasoning distillation settings.

Benchmark comparisons against vanilla OPD, fixed-budget blending, and token-injection baselines (SKD) show TRB consistently yielding higher pass@1 scores. TRB outperforms all compared methods on overall average across two distinct model-pair settings (Qwen3-1.7B/8B and Qwen3-0.6B/4B).

The method's efficiency stems from the local trade-off at each prefix: a small KL budget $\epsilon$ buys a first-order improvement in teacher-closeness while incurring only a second-order cost in behavior-KL divergence from the student. This makes early teacher-guided movement highly efficient for improving the quality of visited prefixes.

Why is a trust-region constraint necessary instead of simply blending the teacher and student policies directly?

Direct blending without a constraint can move the behavior policy too far from the student's current capabilities, potentially creating a distribution mismatch that the student cannot learn from. The trust-region constraint ensures the behavior policy remains within a manageable distance of the student, preserving the learnability of the collected trajectories.

How does TRB differ from existing methods like Speculative Knowledge Distillation (SKD) or Veto?

Unlike SKD, which injects teacher tokens during rollout, or Veto, which modifies the target distribution at visited prefixes, TRB changes the prefix distribution itself by optimizing a behavior policy. It acts one step earlier in the pipeline, focusing on the quality of the prefixes visited during rollout rather than modifying the supervision signal after the fact.

For researchers performing on-policy distillation, TRB provides a robust, annealed warmup strategy that stabilizes early training without requiring persistent off-policy guidance or complex target-side modifications.

Introduction

We expose OPD’s early‑training cold‑start issue and propose TRB to bootstrap student rollouts.

On‑policy distillation (OPD) lets a student generate its own prefixes and then aligns them with a stronger teacher, but the student’s early rollouts are often poor, leaving the teacher to supervise low‑quality data.

OPD is like a student writing draft sentences, then receiving teacher corrections on exactly those drafts rather than on pre‑written examples.

The student samples “B” (probability 0.5).

The teacher predicts “A” with probability 1.0 for the next token.

The student’s predicted distribution (0.2 A, 0.8 B) is compared to the teacher’s, producing a large KL loss.

Early in training the student’s own prefixes can be misleading, so the teacher is forced to correct on noisy data.

How does OPD differ from traditional offline distillation?

Offline distillation trains the student on teacher‑generated prefixes that the student will never see at inference time, creating a distribution shift. OPD eliminates that shift by letting the student generate the prefixes it will actually use, then matching the teacher on those exact trajectories.

To mitigate the cold‑start brittleness, we introduce Trust‑Region behavior Blending (TRB): during an early warm‑up phase the rollout policy is constrained to stay within a KL‑ball around the teacher, then the constraint is gradually removed.

The cold‑start problem in OPD—poor early student prefixes—can be alleviated by a KL‑constrained warm‑up that blends teacher behavior.

Background and Related Work

We situate OPD within prior distillation work and introduce SKD and Fixed‑epsilon Blending as background concepts.

Let $π_S$ denote the student policy and $π_T$ the teacher policy. In On‑Policy Distillation (OPD) the rollout prefixes are drawn from $π_S$ rather than a fixed dataset, and the training objective minimizes the reverse‑KL divergence $D_{KL}(π_θ(\cdot|h)\,\|\,π_T(\cdot|h))$ over student‑generated prefixes $h$. Trust‑Region Behavior Blending (TRB) retains this per‑prefix loss but replaces the sampling policy with a blended behavior policy $μ$, constrained by $D_{KL}(μ \,\|\, π_T)$ for teacher closeness and $D_{KL}(μ \,\|\, π_S) \le \varepsilon$ to keep $μ$ within a student‑centered trust region.

SKD mitigates early‑training mismatch by swapping out low‑quality student tokens with teacher‑approved tokens during rollout generation.

Fixed‑epsilon Blending mixes the student and teacher policies with a constant KL budget, keeping the blended policy within a fixed distance $\epsilon$ from the student.

Trains a student model to match teacher predictions on a fixed data distribution.

Extends KD to generative models by applying distillation on model‑generated sequences.

Samples rollout prefixes from the student policy and minimizes reverse‑KL to the teacher at each prefix.

Derives an on‑policy optimization procedure for reverse‑KL distillation of large language models.

Alters the target distribution at a visited prefix by constructing a bridge between student and teacher logits.

Adds forward‑KL pressure on high‑entropy teacher states to preserve output diversity.

Selects supervision positions based on student entropy and teacher–student divergence.

Introduces intermediate assistants and reasoning traces to bridge the student‑teacher gap in offline chain‑of‑thought distillation.

Trust-Region Behavior Blending

TRB blends student and teacher policies within a KL trust region to guide rollout prefixes.

Early student‑generated rollouts are often poor, which harms on‑policy distillation; the paper therefore needs a way to steer sampling toward the teacher without abandoning the student’s current policy.

TRB keeps the sampling policy inside a KL‑bounded “trust region” around the student while pulling it toward the teacher, so early rollouts get teacher guidance but never stray too far from what the student already knows.

Compute KL($\pi$T‖$\pi$S) ≈ 0.31 > $\epsilon$, so the teacher is infeasible.

Try $\beta$ = 0.5 → $\mu$₀.₅ ∝ $\pi$S^{0.5}·$\pi$T^{0.5} = [√0.7·√0.2, √0.3·√0.8] ≈ [0.374, 0.490]; normalize → $\mu$₀.₅ ≈ [0.433, 0.567].

KL($\mu$₀.₅‖$\pi$S) ≈ 0.14 > $\epsilon$, so $\beta$ is too large.

Binary‑search $\beta$ in [0, 0.5]; try $\beta$ = 0.25 → $\mu$₀.₂₅ ≈ [0.511, 0.489]; KL($\mu$₀.₂₅‖$\pi$S) ≈ 0.07 ≤ $\epsilon$.

$\beta^*$ is the largest feasible value, ≈ 0.40 (interpolated between 0.25 and 0.5). The final blended policy $\mu^*$ ≈ [0.45, 0.55].

Exponentially interpolating the policies yields a distribution that can be tuned precisely to stay inside the KL trust region, something a simple linear mix cannot guarantee.

Given prefix h, retrieve $\pi$S(·|h) and $\pi$T(·|h).

Find $\beta^*$ by binary searching $\beta$∈[0,1] until DKL($\mu$$\beta$‖$\pi$S) ≤ $\epsilon$ (or $\epsilon$k during warmup).

Form the blended policy $\mu^*$ = $\mu$$\beta^*$ using the closed‑form Eq. 2.

Sample the next token a ∼ $\mu^*$ and append it to the prefix.

Repeat until the rollout reaches the desired length.

**Figure 1.** Overview of Trust-Region behavior Blending. At each prefix, the student policy $\pi_S$ defines a KL trust region $D_{KL}(\mu || \pi_S) \leq \epsilon$. TRB then selects the feasible behavior policy $\mu^*$ that is closest to the teacher policy $\pi_T$. The result is teacher-guided behavior that remains close to the student.

How does TRB differ from a naïve linear interpolation $\mu$ = (1‑$\beta$)$\pi$S + $\beta$$\pi$T?

Linear interpolation mixes probabilities additively and does not respect the KL constraint; the resulting distribution can lie arbitrarily far from $\pi$S in KL space. TRB’s exponential interpolation guarantees that for any $\beta$ the KL distance to $\pi$S is a monotonic function, allowing us to pick the largest $\beta$ that still satisfies the trust‑region budget.

Experiments and Results

TRB delivers the strongest final pass@1 scores and the steepest early gains.

We ask whether modest early guidance—via TRB—yields better final OPD performance than vanilla OPD and stronger off‑policy baselines. All methods share the same training protocol; details are in the appendix.

TRB prefixes achieve a +50 % relative success gain over vanilla OPD when continued by the student at prefix length 512.

Figure 4 shows the student‑continuation bar reaching +50 % while the teacher‑continuation bar is only +9 %.

**Figure 2.** Training trajectories on the Qwen3-0.6B-Base $\leftarrow$ Qwen3-4B setup for vanilla OPD, fixed-$\epsilon = 0.01$, SKD ($K = 15$, $\tau_T = 0.2$), and SFT warmup (15 and 25 steps).

**Figure 4.** Relative success gain of TRB prefixes over vanilla-OPD prefixes on the Qwen3-1.7B-Base ← Qwen3-8B setup at step 0, after truncating sampled prefixes at length t and continuing them with either the teacher or the student. Positive bars mean higher success under the same continuation model.

**Table 1.** Benchmark pass@1 results. Bold marks the best result in each column; underline marks the second-best.

Extended Results

TRB adds modest memory overhead during decoding while preserving overall efficiency.

Extended results compare Trust‑Region Behavior Blending (TRB), Sequence Knowledge Distillation (SKD), and vanilla On‑Policy Distillation (OPD) across the two model‑pair settings. Fixed‑$\epsilon$ blending is omitted because it already appears in Table 1 and Figure 2.

Section B.1 reports warm‑up diagnostics for the Qwen3‑1.7B ← Qwen3‑8B pair. As the trust‑region budget $\epsilon$ grows, mean teacher log‑probability and verifier reward increase, while the AUROC of the teacher‑support score drops.

Section B.2 shows a single prompt‑matched rollout from the first warm‑up step. The pure‑student rollout drifts off‑task immediately, whereas the TRB rollout stays aligned with the arithmetic structure of the prompt.

TRB keeps the teacher model alive while the student generates each token, so the extra memory is simply the teacher’s parameters plus its key‑value cache for the current context.

Why does TRB’s memory cost disappear after the warm‑up phase?

During warm‑up the teacher is needed to compute the KL‑constrained blend, so its weights and cache stay resident. Once the interpolation coefficient reaches its final value, the teacher is no longer queried and its state can be freed, reverting memory usage to the OPD baseline.

**Figure 6.** Sweep summary on the Qwen3-0.6B-Base ← Qwen3-4B setup. Each point gives the best-over-training mean score for one hyperparameter setting. TRB points are grouped by warmup horizon and initial budget, SKD points are grouped by $K$ and teacher temperature $\tau_T$, and the dashed red line marks vanilla OPD.

**Figure.** Comparison of model outputs for a math prompt. The left panel shows a "Pure student rollout" exhibiting off-topic drift, while the right panel shows a "TRB rollout" that remains problem-relevant despite some noise.

Section D addresses EOS token mismatches between student and teacher. Both models map their native EOS tokens ($e_S$, $e_T$) to a shared event $e^*$ before sampling, ensuring a single stop event is compared in the KL term.

Derivation of Trust-Region Solution

Derives the trust‑region blending solver that mixes student and teacher policies.

When the student policy is still immature, its sampled prefixes are noisy and hurt distillation; the remedy is to blend the student with the teacher while staying inside a KL trust‑region.

Setting the derivative ∂L/∂$\mu$(a)=0 yields a linear relation in the logarithms of the policies.

Introducing $\beta$ = 1/(1+$\eta$) rewrites the solution as an exponential interpolation between the two policies.

Think of mixing two paints: the student’s color is the base, and the teacher’s color is added in a proportion $\beta$ that is tuned until the mixture stays inside a KL “color‑budget” (the trust‑region).

Compute log‑ratios: r = [log 0.2−log 0.7, log 0.8−log 0.3] ≈ [‑1.252, 0.980].

Pick $\beta$=0.4. Form unnormalized weights: w = $\pi_{S}$ · exp($\beta$ r) ≈ [0.7·$e^{-0.501}$, 0.3·$e^{0.392}$] ≈ [0.7·0.606, 0.3·1.480] ≈ [0.424, 0.444].

Normalize: Z = 0.424+0.444 = 0.868, so $\mu$_$\beta$ = [0.424/0.868, 0.444/0.868] ≈ [0.489, 0.511].

Compute KL to student: `D_KL`($\mu$_$\beta$‖$\pi_{S}$)=0.489·log(0.489/0.7)+0.511·log(0.511/0.3) ≈ 0.12.

Increasing $\beta$ to 0.7 yields $\mu_{0.7}$≈[0.33, 0.67] and `D_KL`≈0.31, confirming monotonic growth.

The blend moves probability mass toward the teacher proportionally to the exponentiated log‑ratio; larger $\beta$ pushes the mixture closer to the teacher while the KL budget grows predictably.

How does this exponential blending differ from a simple linear interpolation $\mu$ = (1‑$\beta$)$\pi_{S}$ + $\beta$$\pi_{T}$?

Linear interpolation mixes probabilities additively, which can violate the KL constraint because the resulting distribution may lie outside the trust‑region. Exponential blending respects the geometry of the KL divergence: the mixture stays on the geodesic between the two policies, guaranteeing that increasing $\beta$ never reduces the KL distance to the student.

Small-Budget Efficiency

Small KL budgets give a first‑order teacher gain at only second‑order student cost.

When the trust‑region budget $\epsilon$ is tiny, moving the rollout distribution $a$ little toward the teacher yields a first‑order reduction in teacher KL while incurring only a second‑order behavior‑KL penalty to the student.

The path reweights the student policy $p$ by the log‑ratio $r=\log q-\log p$, gradually shifting probability mass toward the teacher while staying inside a KL ball.

Changing $\beta$ away from zero perturbs the rollout distribution, but the KL divergence to the original student grows only quadratically for small steps.

Moving a little toward the teacher immediately cuts the KL to the teacher in proportion to the step size.

Enforcing a small KL budget against the student tells us exactly how far we can move toward the teacher.

**Figure 7.** Pooled rollout statistics from the first 25 warmup steps of the Qwen3-1.7B $\leftarrow$ Qwen3-8B setup. Each point corresponds to one trust-region budget $\epsilon$. The horizontal axis shows the mean teacher log-probability on sampled rollouts. The vertical axis shows AUROC for ranking verifier-correct rollouts above verifier-incorrect ones using the sequence-level teacher-support score obtained by averaging $\log \pi_T - \log \pi_S$ over the response. Point color indicates mean verifier reward.

Sequence-Level Control

Enforce per‑token KL limits to keep whole‑sequence divergence under control.

Early in training the student’s own rollouts are noisy, so distillation suffers from a drifting behavior policy.

Instead of letting the student wander arbitrarily, we force every token’s policy to stay inside a tiny KL ball around the teacher; this guarantees the entire generated sequence cannot stray far.

Step 1: $\Delta_1=0.08\le0.1$ satisfies the trust‑region constraint.

Step 2: $\Delta_2=0.05\le0.1$ also satisfies the constraint.

Step 3: $\Delta_3=0.09\le0.1$ satisfies the constraint.

Sum of KLs $=0.08+0.05+0.09=0.22$, which is below the sequence‑level bound $T\epsilon=3\times0.1=0.30$.

Because each token respects the same KL budget, the total divergence never exceeds the linear bound, regardless of how the individual token probabilities fluctuate.

How does token‑level KL clipping differ from clipping the total KL after a full rollout?

Clipping the total KL only checks the aggregate after generation, so early steps can drift arbitrarily and later steps must compensate, which often leads to unstable samples. Token‑level clipping enforces the constraint at every step, preventing drift from ever occurring and guaranteeing the bound holds without post‑hoc correction.

Thus, by bounding the KL locally at each token, the method provides a simple, provable guarantee on sequence‑level divergence while keeping the sampling procedure unchanged.

Discussion and Limitations

TRB’s warm‑up‑only benefit and its practical trade‑offs are examined.

Recall that Trust‑Region Behavior Blending (TRB) blends the student policy with the teacher under a KL‑constrained trust region, preserving the student’s trajectory distribution while nudging it toward the teacher.

Table 1 shows TRB attains the highest average score when it is active only during the warm‑up phase.

When the teacher’s policy is applied continuously, it overwrites the student’s own exploration, preventing the student from learning to recover from its own mistakes; the benefit therefore vanishes once the student’s prefixes have aligned with the teacher.

Figure 2 confirms that a faster early rise or a more direct intervention does not by itself produce the strongest final result.

Figures 3 and 4 reveal that TRB is most helpful while the student’s visited prefixes are still misaligned with the teacher, and that its advantage diminishes once that mismatch disappears.

By moving the behavior policy toward the teacher while explicitly limiting deviation from the student, TRB improves teacher support without fully replacing the student’s trajectory distribution.

Temperature warm‑up also yields more conservative early rollouts, but unlike TRB it does not directly optimize closeness to the teacher under a student‑centered constraint.

Figure 3 tracks teacher token‑mean entropy on the visited prefixes: under TRB the entropy is lower during warm‑up and then largely aligns with vanilla OPD after warm‑up.

Nevertheless, the benchmark curve remains higher for TRB, indicating that the main teacher‑side difference occurs during warm‑up rather than later training.

Figure 4 provides a controlled step‑0 probe of early rollouts; the relative success gain of TRB prefixes over vanilla‑OPD prefixes is positive, confirming higher early success under identical continuation models.

Our study is confined to two math‑reasoning OPD settings with Qwen3‑Base student–teacher pairs and a correctness‑based evaluation protocol, so we do not claim that the same warm‑up schedules transfer unchanged to other domains or teacher–student gaps.

TRB also increases training‑time cost during warm‑up because it requires online teacher decoding and student–teacher co‑residency; a batched teacher pass used in vanilla OPD can be faster in wall‑clock time.

In the settings studied, these costs are temporary: after warm‑up the training reverts to the ordinary OPD runtime profile.

Experimental Details

Details of training stack, hyperparameters, evaluation, and baselines.

We train with the verl pipeline (Sheng et al., 2024) and SGLang (Zheng et al., 2023) for rollout generation, using FSDP2 (Zhao et al., 2023) across eight NVIDIA H100 GPUs.

Each run samples 25,600 prompts from the OpenThoughts3‑1.2M corpus and prepends the system prompt “Please reason step by step, and put your final answer within \boxed{}.” to every input.

We keep the reverse‑KL OPD objective fixed while varying rollout behavior only during warmup; KL is estimated on the student’s top‑k support with $k=16$ tokens, and EOS tokens are canonicalized before KL evaluation.

Rewards are provided by the math‑verify checker: 1.0 for a correct solution and 0.0 otherwise.

**Table 2.** Common training configuration for the blend-based OPD sweeps. Warmup-specific parameters such as blend coefficient schedules, trust-region budgets, and switch-back steps are varied per experiment.

Evaluation uses pass@1, i.e. the fraction of correct solutions among $n$ sampled generations, averaged over problems. We report results on MATH500, AIME24/25, AMC, Olympiad for the Qwen3‑1.7B → 8B setup and on GSM8K, MATH500, AMC, Olympiad for the Qwen3‑0.6B → 4B setup, with generation counts ranging from 32 to 512 per prompt.

**Table 3.** Evaluation configuration.

For the fixed‑$\epsilon$ baseline we keep the KL budget $\epsilon$ constant throughout training. In contrast, TRB schedules $\epsilon$ (initial values $\epsilon_0\in\{0.001,0.005,0.01,0.02,0.05\}$) and solves the per‑prefix teacher strength by bisection, annealing linearly to zero before switching back to pure student decoding.

We compare several baselines: Vanilla OPD (no warmup), Veto (penalizes undesirable outputs with coefficient $\beta_{\text{start}}$), Interleaved teacher injection (SKD) that replaces out‑of‑top‑K student tokens with teacher samples, Temperature warmup (linear schedule from $\tau_0\in\{0.8,0.9,0.95\}$ to 1.0), and Fixed‑$\epsilon$ blending (constant trust‑region budget).

**Table 4.** SFT warmup configuration.

SFT warmup runs supervised fine‑tuning on teacher‑generated responses for up to 50 updates, then uses the checkpoints at 15, 25, 50 updates to initialize ordinary OPD training, matching the prompt order and rollout multiplicity of the online pipeline.

Read the original paper

Open the simplified reader on Paperglide