Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Renjie Mao, Xiangxin Zhou, Lvfang Tao, Yixin Ding, Yu Shi, Yongguang Lin, Yuheng Wu, Honglin Zhu, Qian Qiu, Wenxi Zhu

CPPO replaces uniform token-level trust regions with position-weighted thresholds and cumulative prefix budgets.

How can we improve LLM reinforcement learning by replacing uniform trust-region constraints with position-aware, cumulative budget-based masking?

Reinforcement learning for LLMs often uses uniform trust-region thresholds across all tokens, which fails to account for the autoregressive nature of generation. Early token deviations propagate through the entire remaining sequence, while late-stage exploration is unnecessarily constrained by the same static limits. CPPO (Cumulative Prefix-divergence Policy Optimization) introduces a position-weighted threshold that tightens constraints at early positions and a cumulative prefix budget that dynamically restricts updates as historical drift accumulates. This structured approach consistently outperforms uniform baselines, achieving significant gains in reasoning accuracy across multiple model scales.

Paper Primer

CPPO aligns policy updates with the autoregressive structure of LLMs by replacing pointwise divergence limits with two coupled constraints. It uses a decreasing position weight to enforce stricter limits on early tokens—where deviations cascade into larger sequence-level shifts—and a cumulative prefix budget that tightens the trust region as the generated history drifts from the rollout policy.

CPPO significantly improves reasoning accuracy and training stability compared to uniform trust-region methods.

Across four Qwen3 model settings, CPPO consistently achieved the highest AIME24/25/26 average scores. Its margin over the second-best method ranges from 0.91 to 5.56 absolute points; the largest gap, 5.56 points, is over the matched DPPO baseline on Qwen3-30B-A3B-Base.

The performance gain is driven by the structured allocation of the divergence budget rather than the divergence metric itself. Ablations confirm that the position-weighted schedule and the cumulative prefix budget provide independent, complementary improvements, and that the autoregressive order of these weights is critical to the observed stability.

Why does a uniform threshold perform poorly in LLM reinforcement learning?

Uniform thresholds ignore autoregressive asymmetry: early tokens condition the entire subsequent response, so identical divergence at early positions causes much larger sequence-level drift than at late positions. Furthermore, they ignore cumulative prefix drift, allowing the model to continue deviating even after the historical context has already shifted significantly from the rollout policy.

How does CPPO differ from existing trust-region methods like DPPO?

While DPPO and CPPO use the same Top-K reduced-TV divergence measure, DPPO applies a uniform threshold across all tokens. CPPO uses the same divergence measure but masks updates based on a position-dependent schedule and a cumulative prefix budget, ensuring that the total allowed deviation is managed relative to the generation history.

Abstract

CPPO introduces position‑aware thresholds and a cumulative prefix budget to stabilize RLVR training and boost reasoning accuracy.

Reinforcement learning with verifiable rewards (RLVR) is now the standard way to train LLMs for reasoning, but existing PPO‑style trust‑region methods apply the same divergence limit to every token regardless of position. This uniform treatment ignores autoregressive asymmetry: early token deviations compound across the whole sequence, so a single limit under‑regulates early tokens and over‑constrains later ones. It also neglects cumulative prefix drift, granting the same allowance no matter how far the generated prefix has already diverged. We propose CPPO, which uses a position‑weighted threshold to tighten early‑position limits and a cumulative prefix budget to cap total drift, yielding more stable training and higher reasoning accuracy across model sizes.

The Problem with Uniform Trust Regions

We expose why uniform token-level trust regions destabilize autoregressive RL for LLMs.

Standard RL trust‑region mechanisms apply the same divergence limit to every token, but in autoregressive generation early tokens shape the entire future, so a uniform bound underestimates early drift and over‑constrains later steps. Moreover, each token’s divergence accumulates along the generated prefix, meaning the policy may already be far off‑policy before the current update, yet the same static threshold is still applied. We formalize token position in the finite‑horizon error bound and evaluate the CPPO token‑level mask under matched RLVR settings, showing that position‑aware constraints improve training stability.

RLVR lets an LLM generate a response, then a verifier checks its correctness and returns a scalar reward, which the policy optimizes via a PPO‑style token‑level objective.

A uniform trust region caps the divergence between the current policy and the rollout policy by the same threshold for every token, regardless of its position in the generated sequence.

Uniform constraints fail in autoregressive generation because they miss early‑token impact and cumulative prefix drift.

Formalizing Sequential Decision Problems

Finite‑horizon RL for LLMs and the token‑level trust‑region tools it relies on.

Reinforcement learning for LLMs is a finite‑horizon sequential decision problem. Given a prompt $x$, a policy $\pi$ generates a response $y=(y_1,\dots,y_T)$ token by token; at step $t$ the state is $s_t=(x,y_{<t})$ and the action is the next token $y_t$. After the full response a verifier returns a scalar reward $R(x,y)$ and the training objective is $J(\pi)=\mathbb{E}_{x,y\sim\pi}[R(x,y)]$.

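TV divergence quantifies the worst‑case probability shift between two token distributions: the largest difference in probability that the two distributions assign to any event, or equivalently the smallest amount of mass that must be moved to turn one distribution into the other. For distributions $p$ and $q$ over tokens, $D_{\text{TV}}(p,q)=\tfrac12\sum_a |p(a)-q(a)|$.

As a two‑token example, suppose one policy assigns probabilities $(0.8, 0.2)$ to tokens $a$ and $b$ and the other assigns $(0.5, 0.5)$.

Compute absolute differences: $|0.8-0.5|=0.3$ for $a$, $|0.2-0.5|=0.3$ for $b$.

Sum them: $0.3+0.3=0.6$.

Apply the $\tfrac12$ factor: $D_{\text{TV}}=0.3$.

Even though the second distribution spreads probability evenly across both tokens, the TV distance is only $0.3$: a seemingly modest TV value can correspond to moving 30% of the probability mass.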
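The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `tv_distance` is illustrative, and the two distributions are the $(0.8, 0.2)$ and $(0.5, 0.5)$ pairs read off from the absolute differences in the example.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions
    given as probability lists over the same tokens."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


first = [0.8, 0.2]    # probabilities of tokens a and b under one policy
second = [0.5, 0.5]   # probabilities of the same tokens under the other
d_tv = tv_distance(first, second)
print(round(d_tv, 6))  # prints 0.3
```

Because TV distance is symmetric, it does not matter which of the two distributions is treated as the rollout policy.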

Uniform token‑level thresholds treat every position identically, but early tokens influence the entire suffix while late tokens affect only a short tail. Moreover, a sequence of small per‑token deviations can accumulate into a large drift of the conditioning prefix, which a static threshold cannot detect.

The CPPO Mechanism

CPPO replaces uniform trust regions with position‑weighted thresholds and a cumulative prefix budget.

Uniform token‑level trust regions treat every generation step alike, over‑penalising late tokens while under‑constraining early ones that steer the whole suffix.

CPPO reshapes the trust‑region budget so that early divergences are tightly capped and later divergences are gradually relaxed, while also limiting the cumulative drift of the generated prefix.

Because an early token influences a longer suffix, we weight its allowed divergence by a decreasing factor $w_t$ that mirrors the remaining‑horizon multiplier $\lambda_t$.

Instead of checking the divergence only at the final sequence, CPPO requires every prefix to respect a weighted average budget $\delta_{b}$, preventing early bursts of error from being hidden by later relaxation.

Enforcing the prefix budget at every intermediate step requires a per‑token mask that adapts to the accumulated weighted divergence.

The mask keeps an update only if the current weighted divergence stays below both the static token‑level cap $\delta$ and the dynamic budget left after the previous prefix.

Initialize $S_0 \leftarrow 0$, $W_0 \leftarrow 0$.

For $t = 1$ to $T$:

Compute $w_t = 1 - \frac{t-1}{T-1}\,(1-w_{\min})$.

Set $Z_t = w_t D_t$.

Compute $c_t = \min\{\delta,\ \delta + \delta_b W_{t-1} - S_{t-1}\}$.

If $Z_t \le c_t$ then $M_t \leftarrow 1$, else $M_t \leftarrow 0$.

Update $S_t \leftarrow S_{t-1} + Z_t$, $W_t \leftarrow W_{t-1} + w_t$.

Return the mask $M_{1:T}$.

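As a worked trace, take $T=5$, $w_{\min}=0.4$, $\delta=0.2$, $\delta_b=0.15$, and token divergences $D_{1:5}=(0.05,\ 0.12,\ 0.18,\ 0.22,\ 0.30)$.

t=1: $w_1=1.0$, $Z_1=1\cdot0.05=0.05$, $c_1=\min\{0.2,\ 0.2+0.15\cdot0-0\}=0.2$ → $M_1=1$; $S_1=0.05$, $W_1=1.0$.

t=2: $w_2=0.85$, $Z_2=0.85\cdot0.12=0.102$, $c_2=\min\{0.2,\ 0.2+0.15\cdot1-0.05\}=\min\{0.2,\ 0.30\}=0.2$ → $M_2=1$; $S_2=0.152$, $W_2=1.85$.

t=3: $w_3=0.70$, $Z_3=0.70\cdot0.18=0.126$, $c_3=\min\{0.2,\ 0.2+0.15\cdot1.85-0.152\}\approx\min\{0.2,\ 0.326\}=0.2$ → $M_3=1$; $S_3=0.278$, $W_3=2.55$.

t=4: $w_4=0.55$, $Z_4=0.55\cdot0.22=0.121$, $c_4=\min\{0.2,\ 0.2+0.15\cdot2.55-0.278\}\approx\min\{0.2,\ 0.305\}=0.2$ → $M_4=1$; $S_4=0.399$, $W_4=3.10$.

t=5: $w_5=0.40$, $Z_5=0.40\cdot0.30=0.12$, $c_5=\min\{0.2,\ 0.2+0.15\cdot3.10-0.399\}=\min\{0.2,\ 0.266\}=0.2$ → $M_5=1$; $S_5=0.519$, $W_5=3.50$.

The mask stays 1 throughout because every weighted divergence $Z_t$ stays below its cap $c_t$. In this trace the prefix term $\delta+\delta_b W_{t-1}-S_{t-1}$ never drops below $\delta$, so the static cap is the binding one; the prefix budget tightens $c_t$ only once $S_{t-1}$ exceeds $\delta_b W_{t-1}$. The raw divergences $D_4=0.22$ and $D_5=0.30$ exceed $\delta$, so a uniform threshold applied to $D_t$ would mask them, whereas the decreasing weights $w_4=0.55$ and $w_5=0.40$ bring $Z_4$ and $Z_5$ under the cap. This illustrates how later tokens are allowed more slack while early tokens are held to the full weight.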
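The sequential procedure can be sketched in plain Python as follows. This is an illustrative transcription of the listed steps, not the authors' implementation: the function name `cppo_mask` and the guard for $T=1$ are assumptions, and the inputs ($T=5$, $w_{\min}=0.4$, $\delta=0.2$, $\delta_b=0.15$, $D=(0.05, 0.12, 0.18, 0.22, 0.30)$) are the values consistent with the trace above.

```python
def cppo_mask(D, delta, delta_b, w_min):
    """Sequential CPPO mask for one response.

    D is the list of token-level divergences D_1..D_T. Returns the binary
    mask together with the final prefix sums S_T and W_T.
    """
    T = len(D)
    S, W = 0.0, 0.0  # prefix sums of weighted divergence and of weights
    mask = []
    for t in range(1, T + 1):
        # Linear position weight: 1 at t = 1, w_min at t = T.
        # (The T == 1 guard is an assumption to avoid division by zero.)
        w = 1.0 if T == 1 else 1.0 - (t - 1) / (T - 1) * (1.0 - w_min)
        Z = w * D[t - 1]
        # Static cap delta, tightened once the prefix has overspent its budget.
        c = min(delta, delta + delta_b * W - S)
        mask.append(1 if Z <= c else 0)
        # As in the listed steps, the prefix sums are updated unconditionally.
        S += Z
        W += w
    return mask, S, W


D = [0.05, 0.12, 0.18, 0.22, 0.30]
mask, S, W = cppo_mask(D, delta=0.2, delta_b=0.15, w_min=0.4)
print(mask, round(S, 3), round(W, 2))
```

For these inputs every token is kept. By contrast, `cppo_mask([0.5, 0.01], 0.2, 0.15, 0.4)` masks both tokens: the first exceeds the static cap, and the second is rejected despite its tiny divergence because the prefix has already overspent the cumulative budget.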

With the mask in place, CPPO simply plugs $M_t^{\text{CPPO}}$ into the standard PPO surrogate, preserving all other components of the loss.
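To make the plug-in step concrete, here is a hedged NumPy sketch of a clipped PPO surrogate with a binary per-token mask multiplied in. The function name, the clip range `eps=0.2`, and the normalisation by the number of valid tokens are all assumptions for illustration; the source only states that the mask multiplies the token-level ratio-advantage terms.

```python
import numpy as np


def masked_ppo_loss(ratio, advantage, mask, valid, eps=0.2):
    """Clipped PPO surrogate with a binary trust-region mask per token.

    All arrays have shape (batch, T). `valid` marks non-padding tokens and
    `mask` is the per-token trust-region mask (1 = keep, 0 = drop).
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    per_token = np.minimum(unclipped, clipped)
    keep = mask * valid  # masked or padded tokens contribute nothing
    # Assumed normalisation: average over valid tokens, not over kept tokens.
    return -(per_token * keep).sum() / max(valid.sum(), 1)


ratio = np.array([[1.1, 1.5]])
advantage = np.array([[1.0, 1.0]])
mask = np.array([[1.0, 0.0]])   # second token falls outside the trust region
valid = np.array([[1.0, 1.0]])
loss = masked_ppo_loss(ratio, advantage, mask, valid)
```

Since the mask only zeroes individual terms, swapping a uniform mask for a position-aware one changes no other part of this computation.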

Empirical Evaluation

CPPO outperforms all baselines, delivering up to 5.56 points higher validation scores.

DPPO limits the per‑token divergence between the rollout policy and the target policy, enforcing a trust region that keeps updates within a prescribed KL/TV budget.

How does DPPO differ from standard PPO’s clipping rule?

Standard PPO clips the probability ratio $r$ to stay near 1, which bounds policy change indirectly. DPPO replaces that heuristic with an explicit divergence constraint (KL or TV), guaranteeing that the new policy’s token‑level distribution cannot drift beyond the prescribed budget.

CPPO achieves the highest validation AIME24/25/26 Avg@16 score across all four Qwen3 settings.

On Qwen3‑30B‑A3B‑Base, CPPO reaches 54.79 % while the next‑best method, DPPO, scores 49.23 %.

The results (Table 1) show CPPO consistently leading, with margins over the second‑best method ranging from 0.91 points on Qwen3‑1.7B‑Base to 5.56 points on Qwen3‑30B‑A3B‑Base. The improvement persists despite matching DPPO on the underlying divergence estimator and threshold scale, which indicates that the position‑weighted and prefix‑budget constraints are the source of the advantage.

**Table 1.** CPPO consistently outperforms all baselines across all settings. Specifically, CPPO attains the best AIME24/25/26 Avg@16 in all settings, reaching 31.88, 12.78, 31.11, and 54.79 on Qwen3-1.7B, Qwen3-1.7B-Base, Qwen3-8B-Base, and Qwen3-30B-A3B-Base, respectively. The margins over the second-best method are 3.06, 0.91, 1.39, and 5.56 absolute points. The performance of the baselines varies across models. CISPO attains the second-highest validation performance on the 1.7B models, while MinPRO and DPPO rank second on 8B-Base and 30B-A3B-Base, respectively.

**Figure 4.** Validation AIME24/25/26 Avg@16 curves for the three Base-model runs.

Ablation studies (Figure 5) dissect CPPO’s two constraints. Removing the position weight ($w_t$) or the prefix budget ($\delta_{b}$) each degrades performance relative to the full CPPO mask, confirming that both components contribute independently.

**Figure 5.** Ablation studies comparing CPPO variants against DPPO-TV across three different experimental setups: Single Mechanism Ablation, Position-Weight Ordering, and Mask vs. Soft Gate. All plots show AIME24/25/26 Avg@16 performance over 1000+ training steps.

**Binary vs. Top-K approximation.** The right panel replaces the Top-K reduced-TV score with the simpler Binary-TV partition used by DPPO. Both configurations maintain similar validation scores and consistently outperform DPPO. This indicates the performance improvement is robust to the choice of divergence metric and its approximation granularity, aligning with the DPPO ablation of Qi et al. (2026) that finds Binary and Top-K estimators yield comparable results. The structured allocation of the budget across positions and prefixes, not the divergence estimator, drives the improvement.
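The two estimators can be illustrated with a small NumPy sketch. The construction below is an assumption about what "Binary" and "Top-K reduced" TV mean, since the text does not define them: the vocabulary is partitioned into a few individually tracked tokens plus one merged "rest" bucket, and TV is computed on the reduced distributions. Binary tracks only the sampled token; Top-K tracks the $K$ most likely rollout tokens together with the sampled one. All function names and the example distributions are hypothetical.

```python
import numpy as np


def reduced_tv(mu, pi, keep_idx):
    """TV between mu and pi after merging every token outside `keep_idx`
    into a single 'rest' bucket."""
    keep_idx = np.unique(keep_idx)
    mu_red = np.append(mu[keep_idx], 1.0 - mu[keep_idx].sum())
    pi_red = np.append(pi[keep_idx], 1.0 - pi[keep_idx].sum())
    return 0.5 * np.abs(mu_red - pi_red).sum()


def binary_tv(mu, pi, sampled):
    """Two-bucket partition: the sampled token versus everything else."""
    return reduced_tv(mu, pi, np.array([sampled]))


def topk_tv(mu, pi, sampled, k):
    """Track the k most likely rollout tokens plus the sampled token."""
    top = np.argsort(mu)[-k:]
    return reduced_tv(mu, pi, np.append(top, sampled))


mu = np.array([0.50, 0.30, 0.10, 0.06, 0.04])   # rollout distribution
pi = np.array([0.40, 0.20, 0.05, 0.30, 0.05])   # target distribution
full = 0.5 * np.abs(mu - pi).sum()
b = binary_tv(mu, pi, sampled=0)
k2 = topk_tv(mu, pi, sampled=0, k=2)
print(round(b, 3), round(k2, 3), round(full, 3))
```

Under this construction, merging tokens can only lose information, so the binary score never exceeds the Top-K score, which in turn never exceeds the full TV distance. Both remain conservative lower bounds on the true token-level divergence.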

**Table 2.** Trust-region methods grouped by the statistic they use and where the constraint is applied. Here token-level TV/KL denotes the divergence between rollout and target next-token distributions at a sampled prefix. In the matched comparison, DPPO and CPPO use a matched Top-K reduced-TV statistic; prefix-ratio objectives such as MinPRO are not divergence-budget methods.

**Table 4.** Detailed per-benchmark evaluation results. This table expands the aggregate results from Table 1 (%, Avg@16). AVG denotes the best AIME24/25/26 Avg@16 within the matched evaluation window; the per-benchmark columns report the AIME24/AIME25/AIME26 scores at this best-average checkpoint, not their individual maxima. *Collapse* indicates training divergence, which only occurred for CISPO on Qwen3-30B-A3B-Base at step 215. Within each model block, the best score in a column is in **bold** and the second-best is <u>underlined</u>.

Theoretical Foundations

Formal proofs of the CPPO policy‑improvement guarantees.

We now present the full theoretical arguments that justify CPPO’s stability claims. The lemmas and proposition establish intermediate bounds, culminating in the main policy‑improvement theorem.

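Under common support, let $\rho_{a:b} := \prod_{j=a}^{b}\rho_j$ denote the product of sampled importance ratios, with the convention $\rho_{T+1:T} := 1$. Then $J(\pi) - J(\mu) = L'_{\mu}(\pi) - \Delta(\mu,\pi)$, where $L'_{\mu}(\pi)$ is the first‑order surrogate evaluated under $\mu$ and $\Delta(\mu,\pi)$ is the residual that couples each sampled‑token ratio with the likelihood ratio of its future suffix. Consequently $J(\pi) - J(\mu) \ge L'_{\mu}(\pi) - |\Delta(\mu,\pi)|$.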

Rewrite the sum of importance‑sampling ratios using telescoping.

Substitute the telescoped expression into the performance difference.

Derive the lower‑bound.
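The telescoping step is not spelled out here. A standard identity consistent with the convention $\rho_{T+1:T} := 1$ is $\prod_{t=1}^{T}\rho_t - 1 = \sum_{t=1}^{T}(\rho_t - 1)\,\rho_{t+1:T}$, and the sketch below checks it numerically. Treating this as the identity the proof relies on is an assumption; the ratios are random illustrative values.

```python
import math
import random


def telescoped(rho):
    """Compute sum_t (rho_t - 1) * rho_{t+1:T}, where the empty product
    rho_{T+1:T} equals 1."""
    total = 0.0
    for t in range(len(rho)):
        suffix = math.prod(rho[t + 1:])  # 1.0 when t is the last index
        total += (rho[t] - 1.0) * suffix
    return total


random.seed(0)
rho = [random.uniform(0.5, 1.5) for _ in range(8)]  # hypothetical ratios
lhs = math.prod(rho) - 1.0
rhs = telescoped(rho)
print(abs(lhs - rhs) < 1e-9)
```

Splitting each suffix product as $\rho_{t+1:T} = 1 - (1 - \rho_{t+1:T})$ then separates a first-order term from a residual that depends on future ratios, which matches the surrogate-plus-residual shape of the statement above.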

If $D_j \le \ell_j$ pathwise for all $j>t$, then for any $s_{t+1}$, $D_{\text{TV}}\bigl(P_{t+1:T}^{\mu}(\cdot\mid s_{t+1}),\,P_{t+1:T}^{\pi}(\cdot\mid s_{t+1})\bigr) \le \sum_{j=t+1}^{T}\ell_j$.

Construct a maximal coupling of the suffix processes.

Apply the union bound over the suffix.

If $D_t \le \ell_t$ for all $t$, then $\bigl|\Delta(\mu,\pi)\bigr| \le \frac{4\xi}{T}\sum_{t=1}^{T}\sum_{j=t+1}^{T}\ell_j$.

Start from Lemma 2 and bound $|R(x,y)|$.

Replace the future likelihood‑ratio term with a TV bound.

Bound the sampled‑token term by the token‑level divergence.

Combine the two bounds and simplify.

Under the CPPO constraints $w_t D_t \le c_t$, the performance gap satisfies $J(\pi) - J(\mu) \ge L'_{\mu}(\pi) - 2\xi T (T-1)\bar{\ell}\,\delta_b$, where $\bar{\ell}=\max_t c_t/w_t$ and $\delta_b$ is the centered prefix slack.

Step 1 – Reduce the residual to a weighted sum of token‑level divergences.

Step 2 – Translate prefix‑budget constraints into expectations.

Step 3 – Apply Abel summation to bound the weighted sum.

Combine with Lemma 2 to obtain the final bound.

Table 2 classifies existing trust‑region variants by the divergence statistic they monitor and the point of constraint application, highlighting CPPO’s unique combination of token‑level divergence with prefix‑sum budgeting.

The algorithmic contrast—TRM masks entire sequences while CPPO masks individual token updates—illustrates why CPPO can enforce position‑aware thresholds without sacrificing per‑token granularity.

Complementary recent works (NFPO, FIPO, entropy‑based analyses, soft‑masking) modify the policy objective along other axes; CPPO’s contribution is orthogonal to them, focusing on how divergence budgets are allocated across positions and prefixes.

Soft Trust-Region Gates

Soft‑gate variants of CPPO and their ablations are described and formalized.

This appendix details the soft‑gate variant of CPPO, the gradient‑scaling interpretation, the mixture‑policy construction used for formal guarantees, and the default inverse gate compatible with SAPO schemes.

Instead of abruptly zeroing a token’s gradient when it violates its trust‑region, we smoothly attenuate the gradient proportionally to how close the token is to the violation boundary.

How does this soft‑gate differ from simply clipping the loss value?

Clipping would truncate the loss at a fixed threshold, discarding any information about how far the token is beyond the bound. The soft‑gate multiplies the loss by a value in [0, 1] that smoothly reflects the degree of violation, preserving gradient direction while limiting magnitude.

To prove a formal TV‑distance guarantee for a soft‑gate, we blend the baseline policy $\mu$ with the learned policy $\pi$ using a token‑wise mixing coefficient $g_s$.

Why isn’t the mixture policy actually deployed in training?

Implementing $\pi_g$ would require sampling from a convex combination of two policies at every step, which adds overhead without empirical benefit; the soft‑gate we use directly scales the loss, achieving the same effect more efficiently.

The inverse‑gate $g_{\text{inv}}(x)=\min(1,1/x)$ satisfies the admissibility condition $x\,g(x)\le1$ and therefore works as a drop‑in soft gate for CPPO.

What would happen if we used a gate that does not satisfy $x\,g(x)\le1$?

The scaled loss could exceed the hard‑mask bound, breaking the theoretical TV guarantee and potentially causing uncontrolled gradient explosions.
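The admissibility condition is easy to check numerically. In the sketch below, $x$ is treated as a generic positive argument (reading it as a violation ratio is an assumption), `inverse_gate` implements $g_{\text{inv}}(x)=\min(1, 1/x)$, and `sqrt_gate` is a hypothetical counter-example that decays too slowly.

```python
def inverse_gate(x):
    """g_inv(x) = min(1, 1/x) for x > 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    return min(1.0, 1.0 / x)


def sqrt_gate(x):
    """Hypothetical gate min(1, x ** -0.5): x * g(x) grows like sqrt(x)."""
    if x <= 0:
        raise ValueError("x must be positive")
    return min(1.0, x ** -0.5)


def is_admissible(gate, xs, tol=1e-12):
    """Check the condition x * g(x) <= 1 on a set of sample points."""
    return all(x * gate(x) <= 1.0 + tol for x in xs)


grid = [0.1 * i for i in range(1, 101)]  # sample points in (0, 10]
print(is_admissible(inverse_gate, grid), is_admissible(sqrt_gate, grid))
```

For the inverse gate, $x\,g(x)$ equals $x$ below the boundary and saturates at 1 above it. For the slower gate, $x\,g(x)=\sqrt{x}$ exceeds 1 as soon as $x>1$, which is exactly the failure mode described above.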

Conclusion and Related Work

We recap CPPO’s advantages and position it among existing trust‑region methods.

Uniform token‑level thresholds treat every position identically, ignoring that early deviations cascade through the autoregressive generation process. By starting from the finite‑horizon performance‑difference identity we obtain a prefix‑constrained policy‑improvement bound that makes these effects explicit. CPPO implements this bound with two mechanisms: a decreasing position weight $w_t$ that tightens the threshold early, and a cumulative prefix budget $\delta_b$ that caps later divergence once the weighted prefix average rises.

Both mechanisms operate solely through the masking decision, so CPPO reuses the PPO/GRPO ratio‑advantage objective and the same per‑token divergence as DPPO, introducing no extra loss term. This reuse preserves the computational profile of existing token‑level trust‑region methods while altering where the trust region is applied.

Across four Qwen3 configurations (dense vs. MoE models and Base vs. post‑trained checkpoints), CPPO attains the best validation AIME24/25/26 Avg@16 scores. Ablation studies show that performance degrades when either the position weight or the prefix budget is removed, confirming that both components contribute to the gain.

Sampled‑ratio methods such as PPO, GRPO, Dr. GRPO, REINFORCE++, DAPO, CISPO, GSPO, GMPO, MinPRO, and GPG all control updates via one or more sampled importance ratios, providing only a single‑sample estimate of policy shift. In contrast, distributional‑divergence trust regions replace the ratio test with a token‑level divergence $D_t$ (e.g., TV or KL) and typically apply a uniform threshold, as in DPPO and the TRM family.

Classical trust‑region methods (Kakade and Langford 2002; Schulman et al. 2015; Achiam et al. 2017; Peters et al. 2010; Abdolmaleki et al. 2018; Song et al. 2019) constrain KL in standard RL, while later trust‑region‑guided PPO variants adjust clip thresholds or rollback behavior using divergence information (Engstrom et al. 2020; Andrychowicz et al. 2021). These works modify how the trust region is measured, whereas CPPO modifies where the allowed movement can accumulate under a fixed token‑level divergence.

Thus CPPO’s contribution is a prefix‑budgeted, position‑aware token mask that operates on a fixed token‑level divergence statistic, rather than proposing a new divergence measure. By weighting early tokens more heavily and capping the cumulative prefix, CPPO addresses the autoregressive asymmetry that uniform trust regions overlook.

Supplementary Proofs

This section records auxiliary theoretical results that support the CPPO analysis.

Proposition 5 shows that the linear schedule $w_t = 1 - c(t-1)$ yields a strictly decreasing sequence $r_t = 4\xi \bar{\ell}\, g_t$, satisfying the monotonicity condition required by Theorem 1.

Corollary 6 specializes the bound to the uniform token‑level threshold $\bar{\ell} = \delta$, leading to the constant $C_{\text{uniform}} = \frac{2\xi T}{(T-1)}\,\delta^{2}$ and the ratio $\frac{C_{\text{CPPO}}}{C_{\text{uniform}}}= \frac{\delta_b}{\delta}$, so the CPPO constant is smaller whenever $\delta_b < \delta$.

Corollary 7 extends the analysis to a position‑dependent threshold $\bar{\ell}= \delta / w_{\min}$, yielding $C_{\text{impl}} = \frac{2\xi T}{(T-1)}\,\frac{\delta}{w_{\min}}\,\delta_b$ and a tighter bound whenever $\delta_b < \delta w_{\min}$.

Proposition 8 quantifies the slack $\eta$ introduced by the initial prefix, giving the bound $\Delta(\mu,\pi) \le \frac{2\xi T}{(T-1)}\,\bar{\ell}\,\delta_b + \eta + \frac{4\xi \bar{\ell}}{T}\,(w_1^{-1}-1)$; for the linear schedule $w_1=1$, the extra term vanishes.

Lemma B.9 introduces the product‑form suffix $\beta_t = 1 - \prod_{j=t+1}^{T} (1 - \ell_j)$ and shows that $\Delta(\mu,\pi) \le \frac{4\xi \delta_b}{T}\sum_{t=1}^{T} \lambda_\beta(t)$, yielding an $O(T\delta_b)$ bound without requiring $\delta_b = O(1/T)$.
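The difference between the additive suffix bound $\sum_{j>t}\ell_j$ and the product form $\beta_t$ can be seen numerically. The sketch below is illustrative only: the function names and the constant thresholds $\ell_j = 0.05$ over 200 positions are hypothetical values chosen to show the effect on long suffixes.

```python
import math


def suffix_sum_bound(ell, t):
    """Additive bound: sum of ell_j over j > t (t counts tokens already
    generated, so t = 0 covers the whole sequence)."""
    return sum(ell[t:])


def suffix_product_bound(ell, t):
    """Product form: beta_t = 1 - prod_{j > t} (1 - ell_j)."""
    return 1.0 - math.prod(1.0 - l for l in ell[t:])


ell = [0.05] * 200  # hypothetical per-token thresholds
for t in (0, 100, 190):
    additive = suffix_sum_bound(ell, t)
    product = suffix_product_bound(ell, t)
    print(t, round(additive, 3), round(product, 3))
```

The product form never exceeds 1, whereas the additive bound grows linearly with the remaining horizon and becomes vacuous for long suffixes. On short suffixes the two nearly coincide, since $1-\prod(1-\ell_j) \le \sum \ell_j$ with near equality when the $\ell_j$ are small.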

The technical lemmas collect standard identities: $\|P - Q\|_{\text{TV}} = \tfrac12\|P - Q\|_{1}$, the likelihood‑ratio expectation $\mathbb{E}_{X\sim q}[p(X)/q(X)] = 1$ under common support, the coupling characterization of total variation, the weighted averaging inequality $\frac{\sum_t a_t}{\sum_t w_t} \le \max_t \frac{a_t}{w_t}$ for positive weights $w_t$, and Abel summation $\sum_{t=1}^{n} r_t \Delta_t = r_n S_n + \sum_{t=1}^{n-1} (r_t - r_{t+1}) S_t$ with $S_t := \sum_{i\le t}\Delta_i$.
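Abel summation is the identity that lets prefix constraints control a weighted sum. With partial sums $S_t=\sum_{i\le t}\Delta_i$, it reads $\sum_{t=1}^{n} r_t\Delta_t = r_n S_n + \sum_{t=1}^{n-1}(r_t-r_{t+1})S_t$. The sketch below verifies it on random data; the function name and the test values are illustrative.

```python
import random
from itertools import accumulate


def abel_rhs(r, d):
    """Right-hand side of Abel summation:
    r_n * S_n + sum_{t < n} (r_t - r_{t+1}) * S_t, with S_t the prefix sums of d."""
    S = list(accumulate(d))
    total = r[-1] * S[-1]
    for t in range(len(r) - 1):
        total += (r[t] - r[t + 1]) * S[t]
    return total


random.seed(1)
n = 10
# A decreasing, non-negative coefficient sequence and arbitrary increments.
r = sorted((random.random() for _ in range(n)), reverse=True)
d = [random.uniform(-1.0, 1.0) for _ in range(n)]
lhs = sum(rt * dt for rt, dt in zip(r, d))
rhs = abel_rhs(r, d)
print(abs(lhs - rhs) < 1e-12)
```

When the coefficients $r_t$ are non-negative and decreasing, every factor multiplying a prefix sum $S_t$ on the right is non-negative, so an upper bound on each prefix sum translates directly into an upper bound on the whole weighted sum.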

**Figure 7.** Warm-up diagnostics for the three Base-model CPPO runs. Each column is one model. Top: mean effective $\delta_b$ after the per-sequence warm-up calibration. Bottom: fraction of masked tokens rejected by the prefix-budget condition.

Training Configuration and Diagnostics

Implementation details, hyper‑parameters, and full training diagnostics.

The training stack uses the verl‑compatible GRPO/DAPO trainer with group‑normalized advantages, the `mask_std_0` filter (skipping prompts whose advantage variance is zero), no entropy regularizer, and no KL‑to‑reference penalty. Optimization is performed by AdamW with a peak learning rate of 1.

At each iteration we sample a batch of prompts, unroll $n$ responses per prompt under the current policy $\mu$, and split the rollout batch into gradient minibatches of size $\text{train\_bs}$ for the policy update. Rollout sampling uses the actor’s untruncated softmax; validation rollouts employ temperature $0.7$ and top‑$p=0.95$ with 16 samples per prompt.

Table 3 lists the per‑model rollout, update, and trust‑region configuration. “Prompts” is the prompt count per rollout iteration, $n$ the number of responses per prompt, and “updates” the number of gradient minibatches per rollout iteration. The token‑level threshold scale $\delta$ is shared by DPPO and CPPO; the $\delta_b^{\min}$ column shows the minimum value used by the warm‑up calibration, whose upper bound is $2\delta_b^{\min}$, and $w_{\min}$ is the weight floor.

**Table 3.** Per-model rollout, update, and trust-region configuration. "Prompts" is the prompt count per rollout iteration, $n$ is the number of responses per prompt, and "updates" is the number of gradient minibatches per rollout iteration. The token-level threshold scale $\delta$ is shared by DPPO and CPPO; the listed $\delta_b^{\min}$ values are the minimum values used by the Base-model warm-up calibration, whose upper bound is dynamically bounded at $2\delta_b^{\min}$; $w_{\min}$ is the weight floor.

Evaluation is performed on three AIME benchmarks (AIME24, AIME25, AIME26). For each benchmark we report Avg@16, the success rate over 16 sampled completions per prompt under the validation decoding configuration. The overall score is the unweighted mean of the three per‑benchmark Avg@16 values.
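The scoring rule can be written down in a few lines. This is a sketch under stated assumptions: the function names are illustrative, and the correctness matrix and the per-benchmark numbers in the usage lines are made-up toy values, not results from the paper.

```python
from statistics import mean


def avg_at_k(correct):
    """Avg@k for one benchmark, in percent.

    `correct[i][j]` is 1 if sampled completion j for prompt i was verified
    correct and 0 otherwise; each row holds k samples (k = 16 here).
    """
    return 100.0 * mean(mean(row) for row in correct)


def overall_score(per_benchmark):
    """Unweighted mean of the per-benchmark Avg@k values."""
    return mean(per_benchmark.values())


# Toy data: one prompt solved in 8 of 16 samples, one solved in none.
toy = [[1] * 8 + [0] * 8, [0] * 16]
print(avg_at_k(toy))
print(overall_score({"AIME24": 30.0, "AIME25": 20.0, "AIME26": 10.0}))
```

Because the overall score averages the three benchmark-level numbers rather than pooling all prompts, each benchmark carries equal weight regardless of how many problems it contains.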

Baselines are grouped by the signal they use to decide whether a token update stays inside the trust region. GRPO and CISPO operate directly on the sampled importance ratio $\rho_t$ (CISPO uses asymmetric clip thresholds). MinPRO is a prefix‑ratio baseline that weights the sampled ratio by a non‑cumulative minimum‑prefix‑ratio surrogate. DPPO applies a uniform reduced‑TV token‑level mask with the per‑model $\delta$ from Table 3. CPPO uses the same $\delta$ inside its weighted and prefix constraints.

Batched mask computation follows Algorithm 1. For a valid‑token mask $M$, padding positions are zeroed in the weights and in $Z = w\,D$. A cumulative sum along the response dimension yields tensors $W$ and $S$ for all positions; a one‑token right shift produces $W^{-}$ and $S^{-}$ so that the threshold at token $t$ uses only the preceding prefix. The final binary mask multiplies into the token‑level ratio‑advantage terms of the PPO‑style clipped objective.
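A vectorised version of this computation might look like the NumPy sketch below. It is an illustration rather than the authors' code: the function name is hypothetical, and it assumes right-padded sequences, a per-sequence linear weight schedule based on each response's own length, and a guard for length-1 responses.

```python
import numpy as np


def cppo_mask_batched(D, valid, delta, delta_b, w_min):
    """Vectorised CPPO mask for right-padded batches.

    D and valid have shape (batch, T_max); valid is 1 on real tokens.
    """
    valid = valid.astype(D.dtype)
    lengths = valid.sum(axis=1, keepdims=True)      # per-sequence T
    pos = np.arange(D.shape[1])[None, :]            # t - 1 for each column
    denom = np.maximum(lengths - 1.0, 1.0)          # assumed guard for T = 1
    w = (1.0 - pos / denom * (1.0 - w_min)) * valid  # zero on padding
    Z = w * D
    W = np.cumsum(w, axis=1)
    S = np.cumsum(Z, axis=1)
    # One-token right shift so position t only sees the preceding prefix.
    W_prev = np.pad(W, ((0, 0), (1, 0)))[:, :-1]
    S_prev = np.pad(S, ((0, 0), (1, 0)))[:, :-1]
    cap = np.minimum(delta, delta + delta_b * W_prev - S_prev)
    return ((Z <= cap) * valid).astype(np.int64)


D = np.array([[0.05, 0.12, 0.18, 0.22, 0.30],
              [0.50, 0.01, 0.00, 0.00, 0.00]])
valid = np.array([[1, 1, 1, 1, 1],
                  [1, 1, 0, 0, 0]])
M = cppo_mask_batched(D, valid, delta=0.2, delta_b=0.15, w_min=0.4)
print(M)
```

The cumulative sums replace the sequential loop, so the whole batch is processed with a handful of array operations. In the second row the first token exceeds the static cap and the second is rejected by the prefix budget, while padding positions are always zero in the output.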

During Base‑model warm‑up, the average token‑level divergence is initially large but decays rapidly. To avoid masking early exploratory tokens, the sequence‑independent prefix budget is replaced by a per‑sequence dynamic budget $\delta_b^{\text{seq}} = \operatorname{clamp}\bigl(\operatorname{Quantile}(D_{1:T},0.9),\ \delta_b^{\min},\ 2\delta_b^{\min}\bigr)$, where the 90th percentile of the sequence's divergences adapts the constraint to current statistics and the bounds $\delta_b^{\min}$ and $2\delta_b^{\min}$ are those described with Table 3.
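The calibration rule can be sketched with NumPy as follows. The function name is hypothetical, the upper bound of twice the minimum is taken from the Table 3 description, and applying the quantile to the raw divergences of a single response (with NumPy's default linear interpolation) is an assumption.

```python
import numpy as np


def warmup_prefix_budget(D, delta_b_min, q=0.9):
    """Per-sequence prefix budget: the q-quantile of the token divergences,
    clamped to the interval [delta_b_min, 2 * delta_b_min]."""
    quantile = np.quantile(np.asarray(D, dtype=float), q)
    return float(np.clip(quantile, delta_b_min, 2.0 * delta_b_min))


quiet = warmup_prefix_budget([0.001] * 20, delta_b_min=0.05)
noisy = warmup_prefix_budget([1.0] * 20, delta_b_min=0.05)
middle = warmup_prefix_budget(np.linspace(0.0, 0.1, 11), delta_b_min=0.05)
print(quiet, noisy, round(middle, 4))
```

Sequences with uniformly small divergence fall back to the minimum budget, sequences with very large divergence are capped at twice the minimum, and anything in between tracks the 90th percentile of that sequence's own divergences.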

Full training diagnostics (Figures 8–11) separate the three validation components (AIME24/25/26) and report training reward, response length, and relative log‑probability error. The reward curves are smoothed for readability; the other panels expose stability checks rather than objectives.

**Figure 8.** Complete training diagnostics for Qwen3-1.7B (post-trained). Top: AIME24, AIME25, and AIME26 validation Avg@16. Bottom: training reward, response length, and relative log-probability error. Only reward is smoothed for readability.
