RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

RREDCoT redistributes rewards across reasoning traces using the model's own predictive distribution to improve sample efficiency.

Reasoning models trained with Reinforcement Learning (RL) typically receive a single reward only after generating a complete Chain-of-Thought (CoT) trace, leaving individual reasoning steps without direct supervision. RREDCoT approximates optimal reward redistribution by using the language model itself to estimate the value of intermediate reasoning segments, avoiding the high computational cost of Monte Carlo sampling. This approach significantly improves sample efficiency and performance in long-context reasoning tasks compared to standard Group Relative Policy Optimization (GRPO).

Paper Primer

The core mechanism hinges on adapting the RUDDER (Reward Decomposition for Delayed Rewards) framework to the CoT generation Markov Decision Process. By calculating the difference in expected future utility between consecutive segments, the method assigns credit to specific reasoning steps without requiring additional models or extra generation steps.

To make this tractable, the authors introduce a hybrid keyword-entropy segmentation strategy that splits traces into meaningful chunks. The model then uses an importance-sampling estimator to weigh these segments based on their contribution to the final correct answer.

RREDCoT improves sample efficiency and final performance in long-context reasoning tasks.

Empirical evaluation on the Numina-CoT dataset with 25k-token generation lengths shows consistent performance gains over standard GRPO. The method achieves these gains while requiring significantly less compute than Monte Carlo-based credit assignment.

Why is standard RL fine-tuning for reasoning models inefficient?

Standard methods assign a single reward to the entire CoT trace only after the final answer is produced, providing no direct supervision for the intermediate reasoning steps that lead to that answer.

What is the primary advantage of RREDCoT over Monte Carlo sampling for credit assignment?

Monte Carlo sampling requires generating many additional sequences to estimate intermediate values, which is computationally prohibitive for long CoT traces; RREDCoT uses the model's own predictive distribution to estimate these values without extra generation.

RREDCoT provides a computationally tractable way to inject fine-grained supervision into reasoning models, making it a viable drop-in replacement for uniform reward distribution in RLVR pipelines.

Abstract and Introduction

Identifies delayed, high‑variance rewards in CoT RL fine‑tuning and motivates reward redistribution.

The RL fine‑tuning of reasoning language models assigns reward only after a full Chain‑of‑Thought (CoT) trace, yielding a delayed, high‑variance signal. Existing approaches such as Group Relative Policy Optimization (GRPO) rely on Monte‑Carlo estimates, which become noisy and computationally costly for long traces. This lack of fine‑grained supervision hampers learning, motivating a credit‑assignment strategy that redistributes reward to important trace segments.

**Figure 1.** Overview of the RREDCoT algorithm for reward redistribution. To compute a reward redistribution $\sigma$: (1) generate a CoT trace, (2) segment the trace, (3) compute each segment's immediate reward, and (4) use these rewards to derive the final redistribution.

**Figure 5.** Correlation of PRM prediction to the different components of RREDCoT attribution. PRM models seem to be more correlated to NLL changes of the thought NLL than to the change of NLL of the reference answer even when conditioned on it.

**Figure 6.** Correlation between the PRM and our credit assignment approach on 100 examples from open-rs dataset. Green indicates that the ultimate answer was correct according to the verifiers, while red points are incorrect. The x and y axes are the attribution by the PRM and our method, accordingly. The CoT traces were generated using Deepseek-R1-Qwen2.5-7B-Distill model while the PRM was computed using Qwen2.5-Math-PRM-7B. Our attribution method results in better correlation to the ground truth than the PRM approach, even when the PRM is conditioned on the correct answer.

CoT Generation as RL

We cast chain‑of‑thought generation as a sequential decision process with delayed rewards.

Chain‑of‑thought (CoT) generation produces long reasoning traces but only the final answer receives a reward, making learning inefficient. The delayed, sparse feedback hampers credit assignment and leads to high variance in policy gradients.

Generation is viewed as a step‑by‑step decision process where each reasoning segment is an action and the model only gets reward after producing the final answer.

**Figure 2.** (Right) The MDP structure of autoregressive CoT generation. At each point, either another segment $u_{t+1}$ or the answer $y$ can be generated. The state $s_t$ at each point consists of the entire sequence generated so far and the original input: $\{x, u_1 \dots u_{t-1}\}$. (Left) Hybrid segmentation approach. (a) keyword segmentation; (b) iterative merging of the lowest total entropy pairs; (c) merging stops when criteria is met (number of segments); (d) the segment entropies are homogenized.

Reward Integration

Integrating Reward Redistribution into Commonly Used RL Objectives

The RREDCoT redistribution approximation derived so far depends onlyon the properties of the CoT generation MDP. This means that it can be integrated into any RL objective so long as it is applied to a problem of that MDP structure. One important condition of reward redistribution is return‑equivalence, meaning that the original episode return must be preserved in the reward redistribution SDP. We further reformulate the generalized policy gradient objective from Shao et al. [2024] to include the redistribution of rewards:

$$ \nabla_{\theta} J_{A}(\theta)=\mathbb{E}_{(q,o)\sim D}\; \sigma(t,q,o,\pi_{rf})\; GCA(q,o,\pi_{rf})\; \nabla_{\theta}\log \pi_{\theta}(o_{t}\mid q,o_{<t}) \tag{11} $$

where $\sigma(t,q,o,\pi_{rf})$ is the token‑wise reward coefficient that depends on the reference model and question‑answer pair. The policy gradient formulations attribute reward uniformly over the whole sequence (Eq. 12):

$$ \sigma(t,q,o,\pi_{rf})=\frac{1}{|o|} \tag{12} $$

$\sigma$ must sum up to 1 in order to satisfy the return equivalence from Arjona‑Medina et al. [2019]. We note that this formulation does not bias the original objective according to Theorem 1 in Arjona‑Medina et al. [2019] if the condition holds. This leads to the same optimal policy. In practice, the diversity of data and breadth of autoregressive predictive distribution mean that full convergence (i.e., perfect knowledge of the general reasoning) is unattainable. As a consequence, speeding up the convergence rate would mean de facto improvement of final performance.

Several GRPO improvements use the normalizer $\sigma$ for reward shaping without adhering to these conditions, therefore changing the optimal policy. For example, BNPO [Xiao et al., 2025] uses $\sigma=\frac{1}{|B|}$ where $B$ is the number of tokens in a batch. DR‑GRPO [Liu et al., 2025b] uses $\sigma=\frac{1}{M}$ where $M$ is the maximum number of tokens allowed in the batch.

Table 1: RREDCoT performance improvements on Numina‑CoT [Li et al., 2024] dataset with long generation length (25 k tokens). RREDCoT yields greater improvement than GRPO.

Model

RREDCoT Method

RREDCoT reshapes reward redistribution via hybrid segmentation and PR‑style value estimation.

Delayed rewards make credit assignment in chain‑of‑thought (CoT) traces expensive; existing RUDDER‑style redistribution needs thousands of model evaluations per trace.

First split the CoT trace at generic delimiters, then iteratively merge neighboring pieces that together have the lowest entropy until a target number of exit points is reached, ensuring each segment is as self‑contained as possible.

Initial split on newlines yields eight single‑token segments.

Compute pairwise combined entropies; the lowest is tokens 1+2 (0.5 bits).

Merge tokens 1 and 2 → segment A with entropy 0.5.

Repeat: next lowest pair is tokens 5+6 (0.9 bits); merge → segment B.

Continue merging until only three segments remain: A (0.5), C (1.8 + 2.0 = 3.8), D (1.5 + 0.1 = 1.6).

By merging low‑entropy neighbors we keep the high‑entropy region (tokens 3‑4) as a distinct exit point, which carries the most information for reward estimation.

Why not simply split on punctuation without the entropy‑based merging?

Pure punctuation splits ignore the model’s uncertainty; low‑entropy tokens are already predictable and provide little signal, so merging them prevents wasting credit‑assignment budget on trivial pieces.

Treat the reference answer‑solution pairs as a proposal distribution and reweight them by the model’s own probabilities to obtain an unbiased estimate of the value term.

Compute importance weight for $(y_1,u_1)$: $\frac{0.6\times0.5}{0.5}=0.6$.

Compute importance weight for $(y_2,u_2)$: $\frac{0.2\times0.4}{0.5}=0.16$.

Apply weights to utilities: $0.6\times1=0.6$, $0.16\times0=0$.

Average over $N=2$: $\hat{v}^{\text{our}}_w(s_t)=\frac{0.6+0}{2}=0.3$.

Even though the second pair contributes no utility, its presence reduces the estimate because the uniform proposal dilutes the weight of the high‑utility pair.

Does the PR‑style estimator always underestimate the true value?

Yes, under the non‑negative utility assumption the bias term is the negative contribution of all answer‑step pairs omitted from the reference set, so the estimator is biased low unless the set captures every non‑zero‑utility sequence.

By combining the hybrid segmentation with the PR‑style estimator we obtain a tractable credit‑assignment pipeline that respects the optimal redistribution condition while keeping computational cost proportional to the number of high‑entropy exit points.

Experiments

We evaluate RREDCoT’s efficiency, bias, and correlation with attribution methods across benchmarks.

RREDCoT‑Nano attains 0.858 on MATH500, surpassing Open‑RS by +0.058 and setting the highest score among the three models.

Table 2 reports 0.800 for Open‑RS and 0.858 for RREDCoT‑Nano on MATH500.

**Figure 7.** Standard Deviation of the MC value estimator. The values were obtained by bootstrapping the original values from 20 completions. Vertical axis is the number of MC samples - the number of rollouts selected for value estimation. The horizontal axis is the number of exit points where the value was evaluated.

**Figure 3.** Proportion of rollouts lost by the truncated MC sampling with respect to truncation length. (a) MATH-500 dataset; (b) AIME-25 dataset. Computing MC sampling estimates in (b) took 100 GPU-hours for the 30 questions given and maximum number of segments of 40. In both cases we see that truncation of the MC generations leads to loss of substantial amount of the completions, including those that lead to correct answers.

**Figure 4.** Correlation between LOO, Gradient attribution, MC sampling and RREDCoT credit assignment techniques. On the x-axis we have the percent of the trace used from the right, e.g., a value of 0.25 means that from each trace, the first 25 Blue lines are correlation coefficients (left scale) while the red lines are their corresponding log(p-values) (right scale). All values were computed using the model that produced the CoT. The gradient norm and LOO, being attribution methods, correlate highly with each other, while RREDCoT, being a credit assignment method, is most similar to MC sampling, especially at the later stages.

Appendices

Details on datasets, models, and equations exposing key challenges.

The training set inherited from open‑rs contains a 15 % label error rate, especially among questions marked “Hard” that lack correct step‑by‑step solutions. This noise, combined with the need to evaluate on diverse benchmarks and the computational cost of large‑scale sampling experiments, creates a substantial barrier to reliable learning and fair assessment.

These estimators aim to approximate the expected return of a solution path, but when many valid solutions exist they converge to the same value, undermining their usefulness.

Questions & answers

What is the main contribution of RREDCoT?

RREDCoT introduces a segment-level reward redistribution method for reasoning models trained with reinforcement learning, assigning credit to individual Chain-of-Thought reasoning steps rather than applying a single reward to the entire trace, thereby improving sample efficiency and performance on long-context reasoning tasks compared to standard GRPO.

What problem does RREDCoT address?

RREDCoT addresses the credit assignment problem in RL fine-tuning of reasoning language models, where a single delayed reward is assigned only after a full Chain-of-Thought trace is generated, leaving individual reasoning steps without direct supervision and producing high-variance, noisy policy gradients.

Why is standard RL fine-tuning for reasoning models inefficient?

Standard methods assign a single reward to the entire CoT trace only after the final answer is produced, providing no direct supervision for the intermediate reasoning steps that lead to that answer, which hampers learning and leads to high variance in policy gradients.

What theoretical framework does RREDCoT build on?

RREDCoT adapts the RUDDER (Reward Decomposition for Delayed Rewards) framework to the CoT generation Markov Decision Process, computing the difference in expected future utility between consecutive reasoning segments to assign credit to specific steps.

How does RREDCoT estimate intermediate reward values without extra model calls?

Instead of Monte Carlo sampling, which requires generating many additional sequences, RREDCoT uses the language model's own predictive distribution together with an importance-sampling (PR-style) estimator to weigh reasoning segments based on their contribution to the final correct answer, keeping computational cost proportional to the number of high-entropy exit points.

What is the hybrid keyword-entropy segmentation strategy?

The hybrid keyword-entropy segmentation strategy splits CoT traces into meaningful chunks by combining keyword-based boundary detection with entropy-based merging: low-entropy tokens that are already highly predictable are merged with adjacent segments to avoid wasting the credit-assignment budget on trivial pieces.

Why not simply split CoT traces on punctuation without entropy-based merging?

Pure punctuation splits ignore the model's uncertainty; low-entropy tokens are already predictable and provide little signal, so merging them prevents wasting the credit-assignment budget on trivial pieces.

Does the PR-style importance-sampling estimator produce unbiased reward estimates?

No; under the non-negative utility assumption the estimator is biased low, because the bias term equals the negative contribution of all answer-step pairs omitted from the reference set, meaning the estimator underestimates the true value unless the reference set captures every non-zero-utility sequence.

What condition must the reward redistribution satisfy to preserve the original learning objective?

The redistribution must satisfy return-equivalence, meaning the original episode return must be preserved in the redistributed reward signal; the token-wise reward coefficient σ must sum to 1, which ensures the optimal policy is unchanged according to Theorem 1 of Arjona-Medina et al. (2019).

How does RREDCoT differ from GRPO improvements such as BNPO and DR-GRPO?

BNPO and DR-GRPO use the normalizer σ for reward shaping without adhering to the return-equivalence condition (e.g., σ = 1/|B| or σ = 1/M), which changes the optimal policy; RREDCoT's redistribution is designed to satisfy return-equivalence and therefore does not bias the original objective.

What dataset and experimental setting are used to evaluate RREDCoT?

RREDCoT is evaluated on the Numina-CoT dataset (Li et al., 2024) with long generation lengths of up to 25,000 tokens; the paper notes that the training set inherited from open-rs contains a 15% label error rate, particularly among questions marked 'Hard' that lack correct step-by-step solutions.

What are the key experimental results reported for RREDCoT?

The paper reports that RREDCoT yields greater performance improvement than GRPO on the Numina-CoT dataset with long generation lengths (25k tokens); specific numeric results are presented in Table 1, but the paper does not reproduce those exact figures in the provided text beyond stating RREDCoT outperforms GRPO.

What are the limitations of RREDCoT acknowledged in the paper?

The paper acknowledges that the PR-style estimator is biased low unless the reference set captures every non-zero-utility sequence, that the training data contains a 15% label error rate which hampers reliable learning, and that the computational cost of large-scale sampling experiments creates a substantial barrier to fair assessment.

Can RREDCoT be integrated into RL objectives other than GRPO?

Yes; the paper states that the RREDCoT redistribution approximation depends only on the properties of the CoT generation MDP and can be integrated into any RL objective applied to that MDP structure, making it a viable drop-in replacement for uniform reward distribution in RLVR pipelines.

Does RREDCoT require additional models or extra generation steps?

No; RREDCoT uses the language model itself to estimate the value of intermediate reasoning segments, avoiding the need for additional models or extra generation steps beyond the original CoT trace.

Where is the RREDCoT paper published and who are the authors?

The paper is available on arXiv at https://arxiv.org/abs/2606.06475; the paper does not specify author names or a venue in the provided text.

Key terms

Chain-of-Thought (CoT)
A multi-step reasoning trace generated by a language model before producing a final answer, intended to make intermediate reasoning explicit.
Reinforcement Learning from Verifiable Rewards (RLVR)
A training paradigm in which a language model is fine-tuned using reinforcement learning with rewards derived from verifiable correctness signals such as answer accuracy.
Group Relative Policy Optimization (GRPO)
A reinforcement learning algorithm for language models that estimates policy gradients by comparing rewards across a group of sampled outputs rather than using a separate value network.
RUDDER (Reward Decomposition for Delayed Rewards)
A framework that redistributes a delayed episode reward back to individual time steps by computing differences in expected future return, enabling more direct credit assignment.
Credit assignment
The problem of determining which specific actions or reasoning steps in a sequence were responsible for a final outcome or reward.
Reward redistribution
The process of decomposing a single delayed reward into step-level signals so that each intermediate action receives a portion of the total credit.
Return-equivalence
A condition requiring that the sum of redistributed step-level rewards equals the original episode return, ensuring the optimal policy is not changed by the redistribution.
Monte Carlo sampling
A method of estimating expected values by generating many random samples, used in RL to approximate intermediate value functions but computationally expensive for long sequences.
Importance-sampling (PR-style) estimator
A statistical estimator that reweights samples from one distribution to approximate expectations under another, used here to estimate segment utility without generating additional sequences.
Hybrid keyword-entropy segmentation
A strategy that splits CoT traces into segments using keyword-based boundaries and then merges low-entropy (highly predictable) segments to focus credit assignment on informative reasoning steps.
Token-wise reward coefficient (σ)
A per-token weighting factor applied during policy gradient computation that determines how much of the total reward is attributed to each token in the output sequence.
Markov Decision Process (MDP)
A mathematical framework for sequential decision-making in which an agent takes actions in states to maximize cumulative reward, used here to model CoT generation.
BNPO
A GRPO variant proposed by Xiao et al. (2025) that uses batch-size normalization (σ = 1/|B|) for reward shaping, which the paper notes changes the optimal policy by violating return-equivalence.
DR-GRPO
A GRPO variant proposed by Liu et al. (2025b) that normalizes rewards by the maximum token count in a batch (σ = 1/M), also noted to violate return-equivalence.
Numina-CoT
A dataset of chain-of-thought reasoning examples (Li et al., 2024) used in the paper's experiments, which the paper notes contains a 15% label error rate in its training split.
Policy gradient
A class of reinforcement learning algorithms that optimize a policy by computing the gradient of expected reward with respect to the policy's parameters.

Read the original paper

Open the simplified reader on Paperglide