LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

LongTraceRL improves long-context reasoning by using agent-derived distractors and entity-level process rewards.

How can we improve long-context reasoning in LLMs by using agent-derived distractors and fine-grained rubric rewards instead of just outcome-based signals?

Large language models struggle with long-context reasoning because they often fail to distinguish key information from distractors, and because their training typically relies on sparse, outcome-only rewards that mask faulty intermediate reasoning. The authors introduce LongTraceRL, which constructs challenging training data using search agent trajectories to create tiered distractors and applies an entity-level rubric reward to supervise reasoning steps. On five long-context benchmarks, this approach consistently outperforms strong baselines, with a 4B model achieving an average gain of 5.7 points over its base version.

Paper Primer

The method hinges on two innovations: using search agent behavior to generate "high-confusability" distractors and enforcing a positive-only rubric reward. Documents that an agent opened but did not cite are reused as distractors, so the model learns to filter out relevant-looking information that does not support the answer. The rubric reward, in turn, requires the model to explicitly reference gold entities along the reasoning chain to earn a score.

LongTraceRL significantly improves reasoning performance across multiple model scales.

Evaluation on five benchmarks (AA-LCR, MRCR, FRAMES, LongBench v2, LongReason) shows consistent gains over base models and existing RL baselines. Qwen3-4B-Thinking-2507 improved by +5.7 points on average, surpassing the strongest baseline (LongRLVR) by +2.5 points.

The rubric reward is the primary driver of performance gains.

Ablating the rubric reward (LongTraceRL-GRPO) on the 4B backbone dropped the average score from 59.0 to 53.7, nearly erasing the improvement.

Why is the "positive-only" reward strategy necessary?

Without it, the model can "hack" the rubric reward by simply enumerating entities mentioned in the context rather than using them for genuine reasoning. Restricting the rubric reward to only those responses that reach the correct final answer forces the model to ground its reasoning in the evidence.

How do these distractors differ from standard random sampling?

Standard methods sample distractors randomly, so the resulting documents are often off-topic and easy to filter. LongTraceRL uses documents that a search agent actually opened during a successful search, creating "high-confusability" distractors that are topically relevant and require deeper reasoning to distinguish from the gold evidence.

Researchers working on long-context LLMs should shift from outcome-only supervision to entity-level process rewards and prioritize distractor quality by simulating realistic search agent behavior rather than relying on random document sampling.

The Long-Context Reasoning Challenge

We explain why outcome‑only rewards fail in long contexts and introduce LONGTRACERL to address those failures.

Long‑Context Reasoning enables large language models to locate and combine key information across extensive, distracting texts, a prerequisite for multi‑hop inference and coherent long‑form generation. In practice, as the context length grows, models frequently hallucinate, cite irrelevant passages, or rely on fragmented retrieval, making long‑text reasoning a major deployment bottleneck.

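For a 64,000-token context, a dense self-attention score matrix contains $64{,}000^2 \approx 4.1\times10^{9}$ entries.

At 4 bytes per entry, the memory required is $4.1\times10^{9}\times4\text{ B} \approx 16\text{ GB}$ for that single matrix, which strains or exceeds the memory of a typical GPU.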

This illustrates why naïve attention becomes infeasible for truly long contexts.

The quadratic memory growth forces models to truncate or approximate, which in turn limits their ability to reason over the full document.
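The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes one dense score matrix stored at 4 bytes per entry (for example float32); the function name `attention_matrix_bytes` is hypothetical, and real systems differ in precision, head count, and batching.

```python
def attention_matrix_bytes(seq_len, bytes_per_entry=4):
    """Bytes needed to materialize one dense seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_entry


n = 64_000
entries = n * n
size_gb = attention_matrix_bytes(n) / 1e9
print(f"{entries:.2e} entries, {size_gb:.1f} GB")  # 4.10e+09 entries, 16.4 GB

# Doubling the context length quadruples the cost.
print(attention_matrix_bytes(2 * n) / attention_matrix_bytes(n))  # 4.0
```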

Reinforcement Learning with Verifiable Rewards (RLVR) has shown promise for guiding LLMs toward correct answers, but existing long‑context RL methods suffer from two critical flaws. First, they rely on distractors that are easy to distinguish from the relevant documents, providing little confusability. Second, they use outcome‑only rewards, which are sparse and can be satisfied by lucky guesses without sound intermediate reasoning.

LONGTRACERL addresses both issues. For data, we generate multi‑hop questions via knowledge‑graph random walks and construct tiered distractors from real search‑agent trajectories: Tier‑1 documents are those the agent read but did not cite (high confusability), while Tier‑2 are retrieved but never opened (low confusability). For supervision, we introduce a rubric reward that evaluates each gold entity along the reasoning chain, applied only to responses that obtain the correct final answer (positive‑only strategy), thereby distinguishing high‑quality reasoning from mere answer‑guessing.

**Figure 1.** Comparison between prior long-context RL approaches based on easy distractors and outcome-only rewards, and our proposed LONGTRACERL.

Prior Approaches to Long-Context Training

We survey prior approaches to long‑context data synthesis and reinforcement learning.

Long‑Context Synthetic Data generation hinges on two decisions: how questions are built and how the surrounding context is assembled.

For question construction, some works recycle short‑context multi‑hop QA sets such as MuSiQue and HotpotQA, while others generate questions de novo; a newer line leverages structured knowledge graphs (e.g., QwenLong‑L1.5 samples multi‑hop paths, DeepDive walks Wikipedia’s graph).

Context assembly ranges from using a single long document to extending short‑context QA with additional distractor documents; early distractors are sampled randomly and are easy to filter, whereas Next‑Long applies hard‑negative mining based on dense‑retrieval similarity.

Long‑Context Reinforcement Learning builds on RLVR, which excels on self‑contained tasks but offers only outcome‑only rewards, providing no guidance for intermediate reasoning over extensive inputs.

Recent efforts add finer supervision: chunk‑level rewards ($F_\beta$ scores), dense process‑level signals via co‑evolving reward models, and information‑gain measures (LongR); agentic RL also introduces citation‑aware rubric items and tool‑call utility classification, yet all operate at coarse granularity, leaving entity‑level supervision unexplored.

The LongTraceRL Framework

How LONGTRACERL builds hard distractors and rewards reasoning step‑by‑step.

Long‑context reasoning models stumble because training data contain only trivial distractors and the reward signals ignore the reasoning process.

**Figure 2.** Overview of the LONGTRACERL training data construction pipeline.

Generate a multi‑hop entity path via a controlled random walk on the Wikipedia hyperlink graph.

Synthesize a question that forces step‑by‑step reasoning over the entire path.

Collect several search trajectories from a capable agent and keep the successful ones.

Extract Tier‑1 and Tier‑2 distractors from the trajectories, then shuffle all documents to reach the target length L.
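The first step of the pipeline can be sketched as a random walk that never revisits an entity. This is an illustration only: the toy `graph`, the `sample_entity_path` helper, and the no-revisit rule are assumptions, since this summary does not spell out what "controlled" means in the paper's pipeline.

```python
import random


def sample_entity_path(graph, start, num_hops, rng):
    """Walk up to num_hops links from start without revisiting an entity.

    graph maps an entity to the entities its page links to (toy data here).
    The returned path is shorter than num_hops + 1 if the walk hits a dead end.
    """
    path = [start]
    visited = {start}
    for _ in range(num_hops):
        candidates = [n for n in graph.get(path[-1], []) if n not in visited]
        if not candidates:
            break  # dead end: no unvisited outgoing link
        nxt = rng.choice(candidates)
        path.append(nxt)
        visited.add(nxt)
    return path


# Hypothetical hyperlink graph; real entity names would come from Wikipedia.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
path = sample_entity_path(graph, "A", num_hops=3, rng=random.Random(0))
print(path)  # a 4-entity path that starts at "A" and ends at "E"
```

A question synthesized from such a path would then need every intermediate entity to be resolved in order, which is what makes the later entity-level rubric meaningful.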

Instead of sampling random paragraphs, we reuse documents that a search agent actually opened (Tier‑1) or merely saw in results (Tier‑2), because those are the pieces that naturally confuse a reasoner.

The agent opens the Tier‑1 article, reads the sentence “City X lost the final in 2022”, but does not cite it.

The Tier‑2 headline appears in the result list but is never opened.

During assembly, the Tier‑1 article is placed before the gold passage, while the Tier‑2 headline is appended later.

The final context length reaches L = 2048 tokens after shuffling.

Tier‑1 distractors are hard because the agent judged them relevant enough to open and read, so they closely resemble gold evidence; Tier‑2 distractors add surface noise from documents that were retrieved but never read.
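The tiering and assembly steps described above can be sketched as follows. The trajectory record layout (`retrieved`, `opened`, `cited`), both helper names, and the choice to fill the token budget with Tier-1 documents before Tier-2 are assumptions made for illustration, not details confirmed by the paper.

```python
import random


def tier_distractors(trajectory, gold_ids):
    """Split a trajectory's documents into Tier-1 and Tier-2 distractors.

    trajectory is a hypothetical record holding three lists of document ids:
    'retrieved' (shown in search results), 'opened' (read by the agent),
    and 'cited' (used in the agent's final answer).
    """
    opened = set(trajectory["opened"])
    cited = set(trajectory["cited"])
    gold = set(gold_ids)
    # Tier-1: opened but not cited (high confusability).
    tier1 = [d for d in trajectory["opened"] if d not in cited and d not in gold]
    # Tier-2: retrieved but never opened (low confusability).
    tier2 = [d for d in trajectory["retrieved"] if d not in opened and d not in gold]
    return tier1, tier2


def assemble_context(gold_ids, tier1, tier2, doc_tokens, target_len, rng):
    """Add distractors (Tier-1 first) while they fit the budget, then shuffle."""
    docs = list(gold_ids)
    used = sum(doc_tokens[d] for d in docs)
    for d in tier1 + tier2:
        if used + doc_tokens[d] > target_len:
            continue  # skip documents that would overflow the target length
        docs.append(d)
        used += doc_tokens[d]
    rng.shuffle(docs)
    return docs, used


trajectory = {
    "retrieved": ["g1", "d1", "d2", "d3"],
    "opened": ["g1", "d1"],
    "cited": ["g1"],
}
doc_tokens = {"g1": 500, "d1": 600, "d2": 700, "d3": 400}
tier1, tier2 = tier_distractors(trajectory, gold_ids=["g1"])
docs, used = assemble_context(["g1"], tier1, tier2, doc_tokens, 2048, random.Random(0))
print(tier1, tier2, sorted(docs), used)
```

In this toy run `d3` is dropped because adding it would exceed the 2048-token budget.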

How does this differ from simply sampling random Wikipedia paragraphs?

Random sampling ignores the agent’s intent; the sampled paragraph may be irrelevant or trivially easy. Agent‑derived distractors are chosen because the agent considered them useful enough to open, so they share vocabulary and topic with the question, making them genuinely confusable.

We give the model a fine‑grained score for how many gold entities it actually mentions, encouraging step‑by‑step grounding rather than just guessing the final answer.

Count matches: 2 out of 3 gold entities appear.

Compute raw recall: $\hat{r}_{rb}=2/3\approx0.67$.

Assume the group’s best raw score is $0.80$; the normalized reward is $r_{rb}=\hat{r}_{rb}/0.80=(2/3)/0.80\approx0.83$.

The rubric rewards partial grounding, yet still distinguishes a response that captures most of the reasoning chain.
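The two steps of that calculation, entity recall followed by normalization against the best score in the sampled group, can be sketched in Python. Case-insensitive substring matching and both function names are assumptions of this sketch; the paper may detect entity mentions differently, and the entity names below are toy data.

```python
def rubric_recall(response, gold_entities):
    """Fraction of gold entities mentioned in the response.

    Matching is a case-insensitive substring test (an assumption of this sketch).
    """
    if not gold_entities:
        return 0.0
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)


def normalize_by_group_best(raw_scores):
    """Divide each raw score by the best raw score in the sampled group."""
    best = max(raw_scores, default=0.0)
    if best == 0.0:
        return [0.0 for _ in raw_scores]
    return [score / best for score in raw_scores]


gold = ["Ada Lovelace", "Charles Babbage", "Analytical Engine"]
response = "Ada Lovelace corresponded with Charles Babbage about his machine."
raw = rubric_recall(response, gold)  # 2 of 3 gold entities are mentioned

# Same setting as the worked numbers: the group's best raw score is 0.80.
normalized = normalize_by_group_best([raw, 0.80])
print(raw, normalized)
```

Normalizing within the group keeps the rubric term on a comparable scale across questions whose reasoning chains differ in length.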

Why not just reward exact answer match instead of counting entities?

Exact match ignores the process; a model could guess the answer without any reasoning. The rubric forces the model to surface the intermediate evidence, which is essential for reliable long‑context reasoning.

The outcome reward is a binary signal that tells the learner whether its final short answer is correct.

What happens if the model gets the answer right but never mentions any gold entity?

The outcome reward will be 1, but the rubric contribution will be 0: the correct answer makes the response eligible for rubric credit, yet no gold entity is matched, so its entity recall is zero. The combined reward therefore reflects both correctness and grounding.
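A small sketch makes the interaction between the two signals concrete. The additive form `outcome + alpha * rubric` and the function name `combined_reward` are assumptions, because this summary does not give the exact combination formula; only the rubric weight $\alpha$ and the positive-only gate come from the text.

```python
def combined_reward(outcome, rubric, alpha=0.3, positive_only=True):
    """Outcome reward plus a weighted rubric term.

    outcome: 1.0 if the final answer is judged correct, else 0.0.
    rubric:  normalized rubric score in [0, 1].
    The additive combination is an assumption of this sketch.
    """
    if positive_only and outcome != 1.0:
        # Positive-only gate: a wrong answer earns no rubric credit.
        return outcome
    return outcome + alpha * rubric


# Correct answer, no gold entity mentioned: outcome credit only.
print(combined_reward(1.0, 0.0))
# Correct answer, every gold entity mentioned: outcome plus the full rubric bonus.
print(combined_reward(1.0, 1.0))
# Wrong answer that merely lists entities: gated to zero.
print(combined_reward(0.0, 1.0))
# Without the gate, the same entity-listing response would still earn reward.
print(combined_reward(0.0, 1.0, positive_only=False))
```

The last two calls show why the gate matters: without it, a response could collect rubric reward by enumerating entities while still answering incorrectly.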

Experimental Setup

We train LongTraceRL on 128K‑token contexts for 200 RL iterations on 32 H800 GPUs.

LongTraceRL converges within 200 RL iterations.

Training runs for 200 iterations and reaches a stable policy on the long‑context QA task.

Qwen3‑4B‑Thinking‑2507 is a dense 4B‑parameter language model fine‑tuned for multi‑step reasoning tasks.

DeepSeek‑R1 is a distilled 8B‑parameter model derived from a larger teacher, offering a compact yet capable reasoning baseline.

Main Results

LONGTRACERL outperforms baselines across all model scales on long‑context benchmarks.

LONGTRACERL outperforms all baselines on every model scale, achieving an average score of 59.0 on the Qwen3‑4B‑Thinking‑2507 backbone.

Table 1 shows LONGTRACERL’s average of 59.0, a +5.7‑point improvement over the base model and +2.5 points over the strongest baseline LongRLVR.

**Table 1.** Main results on long-context reasoning benchmarks.

**Table 2.** Performance of LONGTRACERL with different rubric reward weight $\alpha$.

Table 3 compares LONGTRACERL across four distractor strategies and shows that more confusable distractors (e.g., traj‑tiered) lead to higher scores. Separately, the model is reported to self‑regulate its response length and truncation behavior.

LONGTRACERL consistently outperforms baselines across all model scales.

Ablation Studies

Ablations isolate the impact of rubric weighting, distractor design, and reward polarity on LongTraceRL performance.

Setting $\alpha=0.3$ yields the highest overall average score of 59.0.

Table 2 shows the 0.3 setting outperforms 0.1 and 0.5 across all five benchmarks.

When $\alpha$ is reduced to 0.1, AA‑LCR drops to 39.2.

The same sweep reports a 2.6‑point decline on the reasoning‑intensive AA‑LCR benchmark.

Increasing $\alpha$ to 0.5 lowers the overall average to 57.1.

Table 2 records a consistent performance dip across all tasks under the stronger rubric weight.

Trajectory‑tiered distractors achieve a 50.03 % macro overlap with rubric entities, far above random’s 1.35 %.

Table 4 quantifies the overlap; the tiered strategy’s macro average is 50.03 % versus 1.35 % for the random baseline.

The positive & negative reward variant reduces AA‑LCR to 36.2.

Table 5 shows the AA‑LCR score for the variant is 36.2, a 5.6‑point drop relative to the positive‑only version.

**Figure 4.** Rubric, outcome and combined raw reward dynamics for the two reward strategies.

**Table 4.** Statistics on how much distractors overlap with rubric entities. Higher ratios indicate harder distractors. #Distr.: number of distractor documents. #w/ Rub.: number of distractor documents containing $\ge 1$ rubric entity. Ent-Recall: average fraction of rubric entities appearing in a distractor. Micro/Macro Avg: the ratio $\frac{\#\,\text{w/ Rub.}}{\#\,\text{Distr.}}$, aggregated globally (micro) or per sample then averaged (macro).
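The difference between the micro and macro averages in Table 4 is easiest to see on a toy input. The sketch below assumes each training sample is summarized as a pair (number of distractors containing at least one rubric entity, number of distractors); the function name `overlap_ratios` and the numbers are illustrative and are not taken from the paper.

```python
def overlap_ratios(samples):
    """Micro and macro averages of (#w/ Rub.) / (#Distr.).

    samples is a list of (num_with_rubric, num_distractors) pairs,
    one pair per training sample.
    """
    total_hits = sum(hits for hits, _ in samples)
    total_docs = sum(count for _, count in samples)
    # Micro: pool all distractors first, then take one global ratio.
    micro = total_hits / total_docs if total_docs else 0.0
    # Macro: take the ratio per sample, then average the ratios.
    per_sample = [hits / count for hits, count in samples if count > 0]
    macro = sum(per_sample) / len(per_sample) if per_sample else 0.0
    return micro, macro


samples = [(5, 10), (1, 40)]  # toy numbers
micro, macro = overlap_ratios(samples)
print(micro, macro)  # 0.12 0.2625
```

The micro average is dominated by samples with many distractors, while the macro average weights every sample equally, which is why the two can differ noticeably.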

**Table 5.** Performance of LONGTRACERL with different reward strategies.

Supplementary Material

Appendix provides supplementary details, limitations, ethical notes, and additional analyses.

We introduce LONGTRACERL, a framework that builds challenging training data from trajectory‑based distractors and adds an entity‑level rubric reward for fine‑grained process supervision.

Experiments on five long‑context benchmarks across three model families show consistent gains over prior long‑context RL methods.

Further analysis isolates the contribution of each design choice, confirming their effectiveness.

The appendix also outlines the study’s limitations, ethical considerations, and full reference list.

Our data pipeline relies exclusively on the KILT Wikipedia snapshot, which may restrict reasoning pattern diversity despite successful transfer to downstream domains.

The quality of agent‑derived distractors depends on the capabilities of the search‑agent used; stronger or weaker agents could alter distractor difficulty.

All models and datasets are publicly available under permissive licenses, and the method does not involve human subjects or raise dual‑use concerns.

References include the primary LONGTRACERL papers, related long‑context RL work, and benchmark datasets.

Table 6 compares training datasets: LONGTRACERL generates questions via knowledge‑graph random walks with up to eight hops, yielding deeper reasoning chains than prior methods.

Case studies illustrate how the rubric reward guides the model to visit each gold entity in order and resolve ambiguities such as conflicting cues or pronoun references.

Prompt templates for multi‑hop QA generation and outcome reward judgement are provided in the supplementary material.

The task description specifies the input format (entity path and descriptions) and output format (question, answer, entity list) for dataset construction.

The judge component extracts the final answer from a model response and reports a binary match against the ground truth.

