Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu

DRIFT localizes deep-research agent errors by tracking consequential claims rather than just evaluating final outcomes.

How can we automatically localize the specific steps where deep-research agents fail within their long, multi-step trajectories?

Deep-research agents solve tasks through long, complex trajectories, but current evaluation methods only check if the final answer is correct. This leaves the actual source of failure—often an early, unsupported claim that later steps treat as fact—hidden within the noise of normal exploration. DRIFT audits these trajectories by constructing a claim ledger that tracks when a belief is introduced, whether it is supported by evidence, and where it becomes consequential for the final answer. It marks spans as errors only when they commit to or propagate unsupported claims that affect the solution path. On the TELBENCH benchmark, DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points compared to bare LLM inspection.

Paper Primer

Deep-research trajectories are decision processes where agents form commitments about entities and constraints. Because agents often reuse these earlier claims without revalidation, a single unsupported premise can cascade into a final failure, making it impossible to diagnose the root cause by looking at the final answer alone.

DRIFT is a claim-centric auditing framework: it maps the trajectory into semantic spans, builds a ledger of agent commitments, and uses dependency tracing to isolate spans that rely on unsupported or conflicting evidence. It acts like a forensic accountant: it ignores benign exploration and focuses only on the "ledger entries" (claims) that the agent later treats as settled facts.

DRIFT significantly outperforms generic agentic auditing frameworks in localizing harmful error spans.

Comparison across five model families on the TELBENCH benchmark. Up to 30 percentage point improvement in first-error accuracy and span-level F1.

Process-level errors are not equivalent to final-answer failure.

Analysis of the 2,790-trajectory corpus. 36.9% of successful trajectories contain at least one annotated error span, while 97.3% of failed trajectories contain at least one.

Why is direct LLM prompting insufficient for diagnosing these trajectories?

Direct prompting is unstable because LLMs often mistake benign exploration for errors or over-focus on the final answer, failing to identify the specific early commitment that caused the trajectory to become unreliable.

What is the scope of the TELBENCH benchmark?

TELBENCH consists of 1,000 verified instances of deep-research trajectories, split into easy and hard categories, designed to test a model's ability to localize the earliest harmful error span among distracting noise like failed searches or tentative hypotheses.

Reliability in deep-research agents requires moving from outcome-based evaluation to process-level auditing. Researchers should prioritize tracking the lifecycle of agent claims to identify where unsupported commitments first enter the reasoning chain.

The Problem of Opaque Agent Trajectories

Evaluating only final answers hides early errors, so we argue for span‑level auditing of agent trajectories.

Evaluating deep‑research agents only by their final answer tells us whether they succeeded, but it obscures the specific reasoning steps that caused a failure. Because trajectories are long and heterogeneous, directly asking a language model to find errors in the full log is unreliable—it often confuses harmless exploration with genuine mistakes.

Agents generate extended decision processes; judging them solely on the outcome prevents us from seeing which intermediate claim first went wrong, so we need a method that pinpoints harmful spans within the trajectory.

The key shift is moving from outcome‑based evaluation to process‑based auditing of agent trajectories.

TELBENCH: A Dataset for Error Localization

We contrast prior outcome‑level benchmarks with our span‑level TELBENCH dataset.

Recent agent benchmarks have shifted from static QA to long‑horizon tasks, but they still evaluate only final outcomes. In contrast, TELBENCH provides span‑level annotations that let models pinpoint where a trajectory first goes wrong.

**Figure 1.** Data curation pipeline for TELBENCH, covering trajectory collection, log normalization, semantic-span segmentation, LLM-assisted candidate labeling, and expert-verified error annotation.

**Figure 2.** Mechanism analysis of annotated TELBENCH trajectories, showing error families, workflow-stage distributions, first-error patterns across settings, temporal positions, and Verified-1K coverage.

TELBENCH is a curated collection of expert‑annotated agent trajectories, each broken into semantic spans and labeled as error or non‑error to enable span‑level error localization.

The DRIFT Auditing Workflow

DRIFT audits claim‑centric trajectories to pinpoint error spans without extra annotations.

After gathering full agent trajectories and annotating span errors, we must locate the faulty spans using only the task question and the raw span texts. Existing approaches rely on external judgments or gold labels, which are unavailable at deployment. DRIFT solves this by auditing the claims embedded in the trajectory.

DRIFT treats a trajectory as a ledger of claims, checks each claim’s support, and traces dependencies to flag the spans that commit to unsupported claims — like a courtroom docket where every allegation is recorded, its evidence examined, and any unjustified verdict traced back to the moment it was entered.

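Consider a three-span trajectory in which s₁ retrieves a document, s₂ extracts a date from it, and s₃ states the final answer. Claim c₁ is recorded when s₁ runs; its first consequence appears at s₂, where the extracted date depends on the retrieved document.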

Claim c₂ is recorded at s₂; its consequence is s₃, the final answer.

The ledger L = {c₁, c₂} now captures both the exploratory retrieval and the committed extraction.

Support Seeker later checks that s₂ provides direct evidence for “date X is correct”.

Dependency Tracer sees that c₂ is consequential and used in s₃; if support is missing, s₃ is flagged as an error span.

The ledger makes explicit which early exploratory steps become decisive, allowing the system to ignore harmless detours and focus on the claims that actually drive the answer.

How does DRIFT differ from classifying each span independently?

Independent span classification treats every span as an isolated decision, ignoring that many spans are merely exploratory. DRIFT groups spans by the claims they support, so only claims that become commitments are examined for errors, which reduces false positives on harmless exploration.

Given a task question q and an ordered list of spans T = (s₁,…,sₙ), DRIFT predicts the subset of spans that are erroneous by evaluating a binary predicate h(sⱼ) that flags spans committing to unsupported claims.

Claim Keeper builds the ledger, where each entry records (claim, introduction span i, first-consequence span b, reuse set U, type τ, status σ): c₁ = (“X exists”, i₁=1, b₁=3, U₁={3,4}, τ₁=Plan, σ₁=exploratory), c₂ = (“Y contains X”, i₂=3, b₂=4, U₂={4}, τ₂=Decide, σ₂=consequential).

Support Seeker marks c₂ as MISSING because no span provides evidence that Y actually contains X.

Dependency Tracer sees that c₂ is consequential and used in s₄, so h(s₄)=1.

Only s₄ is flagged as an error span; earlier exploratory spans are ignored.

Even though the agent performed many steps, DRIFT isolates the single span that propagated an unsupported claim to the final answer.
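The worked example above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Claim` fields mirror the ledger tuple described in the text, while the class names, the `RISKY` set, and the rule "flag every reuse span of a consequential claim whose support is risky" are assumptions made here for clarity.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Support(Enum):
    DIRECT = "direct"
    WEAK = "weak"
    MISSING = "missing"
    CONFLICTING = "conflicting"


@dataclass
class Claim:
    text: str
    introduced_at: int       # i: span where the claim first appears
    first_consequence: int   # b: span where it first affects the path
    reused_in: set[int]      # U: spans that reuse the claim
    stage: str               # tau: type label, e.g. "Plan" or "Decide"
    status: str              # sigma: "exploratory" or "consequential"
    support: Optional[Support] = None  # filled in by the support check


# Assumption: which support statuses count as risky is a choice made here.
RISKY = {Support.WEAK, Support.MISSING, Support.CONFLICTING}


def flag_error_spans(ledger: list[Claim]) -> set[int]:
    """Return the span indices j for which h(s_j) = 1 under this simplified rule."""
    flagged: set[int] = set()
    for claim in ledger:
        if claim.status == "consequential" and claim.support in RISKY:
            flagged |= claim.reused_in
    return flagged


ledger = [
    Claim("X exists", 1, 3, {3, 4}, "Plan", "exploratory"),
    Claim("Y contains X", 3, 4, {4}, "Decide", "consequential",
          support=Support.MISSING),
]
flagged = flag_error_spans(ledger)
print(sorted(flagged))  # [4]
```

The exploratory claim c₁ never contributes to the flagged set, which is the property the example is meant to show: exploration is ignored unless a claim becomes consequential and lacks support.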

Why doesn’t DRIFT need judge results or gold labels as input?

DRIFT’s error detection is based entirely on internal claim consistency: it checks whether each claim is backed by evidence already present in the trajectory. External judgments would be redundant, and they are unavailable at test time.

Claim Keeper: scan the full trajectory and record every decision‑relevant claim together with its introduction, first consequence, reuse set, type, and status.

Support Seeker: for each consequential claim, locate supporting evidence in the trajectory and assign a support status (DIRECT, WEAK, MISSING, CONFLICTING).

Dependency Tracer: propagate risk from weak or missing claims through the reuse graph, marking any span that commits to, reuses, amplifies, or finalizes such a claim as an error.

**Figure 3.** Overview of DRIFT: a claim-centric auditing workflow that builds trajectory-level claim ledgers, verifies support, and traces claim dependencies to localize first and follow-up errors.

**Table 3.** Operation stage taxonomy used for trajectory span annotation. Each span is assigned exactly one stage, regardless of whether it is an error span. The stage label describes the functional role of the span in the agent trajectory rather than its correctness.

Performance Evaluation

DRIFT outperforms baselines across model families on TELBENCH.

DRIFT attains the highest overall macro‑F1 across the evaluated model families, e.g., 58.45 % on GPT‑5.4, a gain of 22.33 percentage points over the bare baseline.

The bare baseline is a straightforward audit that inspects the full trajectory without any auxiliary reasoning or claim tracking.

We evaluated five model families on TELBENCH using four diagnostic frameworks; each framework processed the same question‑span inputs and output error‑span indices, with three repetitions per setting.

**Table 2.** Easy/hard split results. All numbers are percentages. P, R, and F1 are macro-averaged; FEA denotes first-error accuracy. Superscripts show absolute improvements over the bare baseline.
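To make the reported metrics concrete, the following sketch computes span-level precision, recall, F1, and first-error accuracy (FEA) from predicted and annotated error-span indices. The exact definitions used in the paper are not given in this summary, so two assumptions are made here: macro-averaging is taken per trajectory, and a first-error hit means the earliest predicted span index equals the earliest annotated one. All function names are hypothetical.

```python
def span_prf(pred: set, gold: set) -> tuple:
    """Precision, recall, and F1 over error-span indices for one trajectory."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def first_error_hit(pred: set, gold: set) -> bool:
    """True when the earliest predicted span equals the earliest annotated span."""
    return bool(pred) and bool(gold) and min(pred) == min(gold)


def macro_scores(examples: list) -> tuple:
    """Average per-trajectory scores; returns (P, R, F1, FEA) as percentages."""
    rows = [span_prf(pred, gold) for pred, gold in examples]
    n = len(rows)
    p, r, f1 = (100 * sum(col) / n for col in zip(*rows))
    fea = 100 * sum(first_error_hit(pred, gold) for pred, gold in examples) / n
    return p, r, f1, fea


# Toy (predicted, annotated) pairs, not taken from TELBENCH.
examples = [({4}, {4}), ({2, 7}, {1, 7}), (set(), {3})]
p, r, f1, fea = macro_scores(examples)
print(round(p, 2), round(r, 2), round(f1, 2), round(fea, 2))  # 50.0 50.0 50.0 33.33
```

In the toy data only the first trajectory localizes the first error, so FEA is one in three even though half of the error spans are recovered on average.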

**Figure 4.** Overall macro-F1 on TELBENCH.

DRIFT consistently outperforms bare baselines across all model families.

Robustness and Sensitivity Analysis

We assess DRIFT’s robustness across span complexity, modules, efficiency, and error types.

Increasing span complexity harms both Bare and DRIFT, but DRIFT’s degradation is milder, especially on high‑span trajectories where errors are more dispersed.

Across all span buckets, DRIFT outperforms Bare, confirming that structured trajectory auditing preserves localization ability under longer semantic contexts.

The ablation compares four configurations: the bare full‑trajectory baseline, adding the Claim Keeper (A), then Claim Keeper + Support checking (A + B), and finally the complete DRIFT pipeline with dependency tracing.

Performance rises stepwise; the Claim Keeper yields the biggest jump, while support checking and dependency tracing provide incremental gains in evidence grounding and span precision.

Efficiency analysis shows DRIFT largely lies on the Pareto frontier, improving F1 without a proportional token increase; the only outlier is Gemini, whose DRIFT run spends over half the tokens on thinking, inflating its average token budget.
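The Pareto-frontier claim can be checked mechanically. The sketch below, with a hypothetical `pareto_frontier` helper, keeps every configuration that no other configuration beats on both axes (fewer or equal tokens and higher or equal F1, with at least one strict). The numbers are invented for illustration and are not the paper's measurements.

```python
def pareto_frontier(points: list) -> list:
    """Return names of configs not dominated on (lower tokens, higher F1)."""
    frontier = []
    for name, tokens, f1 in points:
        dominated = any(
            t <= tokens and f >= f1 and (t < tokens or f > f1)
            for _, t, f in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (configuration, average tokens, macro-F1): illustrative values only.
points = [
    ("bare-small", 10_000, 30.0),
    ("drift-small", 14_000, 48.0),
    ("bare-large", 20_000, 36.0),
    ("drift-large", 26_000, 58.0),
]
frontier = pareto_frontier(points)
print(frontier)  # ['bare-small', 'drift-small', 'drift-large']
```

Here `bare-large` drops out because `drift-small` reaches a higher F1 with fewer tokens, which is the kind of relationship the efficiency analysis describes.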

Span‑level recall on the twelve most frequent error categories reveals that Bare models miss many evidence‑ and constraint‑related failures, whereas DRIFT raises recall consistently, with the strongest lifts on source verification, constraint semantics, and omitted checks.

**Figure 5.** Further analysis of DRIFT. We examine robustness across model scale and span complexity, then verify that the gains come from the proposed modules and remain competitive under token cost.

**Figure 6.** Span-level recall across frequent error types. DRIFT improves coverage especially on evidence- and constraint-related failures.

**Figure 7.** Annotation interface for expert span-level adjudication. The console shows the ordered semantic spans, LLM-assisted candidate errors, editable rationales, and final expert decisions.

Error Distribution and Burden

We quantify error distribution, stage risk, and effort across agents and introduce the error taxonomy.

Across the full 2,790‑trajectory corpus, 97.3 % of failed runs contain at least one annotated error span, while 36.9 % of successful runs still contain an error span, showing that agents can sometimes recover from local mistakes.

**Figure 8.** Basic error-burden statistics of annotated trajectories. We compare final failed and successful trajectories by whether they contain any annotated error span, the number of error spans per trajectory, the composition of error versus non-error spans, and the overall error-span density across benchmarks, frameworks, and model families.

To separate how often a stage appears from how risky it is, we compute a normalized error rate per operation stage (error spans ÷ total spans in that stage).

Retrieval dominates the raw span count but has the lowest normalized error rate (2.9 %). By contrast, decision‑making and finalization are far riskier, with error rates of 60.5 % and 51.8 % respectively; compute spans sit in the middle at 25.9 %.
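The normalization described above is a per-stage ratio of error spans to total spans. A minimal sketch, assuming each annotated span is available as a `(stage, is_error)` pair (a hypothetical input format) and using toy counts rather than the corpus statistics:

```python
from collections import Counter


def stage_error_rates(spans) -> dict:
    """Map each stage to the percentage of its spans annotated as errors."""
    total = Counter()
    errors = Counter()
    for stage, is_error in spans:
        total[stage] += 1
        errors[stage] += int(is_error)
    return {stage: 100 * errors[stage] / total[stage] for stage in total}


# Toy annotations: Retrieve is frequent but rarely wrong, Decide is rare but risky.
spans = (
    [("Retrieve", False)] * 9 + [("Retrieve", True)]
    + [("Decide", True)] * 3 + [("Decide", False)] * 2
)
rates = stage_error_rates(spans)
print(rates)  # {'Retrieve': 10.0, 'Decide': 60.0}
```

Dividing by the stage's own span count is what keeps a frequent stage such as retrieval from looking error-prone merely because it appears often.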

**Figure 9.** Stage-normalized error rates across operation stages. Bars show the percentage of spans in each stage that are annotated as errors, while the gray line reports the total number of spans assigned to that stage. This normalization separates stages that are common in trajectories from stages that are intrinsically more error-prone.

Effort profiles (average steps, spans, and tool calls) vary markedly across benchmark‑model‑framework combinations. MiroFlow tends to generate longer trajectories, especially with GPT on BrowseComp, while OAgent keeps trajectories short but sometimes issues many tool calls.

**Figure 10.** Effort profiles across benchmarks, frameworks, and models. Average trajectory steps, annotated spans, and tool calls are reported for each benchmark-model-framework combination. Colors denote model families, while hatching distinguishes frameworks. The y-axes are piecewise-compressed above 400 steps, 20 spans, and 100 tool calls to preserve visibility of lower-effort settings.

Each trajectory span receives an operation‑stage label independent of correctness: Plan, Retrieve, Verify, Extract, Compute, Decide, Recover, or Finalize.

Errors are further grouped into six fault families—Constraint Handling, Search and Retrieval, Evidence Grounding, Entity Mapping, Information Processing, and Process Control—each containing several primary fault types.

The error taxonomy was induced in three rounds: (1) LLMs generated free‑form rationales for each error span; (2) a hierarchical map‑reduce clustering produced candidate types; (3) manual normalization merged duplicates and defined 18 primary faults.

Jointly using stage and fault labels lets us distinguish, for example, a “Retrieve”‑stage error caused by “Goal Drift” from a “Retrieve”‑stage error caused by a “Candidate Scope Error”.

**Figure 11.** Ablation of Modules. Each module brings better performance.

Tables 4 and 5 list concrete error‑type annotations and the corresponding reasoning chains for two example trajectories, illustrating how the taxonomy is applied in practice.

The token‑consumption table (Table 5) reports prompt and completion token totals per model‑framework configuration, providing a coarse proxy for computational cost.

Ablation Study

Qualitative ablations reveal how error‑propagation, unsupported evidence, and scope mistakes affect trajectories.

The ablation study confirms that each added component—Claim Keeper, Support Seeker, and the full DRIFT pipeline—yields monotonic gains in precision, recall, and F1 across all four base models.

Case Study 1 examines a snooker‑match retrieval trajectory that introduces a wrong candidate (the 2021 UK Championship Final) before all constraints are checked, then repeatedly reinforces this mistaken branch, leading to a final answer that violates the loser‑player professional‑year condition.

Bare‑style systems typically flag only the final winner/loser mismatch (span s007), whereas DRIFT‑augmented models correctly identify the entire error chain (spans s001, s003, s007, s008), demonstrating the value of span‑level auditing for early‑candidate errors.

Case Study 2 presents a trajectory that arrives at the correct essay title but builds an unsupported evidence chain: a worker claim of seven talks is adopted without visible verification, and the main agent propagates this unsupported claim to the final report.

Models that ignore intermediate evidence (bare baselines) miss all three error spans, while DRIFT‑enabled variants recover the unsupported‑claim spans (s003‑s005), highlighting that correct answers alone do not guarantee trustworthy reasoning.

Case Study 3 involves a visual‑retrieval trajectory that mistakenly treats a partial fruit list (watermelon, pears, lemons) as exhaustive, then later asserts that bananas are absent, causing a scope error and an impossible‑answer claim.

DRIFT‑enhanced agents detect both the premature scope limitation (s004) and the unsupported impossibility claim (s007), whereas bare systems often return no error spans, underscoring the need for explicit candidate‑set verification.

Across diverse domains, DRIFT’s span‑level auditing consistently surfaces early‑stage mistakes that would otherwise remain hidden in final‑answer‑only evaluations.

Implementation Details

This section lists the system and user prompts that drive DRIFT and the bare baseline.

This appendix enumerates the exact prompts fed to DRIFT’s modules and to the bare evaluation baseline. Each prompt enforces a JSON‑only response schema so that downstream processing can compare outputs reliably.
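Because every prompt enforces a JSON-only reply, downstream code can parse and validate outputs uniformly. The prompt schemas themselves are not reproduced in this summary, so the sketch below assumes a hypothetical reply of the form `{"error_spans": [...]}` with 1-based integer span indices; the field name and the fallback behaviour are illustrative choices, not the paper's specification.

```python
import json


def parse_error_spans(raw: str, num_spans: int) -> list:
    """Parse a JSON-only reply and return sorted, in-range span indices."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # treat non-JSON output as "no prediction"
    if not isinstance(payload, dict):
        return []
    spans = payload.get("error_spans", [])  # hypothetical field name
    if not isinstance(spans, list):
        return []
    valid = {
        s for s in spans
        if isinstance(s, int) and not isinstance(s, bool) and 1 <= s <= num_spans
    }
    return sorted(valid)


print(parse_error_spans('{"error_spans": [4, 2, 4, 99]}', num_spans=8))  # [2, 4]
print(parse_error_spans("Sure, here are the errors...", num_spans=8))    # []
```

Dropping duplicates and out-of-range indices before scoring keeps a malformed reply from being counted as extra false positives, which makes comparisons across frameworks more even.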

Common system prompt (used by all modules)

Prompt 0 – Common System Prompt (full text)

Prompt 1 – Bare Evaluation (baseline)

Prompt 2 – A: Claim Keeper

Prompt 3 – B: Broad Support Seeker

Prompt 4 – C: Specialist Auditor Gate

Prompt 5 – Dependency Backtrace (final reducer)
