FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents
Jia Deng, Yimeng Chen, Xiaoqing Xiang, Ziyang Zeng, Shuo Tang, Wayne Xin Zhao, Feng Chang, Chuan Hao, Yuan Wei, Ran Tao, Bryan Dai, Ji-Rong Wen
FORT synthesizes shortcut-resistant search tasks to force deep evidence acquisition in LLM agents.
How can we synthesize search tasks that force agents to perform multi-step evidence gathering rather than relying on "shortcuts" (e.g., guessing from partial information)?
Deep search agents often fail to perform long-horizon evidence gathering because training tasks contain "shortcuts"—clues that allow the model to guess the answer or find it through a single, cheap retrieval step. FORT (Framework of Shortcut-Resistant Training-Data Synthesis) constructs training data by building an internal evidence graph and applying adversarial refinement to ensure that every question forces a multi-step, non-obvious search trajectory. By training on this data, FORT-Searcher achieves the highest performance among comparable-size open-source agents on challenging deep search benchmarks.
Paper Primer
Existing search datasets often suffer from "shortcut" collapses where the intended multi-step search process is bypassed. For example, a question might expose a specific date or name that makes downstream queries immediately executable, or a single clue might be so selective that the agent identifies the answer before acquiring the intended evidence.
FORT intervenes during data synthesis by controlling four shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. It uses an internal evidence graph as a workspace to construct derived facts and fuzzes surface constants, then uses an adversarial agent to prune any draft questions that can be solved too quickly or via prior knowledge.
FORT-Searcher outperforms all comparable-size open-source search agents on deep search benchmarks.
Evaluated on BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and Seal-0, FORT-Searcher achieves an overall score of 66.2, outperforming the next-best comparable model (MiroThinker-1.7-mini) by 1.6 points.
Trajectory analysis confirms that FORT-trained models exhibit longer pre-answer search prefixes and lower prior-shortcut rates compared to models trained on standard open-source datasets, validating that the synthesis process successfully forces the intended evidence-acquisition behavior.
Why does "apparent" task complexity in existing datasets fail to translate into actual search difficulty?
Structural complexity (like hop count) does not account for the "cheapest identifying route." If a question exposes constants or contains highly selective clues, an agent can bypass the intended graph structure and reach the answer through a shortcut, rendering the structural complexity irrelevant to the realized search cost.
What is the role of the adversarial agent in the FORT pipeline?
The adversary acts as a trajectory-level calibrator. It attempts to solve draft questions; if it succeeds too quickly or uses prior knowledge to bypass search, the system identifies the specific shortcut risk (e.g., exposed constant) and triggers a repair of the question's clues or facts.
Introduction: The Shortcut Problem
We expose why synthetic search tasks let agents cheat and propose a shortcut‑resistant data synthesis.
Training deep search agents often fails because synthetic tasks contain “shortcuts” – partial clues that let the model answer without performing the intended multi‑step retrieval, collapsing the planned search process.
Shortcut risks are the ways a seemingly complex search task can be solved via a cheaper route, letting the agent bypass the intended evidence‑gathering steps.
Current search agents fail to generalize because training data often permits shortcuts that short‑circuit the intended multi‑step evidence acquisition.
Defining Search Difficulty
Formalizing how multi‑constraint retrieval tasks measure and expose search difficulty.
Multi‑turn retrieval tasks require an agent to satisfy a set of constraints by issuing queries to an interface $\Sigma$ and stopping once the answer is verified.
A task instance is a triple $q=(X, C_q, \Sigma)$ where $X$ is the answer space, $C_q$ is the collection of constraints, and $\Sigma$ is the retrieval interface that the agent can query.
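Example: the answer is a botanist, the candidates are $B_1, B_2, \dots$, and the question carries clues $c_1$, $c_2$, $c_3$. Apply $c_1$: only $B_1$ and $B_2$ match the fern species, so $\text{Ans}(\{c_1\})=\{B_1,B_2\}$.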
Apply $c_2$: $B_2$ is the only botanist whose advisor matches the clue, yielding $\text{Ans}(\{c_1,c_2\})=\{B_2\}$.
At this point the candidate pool is singleton; the agent can answer $B_2$ without checking $c_3$.
The task’s difficulty is already resolved after two clues because the intermediate candidate set shrinks to one, illustrating how a small subset can identify the answer.
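The candidate-set shrinkage in the botanist example can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the candidate pool, the extra botanist `B3`, and the exact match sets for `c2` and `c3` are assumptions chosen so that no single clue identifies the answer on its own.

```python
from itertools import combinations

# Hypothetical candidate pool and match sets (assumed values, not from the paper).
CANDIDATES = {"B1", "B2", "B3"}
SATISFIES = {
    "c1": {"B1", "B2"},        # fern-species clue
    "c2": {"B2", "B3"},        # advisor clue (B3 is an assumed extra match)
    "c3": {"B1", "B2", "B3"},  # third clue, assumed non-selective here
}

def ans(subset):
    """Return the candidates consistent with every constraint in `subset`."""
    pool = set(CANDIDATES)
    for name in subset:
        pool &= SATISFIES[name]
    return pool

def identifying_subsets():
    """List constraint subsets whose candidate set is a singleton, smallest first."""
    names = sorted(SATISFIES)
    found = []
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if len(ans(subset)) == 1:
                found.append(subset)
    return found

if __name__ == "__main__":
    print(ans(("c1",)))             # candidates left after the first clue
    print(identifying_subsets()[0])  # smallest identifying subset
```

Under these assumed match sets, the smallest identifying subset is `("c1", "c2")`, so a solver never needs `c3` to commit to an answer.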
Difficulty is measured by how many retrieval queries a no‑prior solver must issue (pure‑posterior cost $D_{\text{post}}$) versus how many a concrete solver actually uses ($\Omega$), with the gap $U_{\pi_0}$ capturing prior‑knowledge shortcuts.
No‑prior solver: query “fern species?” → retrieve page covering $c_1$; query “advisor name?” → retrieve page covering $c_2$; answer $B_2$ (2 queries).
Pretrained model: issue generic query “who is the botanist?” → model outputs $B_2$ without any external evidence (1 query).
The model’s internal knowledge lets it skip the evidence‑acquisition steps, creating a prior‑knowledge shortcut.
Proposition 1 formalizes the lower bound on difficulty: any identifying subset $P$ forces a route cost at least $\max(M_{\text{ev}}(P),\text{dep}(P))$, and the pure‑posterior cost $D_{\text{post}}$ is bounded below by the cheapest such route $Q^{\star}_{\Sigma}$.
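As a numeric illustration of Proposition 1, the sketch below evaluates the per-subset bound $\max(M_{\text{ev}}(P),\text{dep}(P))$ and then takes the minimum over identifying subsets. The subset names and all numbers are hypothetical, and reading $M_{\text{ev}}$ as a count of required evidence retrievals and $\text{dep}$ as a dependency depth is an assumption made only for this example.

```python
# Hypothetical statistics per identifying subset P (illustrative values only).
# "m_ev" stands in for M_ev(P) and "dep" for dep(P).
SUBSETS = {
    "P1": {"m_ev": 3, "dep": 2},
    "P2": {"m_ev": 1, "dep": 4},
    "P3": {"m_ev": 2, "dep": 2},
}

def route_lower_bound(stats):
    """Lower bound on the route cost of one subset: max(M_ev, dep)."""
    return max(stats["m_ev"], stats["dep"])

def cheapest_route_lower_bound(subsets):
    """Lower bound on the cheapest identifying route: min over subsets."""
    return min(route_lower_bound(s) for s in subsets.values())

if __name__ == "__main__":
    for name, stats in SUBSETS.items():
        print(name, route_lower_bound(stats))
    print("bound on cheapest route:", cheapest_route_lower_bound(SUBSETS))
```

The point of the example is that a single cheap subset (here `P3`) sets the bound for the whole task, regardless of how expensive the other subsets are.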
**Table 1.** From shortcut risks to FORT controls. The table connects the difficulty quantities in Section 2 with the construction mechanisms in Section 3.
The FORT-Searcher Pipeline
Methodology builds shortcut‑resistant questions via evidence‑graph synthesis and trains a searcher to follow multi‑turn evidence acquisition.
Shortcuts let a model guess answers without performing the intended multi-step retrieval, undermining the training signal. Our method eliminates these shortcuts by shaping both the data generation process and the agent's inference protocol.
The pipeline first synthesizes questions whose evidence graphs lack cheap identifying routes, then trains a search agent on the resulting multi‑turn trajectories.
An evidence graph is a workspace where nodes are real‑world entities and edges are verified facts; it lets us regulate the factors that create cheap identifying routes.
Algorithm 1 Evidence graph construction
Initialize $G_0$ with node $r$; $\delta(r)=0$, $Q=\{r\}$.
Expand $v=r$: collect atomic fact “Quantum X discovered by $e_1$” and derived fact “Quantum X belongs to field $f$”. Verify both, select the derived fact.
Add edge $(r,\text{belongs\_to},f)$; new node $w=f$ gets $\delta(w)=1$, $b=1$, $Q=\{f\}$.
Expand $v=f$: collect atomic fact “$f$ studied at institution $i$”. Verify and select it.
Add edge $(f,\text{studied\_at},i)$; new node $i$ gets $\delta(i)=2$, $b=2$, $Q$ remains empty (budget exhausted).
With $D=2$ and $B=2$, the graph forces a two‑hop reasoning chain, preventing a single‑clue shortcut while keeping the construction budget modest.
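The walkthrough above can be approximated with a breadth-first expansion bounded by a depth limit $D$ and an edge budget $B$. This is a minimal sketch under stated assumptions: the `FACTS` table stands in for retrieval plus verification, the "prefer derived facts" rule in `select_fact` is inferred from the example, and all function names are hypothetical.

```python
from collections import deque

# Hypothetical fact table: node -> list of (relation, target, is_derived, verified).
FACTS = {
    "Quantum X": [
        ("discovered_by", "e1", False, True),  # atomic fact
        ("belongs_to", "f", True, True),       # derived fact
    ],
    "f": [("studied_at", "i", False, True)],
    "i": [("located_in", "city", False, True)],
}

def select_fact(candidates):
    """Keep verified facts and prefer derived ones (assumed selection rule)."""
    verified = [c for c in candidates if c[3]]
    if not verified:
        return None
    # True ranks above False; ties resolve to the first verified candidate.
    return max(verified, key=lambda c: c[2])

def build_evidence_graph(root, max_depth, budget):
    """Expand from `root` until the depth limit or the edge budget is reached."""
    depth = {root: 0}
    edges = []
    queue = deque([root])
    while queue and len(edges) < budget:
        v = queue.popleft()
        if depth[v] >= max_depth:
            continue  # node already sits at the depth limit
        fact = select_fact(FACTS.get(v, []))
        if fact is None:
            continue
        relation, target, _, _ = fact
        if target in depth:
            continue  # skip nodes that are already in the graph
        edges.append((v, relation, target))
        depth[target] = depth[v] + 1
        queue.append(target)
    return edges, depth

if __name__ == "__main__":
    edges, depth = build_evidence_graph("Quantum X", max_depth=2, budget=2)
    print(edges)
    print(depth)
```

With `max_depth=2` and `budget=2`, the sketch reproduces the two edges from the walkthrough and stops before expanding `i`.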
**Figure 2.** Overview of FORT, a shortcut-resistant synthesis pipeline.
Question formulation turns a selected subgraph $G^{*}$ into a natural‑language query, withholding intermediate entity names and fuzzing exact values to block exposed‑constant shortcuts. Adversarial refinement then runs a strong solver on the draft, repairing any remaining route‑level or prior‑binding shortcuts until the trajectory meets the desired solving cost.
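The refinement loop described above can be sketched as follows. Everything here is illustrative: the `Attempt` record, the `min_turns` threshold, the `max_rounds` cap, and the toy `solve` and `repair` callables are assumptions, and the real pipeline uses a strong solver and richer repairs than a string replacement.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    solved: bool      # did the adversary reach the correct answer
    turns: int        # search turns used before answering
    used_prior: bool  # answered without relying on retrieved evidence

def refine(question, solve, repair, min_turns=3, max_rounds=4):
    """Repair a draft until the adversary no longer finds a shortcut.

    Returns (question, accepted); accepted is True only when the final
    draft was solved without triggering a shortcut.
    """
    for _ in range(max_rounds):
        attempt = solve(question)
        shortcut = attempt.solved and (
            attempt.used_prior or attempt.turns < min_turns
        )
        if not shortcut:
            return question, attempt.solved
        reason = "prior_binding" if attempt.used_prior else "cheap_route"
        question = repair(question, reason)
    return question, False

def toy_solve(question):
    # Toy adversary: an exposed year makes the draft solvable in one turn.
    if "1987" in question:
        return Attempt(solved=True, turns=1, used_prior=False)
    return Attempt(solved=True, turns=5, used_prior=False)

def toy_repair(question, reason):
    # Toy repair: fuzz the exposed constant into a coarser description.
    return question.replace("1987", "the late 1980s")

if __name__ == "__main__":
    draft = "Which lab founded in 1987 hosted the study?"
    print(refine(draft, toy_solve, toy_repair))
```

In this toy run the first attempt succeeds in one turn, the exposed year is fuzzed, and the second attempt takes five turns, so the repaired draft is accepted.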
Finally, the generated trajectories train FORT-Searcher, which learns to acquire evidence over multiple turns before answering. At inference time, it manages context so that retrieved evidence is reused across turns, restarting only when the turn limit is reached without an answer.
Experimental Results
FORT‑Searcher sets new performance highs across benchmarks, especially with context management.
FORT‑Searcher attains the highest overall score of 66.2 across five benchmarks, beating the next best comparable‑size open‑source agent by 1.6 points.
Table 6 shows the overall average of 66.2, surpassing MiroThinker‑1.7‑mini (64.6) and Qwen3.5‑35B‑A3B (60.0).
**Figure 1.** Performance of FORT-Searcher against other search agents on BrowseComp and BrowseComp-ZH.
**Table 6.** Performance comparison of various AI agents across different benchmarks (BrowseComp, BC-ZH, xbench-05, xbench-10, Seal-0, and Overall).
**Table 7.** Effect of context management on FORT-Searcher across five benchmarks.
FORT‑Searcher consistently outperforms baselines on multi‑step search benchmarks.
Ablations and Trajectory Analysis
We dissect how each shortcut‑resistant component and refinement step shapes training difficulty and search behavior.
The central premise—that synthetic shortcuts can mask true search difficulty—still holds: we must look beyond raw trajectory length to see whether agents really expend effort.
A trajectory signature aggregates three observable signals (overall solving cost $\Omega$, time to first answer hit $T_{\text{hit}}$, and prior-shortcut rate $p_{\text{prior}}$) to diagnose how hard a question is for a search agent.
How do Trajectory Signatures differ from simply measuring trajectory length?
Length alone ($\Omega$) tells how many steps were taken, but it ignores *when* the answer appears ($T_{\text{hit}}$) and whether the model guessed correctly without any evidence ($p_{\text{prior}}$). The three-way signature therefore distinguishes “long but easy” from “long and genuinely hard”.
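One way to compute such a signature from logged trajectories is sketched below. The trajectory schema (`evidence` and `answer` keys) and the operational definitions are assumptions made for illustration: $\Omega$ is taken as the mean number of turns, $T_{\text{hit}}$ as the mean index of the first turn whose retrieved text contains the gold answer, and $p_{\text{prior}}$ as the share of trajectories that end on the gold answer although it never appeared in retrieved evidence.

```python
from statistics import mean

def signature(trajectories, gold):
    """Aggregate (omega, t_hit, p_prior) over a list of trajectories.

    Each trajectory is a list of turns; each turn is a dict with the
    hypothetical keys "evidence" (retrieved text) and "answer" (or None).
    """
    costs, hits, prior_count = [], [], 0
    for traj in trajectories:
        costs.append(len(traj))
        # First 1-based turn whose retrieved evidence mentions the gold answer.
        hit = next(
            (i for i, turn in enumerate(traj, start=1) if gold in turn["evidence"]),
            None,
        )
        if hit is not None:
            hits.append(hit)
        elif traj and traj[-1]["answer"] == gold:
            # Correct final answer with no supporting retrieval: prior shortcut.
            prior_count += 1
    return {
        "omega": mean(costs),
        "t_hit": mean(hits) if hits else None,
        "p_prior": prior_count / len(trajectories),
    }

demo = [
    [  # evidence-driven: the answer surfaces in turn 3
        {"evidence": "fern survey", "answer": None},
        {"evidence": "advisor list", "answer": None},
        {"evidence": "thesis by B2", "answer": "B2"},
    ],
    [  # prior shortcut: correct answer, gold never retrieved
        {"evidence": "", "answer": "B2"},
    ],
]

if __name__ == "__main__":
    print(signature(demo, "B2"))
```

Reporting the three values together separates the two demo trajectories, which a length-only metric would simply average.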
We first assess the contribution of each shortcut-resistant component by a cumulative ablation on 2K synthesized questions.
Next we examine whether adversarial refinement can recalibrate drafts that are either shortcut‑prone or initially unsolved.
We then compare FORT against existing open‑source deep‑search datasets using the same trajectory‑signature diagnostics.
Finally, we map the theoretical difficulty factors to observable trajectory‑level proxies on 200 successful questions per source.
Notation Summary
All symbols used in the difficulty framework are listed with their meanings.
Table 13 enumerates every symbol introduced in the paper and gives a concise plain‑English description.
**Table 13.** Summary of the main notation used in the shortcut-aware difficulty framework and trajectory diagnostics.
Formal Framework Details
Formal definitions, lower bounds, and illustrative shortcut cases for the difficulty framework.
For any task $q$, $D_{\text{post}}(q) \ge Q^{\star}_{\Sigma}$, where $Q^{\star}_{\Sigma} = \min_{P \in I_q} Q_{\Sigma}(P)$.
For any identifying subset $P \in I_q$, $Q_{\Sigma}(P) \ge \max\bigl( M_{\text{ev}}(P), \operatorname{dep}(P) \bigr)$.
If an identifying subset $P \in I_q$ can be verified by a single initially executable query, then $Q^{\star}_{\Sigma} = 1$.
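Chaining the first two statements gives a single lower bound on the pure-posterior cost. The derivation below uses only the inequalities stated above; the final remark additionally assumes that every route issues at least one query.

$$
D_{\text{post}}(q) \;\ge\; Q^{\star}_{\Sigma} \;=\; \min_{P \in I_q} Q_{\Sigma}(P) \;\ge\; \min_{P \in I_q} \max\bigl( M_{\text{ev}}(P), \operatorname{dep}(P) \bigr).
$$

The minimum on the right means that one cheap identifying subset is enough to lower the bound for the whole task. In the extreme case of the third statement, some $P$ has $Q_{\Sigma}(P) = 1$, and since no route can cost less than one query, $Q^{\star}_{\Sigma} = 1$: the only cost the framework can then guarantee is a single query, however many constraints the question lists.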
Case 1 illustrates an evidence co‑coverage shortcut: a single retrieved snippet simultaneously supplies the answer entity and multiple answer‑side facts, collapsing the intended multi‑step evidence chain.
Case 2 shows a single‑clue selectivity shortcut: one highly selective clue (the full‑color TV broadcast) retrieves the correct year, after which the remaining constraints serve only as post‑hoc verification.
Case 3 demonstrates an exposed‑constant shortcut: a distinctive phrase (“want to work, not protest”) appears verbatim in the retrieved evidence, directly naming the target individual.
Case 4 reveals a prior‑knowledge binding shortcut: the model supplies the answer “Cleopatra” before any retrieval step, relying on its internal knowledge rather than evidence acquired during the search.