Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Harness-1 offloads search bookkeeping to a stateful environment, enabling 20B models to outperform larger agents.

How can we improve search agent performance by offloading state management from the model's internal context to an external, stateful harness?

Search agents typically force the model to manage both semantic search decisions and routine bookkeeping, such as tracking seen documents and verification status, within an append-only transcript. This dual burden makes reinforcement learning inefficient and poorly conditioned. Harness-1 introduces a stateful harness that maintains a persistent working memory, including candidate pools, evidence graphs, and verification records. The model acts as a policy that edits this explicit state rather than merely extending a raw text transcript. On eight retrieval benchmarks, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open-source sub-agent by 11.4 points and remaining competitive with frontier-scale models.

Paper Primer

The core mechanism is stateful cognitive offloading: the harness manages the "recoverable state" of a search episode, while the policy retains semantic control. The harness renders a structured view of the search progress—including importance-tagged documents and evidence-graph summaries—directly into the model's prompt, allowing the agent to perform targeted edits like curation and verification.

Harness-1 significantly improves retrieval recall across diverse domains.

Average curated recall of 0.730 across eight benchmarks (web, finance, patents, multi-hop QA). Outperforms the next strongest open-source sub-agent (Tongyi DeepResearch 30B) by +11.4 points.

The system's effectiveness is most visible in transfer performance: Harness-1's gains over Context-1 are about 2.2× larger on held-out benchmarks not used during training (+17.0 points on average) than on its source-family benchmarks (+7.9 points). This suggests the policy learns generalizable operations over the harness interface rather than memorizing specific search patterns.

Why does the harness need to be "stateful" rather than just providing a better prompt?

A stateful harness allows the model to perform persistent edits (e.g., adding, removing, or importance-tagging documents) that update the environment's memory. This prevents the policy from having to reconstruct the search state from a growing, append-only transcript, which otherwise makes reinforcement learning unstable.

What happens if the harness is removed at inference time?

Ablation studies show that disabling harness mechanisms causes the policy to revert to a "wide, shallow, search-dominated mode." The model continues to search but loses the ability to prioritize or verify evidence, and final-answer recall drops by roughly 5 to 8 percent depending on which mechanism is removed.

For retrieval-agent design, the interface matters as much as the model architecture: offloading bookkeeping to a stateful harness lets a smaller model stay competitive with frontier-scale searchers by focusing compute on semantic decision-making.

Introduction: The Search Agent Bottleneck

Frames internal context overload and presents Harness‑1 as an external offloading solution.

Search agents are typically trained as policies that ingest an ever‑growing transcript of tool calls and observations. As the episode proceeds the model must both decide what to search next and keep track of which documents have been seen, which evidence is still useful, and which constraints remain open. This dual burden forces reinforcement learning to optimise semantic search behaviour and routine bookkeeping simultaneously, which we argue is inefficient and poorly conditioned.

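As a toy illustration, suppose each turn appends 2 KB of tool output to the transcript and the context window holds 4 KB.

Turn 1 → 2 KB stored (total 2 KB).

Turn 5 → 10 KB stored in total, already beyond the 4 KB limit.

Turn 10 → 20 KB stored in total, requiring the model to drop earlier turns to stay within the window.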

This illustrates why internal context quickly becomes a bottleneck for long‑horizon search, motivating an external state store.
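The arithmetic in this toy example can be replayed in a few lines. This is a minimal sketch that assumes, as the numbers imply, 2 KB appended per turn and a 4 KB window; `KB_PER_TURN`, `WINDOW_KB`, and both function names are illustrative and not part of Harness-1.

```python
KB_PER_TURN = 2  # assumption: every turn appends 2 KB to the transcript
WINDOW_KB = 4    # assumption: the 4 KB limit used in the example


def transcript_size_kb(turn: int) -> int:
    """Size of an append-only transcript after `turn` turns."""
    return KB_PER_TURN * turn


def first_overflow_turn() -> int:
    """First turn at which the transcript no longer fits in the window."""
    turn = 1
    while transcript_size_kb(turn) <= WINDOW_KB:
        turn += 1
    return turn


for t in (1, 5, 10):
    size = transcript_size_kb(t)
    print(f"turn {t}: {size} KB, exceeds window: {size > WINDOW_KB}")
print("first overflow at turn", first_overflow_turn())
```

Under these assumptions the window is already exceeded a few turns in, which is the pressure an external state store is meant to relieve.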

We introduce Harness‑1, a 20B search agent that delegates all recoverable state to a persistent external harness. The harness maintains a candidate pool, an importance‑tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and a budget‑aware context rendering. The policy therefore only decides what to search, which documents to keep, what to verify, and when to stop.

In practice the agent observes recent tool results together with a rendered WORKING MEMORY that summarises the full search state. Actions edit this state: curate adds, removes, or importance‑tags documents; verify checks policy‑written claims against remembered evidence; review re‑renders previously seen documents without issuing a new corpus call. The same renderer is reused for supervised‑fine‑tuning replay, RL rollouts, and evaluation.

**Figure 1.** Averaged performance across eight challenging search benchmarks. Each method reports: Curated Set Recall and Trajectory Recall (documents encountered anywhere in the episode). Harness-1 is our 20B open search agent trained with a stateful harness; it substantially improves over open search agents and remains competitive with much larger frontier-model searchers.

The key insight is that moving bookkeeping from the policy’s internal context to an external harness frees the agent to make higher‑quality semantic search decisions.

The Harness-1 Architecture

External harness offloads bookkeeping so the policy can focus on search decisions.

The policy prompt can hold only a few hundred tokens, yet a naïve search transcript quickly exceeds that budget, forcing the model to truncate useful information.

WorkingMemory splits the episode state into a prompt‑facing tier that shows a compact snapshot (candidate pool, curated set, importance tags, evidence graph, verification records) and an outer tier that stores the full‑text of every retrieved chunk.

Step 1: BM25 selects the top 4 sentences from each of three retrieved documents, yielding 12 sentences (≈ 48 tokens) for $P_t$.

Step 2: Deduplication removes two identical sentences, leaving 10 sentences (≈ 40 tokens).

Step 3: The evidence graph $G_t$ extracts entities “Brussels”, “1878”, “Grande Synagogue” and records which documents contain each.

Step 4: The policy issues

Step 5: The harness updates the budget marker to $B_t = 18$ tokens after accounting for the added importance tag.

WorkingMemory lets the policy edit a bounded snapshot while still having unlimited raw evidence available for later verification.

One turn of the Harness‑1 loop.

**Figure 2.** **Overview of Harness-1.** The policy makes semantic decisions over search, inspection, curation, verification, and termination. The harness maintains the state around those decisions: document pool, importance-tagged curated set, evidence graph, verification records, compression and deduplication, result summaries, and budget markers. The same state interface is used for teacher rollouts, SFT replay, CISPO RL, and evaluation.

The image displays an evidence graph representation where entities are linked across multiple documents. The text describes how document 22816_0 connects three entities, while 91442_3 connects two. The boxed section lists specific entities (Brussels, 1878, Grande Synagogue) and the document IDs associated with them, noting the presence of singleton entities for potential future hops.
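An evidence graph of this kind can be sketched as an inverted index from entities to the documents that mention them. The snippet below is only an illustration: the function names are hypothetical, and which two entities document 91442_3 contains is an assumption, since the description only states that it connects two.

```python
from collections import defaultdict


def build_evidence_graph(doc_entities):
    """Invert doc -> entities into entity -> sorted list of document IDs."""
    graph = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            graph[entity].add(doc_id)
    return {entity: sorted(docs) for entity, docs in graph.items()}


def singleton_entities(graph):
    """Entities seen in exactly one document: candidates for a further hop."""
    return sorted(e for e, docs in graph.items() if len(docs) == 1)


doc_entities = {
    "22816_0": ["Brussels", "1878", "Grande Synagogue"],  # connects three entities
    "91442_3": ["Brussels", "1878"],  # assumed pair; the source only says "two"
}

graph = build_evidence_graph(doc_entities)
for entity, docs in graph.items():
    print(entity, "->", docs)
print("singletons:", singleton_entities(graph))
```

Keeping the graph as an entity-to-documents map makes the summary cheap to render into the prompt, because only entity names and document IDs are shown rather than full text.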

The separation of bookkeeping from semantic search lets the policy allocate its limited prompt bandwidth to high‑level reasoning instead of raw retrieval bookkeeping.

How does WorkingMemory differ from a generic external memory buffer?

A generic buffer merely stores data; WorkingMemory actively renders a compact, task‑specific snapshot (candidate pool, curated set, evidence graph, budget) into the prompt while keeping the full raw store hidden but addressable via explicit actions.

Training and Optimization

We first teach the model the harness interface, then let RL refine search decisions with a shaped terminal reward.

Because the harness already stores most bookkeeping, the model’s internal capacity is freed for semantic search decisions. The training pipeline exploits this split: a brief supervised phase teaches the interface, then reinforcement learning sharpens the search policy.

First we show the model how to talk to the external harness, then we let it discover better search strategies by rewarding whole‑episode outcomes.

Tool‑diversity term: $\min(\nu/\nu_{0},1)=\min(2/3,1)=0.667$.

Answer‑bonus term: $\mathbf{1}[\rho_{A}>0]=1$, so $B_{A}=0.1$.

Miss penalty: $w_{\text{miss}}(\rho_{\tau A}-\rho_{A})=0.5\,(0.45-0.4)=0.025$.

Turn penalty: $\pi_{\text{turn}}(t)=0.01\times5=0.05$.

Plugging into the formula yields $R = 1\cdot0.5 + 1\cdot2\cdot0.6 + 1\cdot0.667 + 1\cdot0.4 + 1\cdot0.45 + 0.1 - 0.025 - 0.05 = 3.242$.

This toy calculation shows how the reward balances multiple desiderata: increasing any single positive term (e.g., tool diversity) directly lifts the overall score, while the miss and turn penalties lower it.
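The sum can be re-added mechanically as a check. The sketch below copies the terms exactly as they appear in the worked calculation; the variable names are illustrative, and no meaning is assumed for the terms that the text does not name.

```python
# Positive terms, in the order they appear in the worked sum (weights of 1 as written).
positive_terms = [
    1 * 0.5,
    1 * 2 * 0.6,
    1 * min(2 / 3, 1),  # tool-diversity term min(nu / nu_0, 1)
    1 * 0.4,
    1 * 0.45,
    0.1,                # answer bonus B_A
]

miss_penalty = 0.5 * (0.45 - 0.4)  # w_miss * (rho_tauA - rho_A)
turn_penalty = 0.01 * 5            # pi_turn(t) for t = 5

reward = sum(positive_terms) - miss_penalty - turn_penalty
print(f"R = {reward:.3f}")
```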

How does this SFT differ from ordinary supervised fine‑tuning of a language model?

Standard SFT trains the model to predict the next token given a text prompt. Here SFT trains the model to emit correctly formatted tool calls, follow the “search→curate” rhythm, and respect verification rules—essentially teaching the model to operate the external harness rather than just generate text.

**Figure 4.** Training-data scale. Paired bars show reported unique task/query units used for SFT and RL.

Experimental Results

Harness-1 sets a new state of the art among open‑source retrieval sub‑agents, achieving 0.730 average curated recall.

Harness‑1 is the strongest open‑source retrieval sub‑agent, achieving 0.730 average curated recall.

It surpasses the next best open model (Tongyi DeepResearch 30B) by +11.4 points on the eight‑benchmark suite.

**Table 2.** Search quality across benchmarks. We report curated-set Recall, Final-Answer Recall, and Trajectory Recall. Green and blue rows denote the overall best models among open-source small models and frontier models, respectively. Results are averaged over three runs.

**Figure 3.** **Transfer pattern.** Harness-1 gains more on held-out transfer benchmarks (+17.0 pts mean) than on source-family benchmarks (+7.9 pts), a 2.2x gap over Context-1.

**Table 1.** Ablation study of Harness-1 mechanisms.

**Figure 5.** **Training dynamics.** (a) Recall during RL training. (b) Mean tool diversity. Without the diversity bonus, recall plateaus and tool use collapses; with the bonus, diversity stabilises and final recall is higher.

Harness‑1 delivers consistently superior performance across a wide range of retrieval benchmarks.

Related Work

Survey of prior harness designs and agentic search methods.

Recent work treats the harness—the layer between a language model and its tools—as a critical design knob. Varying this interface can shift model performance by tens of points, so researchers have begun to study both fixed orchestration pipelines and learned, stateful harnesses.

Context‑1 is a lightweight, trainable module that prunes the agent’s retrieval context on‑the‑fly, letting the policy focus on the most relevant documents.

Design pattern where a fixed interface $I_0 = (A, O, T, r)$ is exposed to the policy; the policy learns to issue tool calls within this static contract.

Interleaves reasoning steps with tool calls by prompting the model to output “Thought:” and “Action:” lines.

Encourages the model to generate sub‑questions before answering, creating a shallow reasoning chain.

Combines information retrieval with chain‑of‑thought prompting, allowing the model to fetch evidence between reasoning steps.

Trains an LLM with reinforcement learning to issue queries directly to a real‑world search engine, learning both query formulation and result selection.

Jointly trains a single LLM to interleave reasoning and search calls under a question‑answering reward.

Decouples the search policy from the generator, training only the search component while keeping the generator frozen.

Stateful retrieval harness that externalizes bookkeeping (e.g., retrieved set $s_t = (P_t, C_t, I_t, D_t, G_t, V_t, H_t, B_t)$) from the policy.

Limitations and Ethics

We outline the practical limits, ethical considerations, and broader impact of Harness‑1, then detail its tool interface.

Recall that Harness‑1 offloads bookkeeping state to an external harness, letting the policy concentrate on semantic search decisions.

Limitations arise because the design targets evidence‑seeking retrieval tasks such as needle‑in‑a‑haystack or multi‑hop searches, leaving breadth‑oriented, open‑ended, or adversarial scenarios unsupported.

The harness relies on engineered components—a lightweight regex extractor for entities, an LLM‑based verifier for entailment, and a sentence‑level BM25 compressor—that can degrade in domains requiring full entity linking, handling highly technical claims, or preserving discourse‑level context.

Evaluation is constrained by the size and annotation quality of the benchmarks; confidence intervals, near‑duplicate qrels, and incomplete relevance judgments mean reported metrics reflect behavior under the specific benchmark and harness conditions, not universal search reliability.

Ethically, the work aims to make retrieval‑augmented systems more transparent by returning an explicit curated evidence set that can be inspected and verified.

However, a more capable search agent can retrieve sensitive, copyrighted, or misleading documents if connected to an unrestricted corpus, so deployment must include corpus access controls, logging, rate limits, privacy filters, and human oversight.

Crucially, Harness‑1 does not guarantee that the retrieved evidence is complete, unbiased, or correctly interpreted; high‑stakes domains still require expert review and validation against authoritative sources.

Broader impact includes reduced cost of evidence gathering, enabling applications such as scientific literature review, legal analysis, fact‑checking, and education.

Conversely, cheaper, stronger agents lower the barrier for privacy‑invasive searches, automated cherry‑picking of evidence, and large‑scale collection of sensitive information, so responsible use demands strict access restrictions and monitoring.

We will release only the retrieval subagent, harness implementation, data‑generation pipeline, and evaluation recipe, accompanied by documentation of intended use and dataset provenance; the system is not a general autonomous web agent.

Tool signatures exposed to the Harness‑1 policy include fan‑out search, single‑search, regex grep, document read, review of stored docs, curated set editing, claim verification, episode termination, and a no‑op prune operation for backward compatibility.

Harness Algorithms

Formal algorithms that drive Harness‑1’s state management across training and inference.

Without a disciplined update routine the policy quickly fills its context with low‑value bookkeeping, starving the search decision logic of useful capacity.

**Algorithm 1** Harness-1 stage driver.

**Require:** query $q$, tools $\mathcal{T}$, stage $s \in \{\text{teacher, sft, rl, infer}\}$, policy $\pi$ or replay trajectory $\tau$

**Ensure:** trajectory record for training stages, or curated set $C$ for inference

1: Initialize $P, C, I, D, G, V, H, U \leftarrow \emptyset$ and $\text{AUTOSEEDED} \leftarrow \text{false}$

2: Build a per-episode rerank instruction from $q$ and the dataset family

3: **for** $t = 1, \dots, T_{\max}$ **do**

4: $\quad x_t \leftarrow \text{RENDERCONTEXT}(q, P, C, I, D, G, V, H, \text{recent turns})$

5: $\quad$ **if** $s = \text{sft}$ **then**

6: $\quad\quad$ Read stored action $a_t$ and stored observation $o_t$ from $\tau$

7: $\quad$ **else**

8: $\quad\quad$ Sample a structured tool action $a_t \sim \pi(\cdot \mid x_t)$

9: $\quad\quad$ Execute $a_t$ with tools $\mathcal{T}$ to obtain raw observation $o_t$

10: $\quad$ **end if**

11: $\quad$ **if** $a_t$ is a retrieval or memory-inspection action **then**

12: $\quad\quad (o_t, P, D, G, U, C, I, \text{AUTOSEEDED}) \leftarrow \text{PROCESSOBSERVATION}(a_t, o_t)$

13: $\quad$ **else if** $a_t = \text{curate}$ **then**

14: $\quad\quad (C, I) \leftarrow \text{CURATE}(a_t, C, I, P)$

15: $\quad$ **else if** $a_t = \text{verify}$ **then**

16: $\quad\quad V \leftarrow V \cup \text{VERIFYCLAIM}(a_t, D)$

17: $\quad$ **else if** $a_t = \text{review\_docs}$ **then**

18: $\quad\quad o_t \leftarrow \text{REVIEWDOCS}(a_t, D)$

19: $\quad$ **else if** $a_t = \text{end\_search}$ **then**

20: $\quad\quad$ **break**

21: $\quad$ **end if**

22: $\quad$ Update search history $H$, result summary, token marker, and a working-memory snapshot

23: **end for**

24: Apply the stage-specific finalizer in Algorithm 6

Algorithm 1 driver – the thin loop that powers all stages.
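To make the control flow concrete, here is a heavily simplified Python sketch of the non-replay branch of the driver. It is not the authors' implementation: `HarnessState`, `render_context`, `run_episode`, the action dictionary layout, and the scripted policy are all hypothetical, and the evidence graph, deduplication, auto-seeding, and stage-specific finalizer are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessState:
    pool: dict = field(default_factory=dict)      # P / D: doc id -> stored text
    curated: dict = field(default_factory=dict)   # C with I: doc id -> importance tag
    verified: list = field(default_factory=list)  # V: claims passed to verify
    history: list = field(default_factory=list)   # H: tool names issued so far


def render_context(query, state):
    """Stand-in for RENDERCONTEXT: a compact text view of the state."""
    return (f"QUERY: {query}\nPOOL: {sorted(state.pool)}\n"
            f"CURATED: {state.curated}\nVERIFIED: {len(state.verified)}")


def run_episode(query, policy, tools, max_turns=8):
    state = HarnessState()
    for _ in range(max_turns):
        action = policy(render_context(query, state))
        name, args = action["tool"], action.get("args", {})
        if name == "end_search":
            break  # leave the loop before any further state update
        if name == "search":
            for doc_id, text in tools["search"](args["query"]).items():
                state.pool.setdefault(doc_id, text)
        elif name == "curate":
            for doc_id, tag in args["add"].items():
                if doc_id in state.pool:  # only pooled documents can be curated
                    state.curated[doc_id] = tag
        elif name == "verify":
            state.verified.append(args["claim"])
        state.history.append(name)
    return state


# Scripted stand-in for a learned policy, for illustration only.
script = iter([
    {"tool": "search", "args": {"query": "Grande Synagogue Brussels"}},
    {"tool": "curate", "args": {"add": {"d1": "high", "d9": "low"}}},
    {"tool": "end_search"},
])
tools = {"search": lambda q: {"d1": "Completed in 1878.", "d2": "Unrelated."}}
final = run_episode("When was it built?", lambda ctx: next(script), tools)
print(final.curated, final.history)
```

The point of the sketch is the division of labour: the policy only emits structured actions, while every state mutation happens inside the loop that owns the state.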

Think of the harness as a librarian: each new observation is a batch of books that must be sorted, de‑duplicated, and shelved while the catalogue (the WorkingMemory) stays consistent.

Chunk $d_1$ yields 6 sentences; BM25 selects the top 4 and they are added to $P$ and stored in $D[d_1]$.

Chunk $d_2$ is flagged as a near‑duplicate by $U$; it is discarded and the deduplication counter becomes 1.

Chunk $d_3$ passes duplication check; its first 8 ranked IDs are $\{a,b,c,d,e,f,g,h\}$ and are inserted into $C$ with $I(\cdot)=\text{fair}$.

$\text{AUTOSEEDED}$ is set to true and an “[AUTO‑POPULATED]” notice is appended to the observation log.

The processed observation (sentences from $d_1$ and $d_3$) is returned together with the updated state.

Even with a tiny budget, the routine guarantees that only the most relevant sentences survive and that duplicate evidence never inflates the WorkingMemory.

How does this observation processor differ from a naïve “append‑all‑results” buffer?

The naïve buffer would grow unbounded, mix duplicate evidence, and never rank importance. The processor filters by BM25, removes near‑duplicates via $U$, and tags the first $k$ results with a fair importance level, keeping the WorkingMemory both compact and query‑relevant.
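The sentence-selection step can be illustrated with a small, self-contained BM25 scorer. This is a sketch of standard Okapi BM25 written from scratch, not the harness's actual compressor; the tokenizer, the parameters `k1` and `b`, the function names, and the sample sentences are assumptions. Only the "top 4 sentences per chunk" setting comes from the text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_sentences(query, sentences, k=4, k1=1.5, b=0.75):
    """Return the k best-scoring sentences, kept in their original order."""
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / max(n, 1)
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    # Rank by score (ties broken by position), then restore reading order.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))[:k]
    return [sentences[i] for i in sorted(ranked)]


sentences = [
    "The Grande Synagogue stands in Brussels.",
    "It was completed in 1878.",
    "The weather was mild that year.",
    "Brussels is the capital of Belgium.",
    "The synagogue was designed by a local architect.",
    "Tickets are sold at the door.",
]
kept = bm25_top_sentences("Grande Synagogue Brussels 1878", sentences)
for sentence in kept:
    print(sentence)
```

In this example the two sentences that share no term with the query score zero and are dropped, which mirrors the six-sentence chunk being cut to four.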

Imagine the curated set $C$ as a limited‑size shelf: when a new book arrives, you compare its importance to the least valuable book on the shelf and evict the latter if the newcomer is better.

Incoming IDs $A=\{d_6,d_7\}$ with proposed importance $J(d_6)=\text{very high}$, $J(d_7)=\text{low}$.

Since $|C|=5$ equals $M$, the algorithm looks for the lowest‑ranked item $w$; $w=d_3$ (importance “low”).

For $d_6$, $J(d_6)$ (“very high”) outranks $I(w)$ (“low”), so $d_3$ is evicted, a [CAPACITY] marker is emitted, and $d_6$ is inserted with importance “very high”.

For $d_7$, $J(d_7)$ (“low”) does not outrank the new lowest item $d_5^{\text{low}}$, so $d_7$ is rejected and a [CAPACITY] marker records the miss.

Resulting set $C=\{d_6^{\text{very high}}, d_1^{\text{high}}, d_2^{\text{fair}}, d_4^{\text{fair}}, d_5^{\text{low}}\}$, ordered by importance.

The curation routine preserves high‑value entries while providing a transparent signal whenever the capacity limit forces a discard.

Why not simply keep the most recent $M$ items instead of ranking by importance?

Recency alone ignores the semantic value of an item. An older “very high” evidence piece can be far more useful than a fresh “low” result. The importance‑aware policy ensures that the curated set retains the strongest signals regardless of when they arrived.

Together, the driver, observation processor, and importance‑aware curation give Harness‑1 a deterministic, capacity‑bounded WorkingMemory that the policy can rely on across teacher‑trajectory generation, SFT replay, RL rollout, and inference.

Teacher Agent Loop

Details of the curation loop, guidance, quotas, and deduplication mechanisms.

The appendix spells out every piece of the curation pipeline that the main text only alludes to, from the teacher’s turn‑level loop to the low‑level eviction policy that keeps the curated set tractable.

The teacher repeatedly sends the full conversation context (system prompt, working memory, the last $K=3$ turns, and summaries of older turns) to the large language model and receives a tool‑selection JSON.

How does this loop differ from a standard RL training loop that only returns a scalar reward?

Instead of a single numeric feedback, the teacher receives a structured JSON containing the chosen tool, its arguments, and a free‑form reasoning string. The harness then produces a concrete observation (e.g., a retrieved document) that becomes part of the next prompt, turning the loop into a full‑state interaction rather than a bandit‑style reward signal.

The generation script injects turn‑level guidance: it enforces a search→curate rhythm, nudges verification after six turns with at least three curated documents, triggers backtracking when the last three searches yield ≤ 1 new document, and hints at tool diversity if grep or read‑document remain unused after four turns.
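These guidance rules amount to a handful of threshold checks. The sketch below is one possible reading, not the generation script itself: the function name, argument names, hint labels, and tool identifiers are hypothetical. It also assumes that "after N turns" means at least N completed turns, that the "≤ 1 new document" test applies to the combined yield of the last three searches, and that the search→curate rhythm means suggesting a curate step right after a search.

```python
def guidance_hints(turns_done, curated_count, new_docs_per_search,
                   tools_used, last_action):
    """Return the list of hints that one turn of guidance would inject."""
    hints = []
    if last_action == "search":
        hints.append("curate")  # assumed reading of the search->curate rhythm
    if turns_done >= 6 and curated_count >= 3:
        hints.append("verify")
    last_three = new_docs_per_search[-3:]
    if len(last_three) == 3 and sum(last_three) <= 1:
        hints.append("backtrack")
    if turns_done >= 4 and not {"grep", "read_document"} <= set(tools_used):
        hints.append("diversify")
    return hints


stalled = guidance_hints(
    turns_done=7,
    curated_count=3,
    new_docs_per_search=[2, 0, 1, 0],
    tools_used={"search", "curate"},
    last_action="search",
)
early = guidance_hints(2, 0, [3], {"search"}, "curate")
print(stalled)
print(early)
```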

Raw per‑dataset quotas (e.g., BC+ / 300, SEC / 250, Patents / 150, Web / 150, Web‑simple / 75, SEC‑simple / 75) yield roughly 1K trajectories; a recall gate of 0.10 discards low‑recall runs, leaving 899 trajectories that expand to about 26K training examples.

The first successful search automatically populates the curated set with the top‑$k$ reranked results ($k=8$), marked with an [AUTO‑POPULATED] token so the teacher knows to promote useful docs and prune irrelevant ones.

Search observations are processed like RL observations: BM25 selects the top‑4 sentences per chunk, MinHash‑LSH (64 permutations, 5‑gram shingles, Jaccard ≥ 0.85) deduplicates near‑duplicates, and a [Context: X/Y] token marks the observation’s provenance.

The policy assigns an importance tag (very high, high, fair, low) to each candidate document and evicts the lowest‑importance entry when the curated set fills up.

For each incoming document, read its importance tag (default “fair”) and map it to a numeric rank.

Identify the current curated document with the highest numeric rank (i.e., lowest importance).

If the incoming rank < worst rank, evict the worst‑ranked document and insert the new one.

If no document has a worse rank, reject the add and emit a [CAPACITY] status string listing up to five rejected IDs.

Incoming doc 606 arrives with tag “high” (rank = 1).

The worst‑ranked doc is any low‑tag entry (rank = 3); choose doc 404.

Since 1 < 3, evict doc 404 and insert doc 606 with its “high” tag.

Resulting set: very high 101, high 202, high 606, fair 303, low 505.

The policy never discards a higher‑importance document for a lower‑importance one, guaranteeing that the most valuable evidence remains reachable throughout the trajectory.
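The eviction rule can be written down directly from the steps above. This is a minimal sketch under stated assumptions: the rank mapping (very high = 0 through low = 3) follows the ranks quoted in the example, the starting set is reconstructed from the resulting set listed above, and `curate_add`, `capacity_status`, and the exact status-string format are hypothetical.

```python
RANK = {"very high": 0, "high": 1, "fair": 2, "low": 3}


def curate_add(curated, incoming, capacity):
    """Add {doc_id: tag} entries to `curated` in place, evicting by importance.

    Returns (evicted_ids, rejected_ids).
    """
    evicted, rejected = [], []
    for doc_id, tag in incoming.items():
        tag = tag if tag in RANK else "fair"  # default tag is "fair"
        if doc_id in curated or len(curated) < capacity:
            curated[doc_id] = tag
            continue
        # Worst entry = highest numeric rank; ties resolve to the earliest entry.
        worst = max(curated, key=lambda d: RANK[curated[d]])
        if RANK[tag] < RANK[curated[worst]]:
            del curated[worst]
            evicted.append(worst)
            curated[doc_id] = tag
        else:
            rejected.append(doc_id)
    return evicted, rejected


def capacity_status(rejected):
    """Hypothetical format for the [CAPACITY] string (at most five IDs)."""
    return "[CAPACITY] rejected: " + ", ".join(rejected[:5])


# Starting set reconstructed from the example's result (an assumption).
curated = {"101": "very high", "202": "high", "303": "fair",
           "404": "low", "505": "low"}
evicted, rejected = curate_add(curated, {"606": "high"}, capacity=5)
ordered = sorted(curated.items(), key=lambda kv: RANK[kv[1]])
print(evicted, rejected)
print(ordered)

# A second add with a "low" tag cannot displace anything and is rejected.
evicted2, rejected2 = curate_add(curated, {"707": "low"}, capacity=5)
print(capacity_status(rejected2))
```

Because an insert only succeeds when the newcomer strictly outranks the current worst entry, a full set can never trade a higher-importance document for a lower-importance one.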

Before a new chunk enters the working memory, the system checks whether a near‑duplicate already exists and drops the duplicate to keep the memory concise.

Attempt to import

Otherwise, compute a SHA‑1 prefix of the first 4 000 characters and compare against a hash set.

In either case, if a match is found, suppress the incoming chunk and increment the [Dedup] counter.

Tokenise both chunks, produce 5‑gram shingles, and feed them to a MinHash with 64 permutations.

Compute Jaccard similarity of the shingle sets → 0.87 (above 0.85 threshold).

Chunk B is identified as a near‑duplicate of Chunk A and is suppressed; the [Dedup] counter increments to 1.

Even though the wording differs, the high shingle overlap signals that the information is already present, preventing redundant storage.
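The near-duplicate check can be sketched with the standard library alone. The code below is illustrative rather than the harness's implementation: it uses seeded SHA-1 hashes as a stand-in for MinHash permutations, skips the LSH banding step, and truncates the exact-match key to 16 hex characters as an assumption. The 64 hash functions, 5-gram token shingles, 0.85 threshold, and 4,000-character prefix come from the text; the two chunks are synthetic.

```python
import hashlib

NUM_PERM = 64
SHINGLE_N = 5
THRESHOLD = 0.85


def shingles(text, n=SHINGLE_N):
    """Set of n-token shingles; short texts collapse to a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b)


def _hash(seed, shingle):
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set, num_perm=NUM_PERM):
    """One minimum per seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_key(text):
    """Fallback key: SHA-1 of the first 4,000 characters (16-hex prefix assumed)."""
    return hashlib.sha1(text[:4000].encode("utf-8")).hexdigest()[:16]


# Synthetic chunks: 100 tokens, with one token changed in the second chunk.
words = [f"w{i}" for i in range(100)]
chunk_a = " ".join(words)
chunk_b = " ".join(words[:50] + ["changed"] + words[51:])

set_a, set_b = shingles(chunk_a), shingles(chunk_b)
exact = jaccard(set_a, set_b)
estimate = estimated_similarity(minhash_signature(set_a), minhash_signature(set_b))
is_near_duplicate = exact >= THRESHOLD
print(f"exact={exact:.3f} estimate={estimate:.3f} duplicate={is_near_duplicate}")
```

With only 64 hash functions the estimate is noisy, so a value near the 0.85 threshold can fall on either side; the exact Jaccard value is printed alongside it for comparison.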

The ablation study (Table 3) isolates each state mechanism; removing any single component pushes the policy back toward a search‑dominated style, lowering final‑answer (FA) recall by roughly 5 to 8 percent while curated recall moves by at most about 4 percent.

Failure‑mode analysis shows that the importance‑tag system is the most critical: without it, the agent loses the gradient that guides it toward high‑value documents, causing a 4.1 % drop in recall and a 7.9 % drop in FA recall.

When importance tags are stripped, the agent’s action mix shifts: search‑corpus actions rise by 3–7 pp, while read‑document and verify actions collapse by factors of 2–6, confirming that the policy can no longer focus on promising candidates.

Similarly, disabling sentence‑BM25 compression removes the concise representation that the policy relies on for verification, leading to a steep decline in verification actions and a corresponding rise in blind searches.

Ablation Studies

Each ablation removes a Harness‑1 component and reports the resulting recall change.

Removing per‑return sentence compression leaves overall Recall essentially unchanged but cuts FA Recall by 7 %.

Recall $\Delta$ +0.2 % (13 hard fails); FA Recall $\Delta$ −7.0 % (13 hard fails). Read‑document rate drops from 4.1 % to 2.6 % and verify from 2.0 % to 0.4 % on failing queries.

Disabling the auto‑seed hurts FA Recall by 6.4 % while barely moving overall Recall.

Recall $\Delta$ −0.3 % (12 hard fails); FA Recall $\Delta$ −6.4 % (12 hard fails). Without an initial curated set the policy retains wrong documents during the cold‑start window.

Hiding the evidence graph reduces overall Recall by 2.6 % and FA Recall by 5.4 %.

Recall $\Delta$ −2.6 % (10 hard fails); FA Recall $\Delta$ −5.4 % (10 hard fails). Search actions rise from 88.9 % to 91.8 % and document reads fall three‑fold.
