Masking Stale Observations Helps Search Agents – Until It Doesn't: A Regime Map and Its Mechanism

Haoxiang Zhang, Qixin Xu, Zhuofeng Li, Lei Zhang, Pengcheng Jiang, Yu Zhang, Julian McAuley

Observation masking in search agents follows a regime-dependent utility curve, peaking at moderate proficiency before collapsing.

Does masking stale observations in long-horizon search agents improve performance, and under what conditions does this strategy fail?

Long-horizon search agents accumulate massive context trajectories, forcing a trade-off between retaining historical observations and managing context budgets. The authors implement a minimal masking policy that replaces stale observations with placeholders, using this as a diagnostic tool to map how performance gains scale across different model and retriever combinations. They find that masking follows an asymmetric inverted-U shape: it provides significant gains for mid-tier systems but becomes counterproductive for saturated models that can already filter their own context.

Paper Primer

The core mechanism is a token-for-turn trade-off: masking removes observations the model has largely stopped attending to, reclaiming context budget at the cost of optional re-reading. It acts as a filter that aligns weaker models with the attention patterns of stronger ones, forcing them to anchor back to the initial plan rather than getting lost in middle-trajectory noise.

Masking utility is non-monotonic and governed by the system's baseline proficiency.

Across offline benchmarks, gains plateau at +6–7 pts with weak retrievers, peak at +11.7 pts for mid-capacity models, and collapse toward zero for saturated models, bottoming out at a −1.1 pts penalty. Which outcome applies depends on the model-retriever regime.

The "saturated collapse" occurs because advanced models have already internalized noise filtering; for these agents, masking risks evicting critical evidence that the model would otherwise have used, leading to a surge in costly re-search tool calls.

Why does masking help some agents but actively harm others?

Masking helps when it clears noise the model would not have used, but it backfires in saturated regimes where the model is already capable of filtering its own context; in these cases, masking removes evidence the model would have otherwise utilized, forcing it to perform redundant, expensive tool calls.

Is this a new architecture or a diagnostic study?

It is a diagnostic study. The authors use a minimal, turn-based masking policy as an instrument to map the interaction between retriever recall and model capacity, rather than proposing a new, complex context-management algorithm.

Context management is not a universal performance booster; engineers should pivot from aggressive heuristic pruning toward high-fidelity retrieval as their backbone models scale.

Introduction

Long‑horizon agents must trade context budget against retaining useful observations.

Long‑horizon search agents continuously append retrieved snippets, tool outputs, and reasoning steps, so by the end of a trajectory the context can span tens of thousands of tokens and become dominated by old observations. A minimal intervention is to mask stale observations—replacing them with placeholders—yet it is unclear when this helps and when it harms performance.

Observation Masking swaps out old tool outputs with generic placeholders, freeing context slots while keeping the surrounding reasoning and tool‑call structure intact.

Initial context length = $12$ tokens (three observations) + $4$ reasoning tokens = $16$ tokens.

Mask the oldest observation (the first $4$‑token block), replacing it with a $1$‑token placeholder.

New context length = $1$ placeholder + $8$ tokens from the two remaining observations + $4$ reasoning tokens = $13$ tokens, freeing $3$ tokens for additional reasoning.

This toy illustrates the token‑for‑turn trade‑off: masking frees space for new reasoning turns while discarding information the model has already stopped attending to.
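The arithmetic of the toy example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `context_length` and the one-token placeholder size are assumptions taken from the toy, not from the paper's implementation.

```python
def context_length(observation_lengths, reasoning_tokens, masked, placeholder_tokens=1):
    """Total tokens when the observations at the indices in `masked` become placeholders."""
    total = reasoning_tokens
    for index, length in enumerate(observation_lengths):
        # A masked observation costs only the placeholder; others cost their full length.
        total += placeholder_tokens if index in masked else length
    return total


observations = [4, 4, 4]  # three 4-token observations, as in the toy example
before = context_length(observations, reasoning_tokens=4, masked=set())
after = context_length(observations, reasoning_tokens=4, masked={0})
print(before, after, before - after)  # 16 13 3
```

The freed budget is simply the masked observation's length minus the placeholder's length, so longer stale observations free proportionally more room.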

How does Observation Masking differ from summarizing or compressing the context?

Summarization tries to distill the semantic content of old observations into a shorter text, which still consumes tokens and requires extra computation. Compression rewrites the text into a denser representation (e.g., via a learned encoder), again using tokens. Masking, by contrast, removes the tokens entirely and replaces them with a neutral placeholder, incurring virtually no compute overhead and guaranteeing that no residual semantic weight remains.

The core trade‑off is between preserving enough context to retain critical evidence and freeing budget for new reasoning turns.

The Masking Mechanism

This section details the observation-masking policy and the surrounding scaffold for long-horizon search.

In long‑horizon search the context $C_t$ grows linearly with the number of turns, and observations $o_i$ dominate the token budget, leaving little room for new reasoning.

Identify stale observations: $o_1$ and $o_2$ are older than $t-K=3$, so they are candidates for masking.

Check for errors: none of $o_1$–$o_2$ contain tool‑call errors, so both are replaced by $\tilde{o}$.

Render context $C_5$: keep $o_3$, $o_4$, $o_5$ unchanged; substitute $\tilde{o}$ for $o_1$ and $o_2$.

Resulting token budget: $|C_5| = 3(\tilde{o}) + 5 + 2 + 6 = 16$ tokens, versus $20$ tokens without masking.

Agent can still reference $o_1$ or $o_2$ via their page‑pool cursors if needed later.

Masking cuts the context size without losing the ability to revisit the underlying pages, isolating the effect of stale content on performance.
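The masking steps above can be sketched as a small rendering function. This is a minimal sketch under stated assumptions: the `Turn` dataclass, the `render_masked` name, and the placeholder wording are hypothetical; only the rules themselves (keep the most recent $K$ observations, never mask reasoning or tool calls, keep tool-call errors visible) come from the text.

```python
from dataclasses import dataclass

# Hypothetical placeholder text; the source only calls it a placeholder.
PLACEHOLDER = "[observation masked; re-open the page by its cursor to read it again]"


@dataclass
class Turn:
    reasoning: str
    action: str
    observation: str
    is_error: bool = False


def render_masked(turns, keep_last):
    """Render a trajectory, masking every observation older than the last `keep_last`."""
    first_visible = len(turns) - keep_last
    lines = []
    for index, turn in enumerate(turns):
        stale = index < first_visible
        # Error retention: a stale tool-call error stays visible instead of being masked.
        shown = turn.observation if (not stale or turn.is_error) else PLACEHOLDER
        # Reasoning and tool calls are always rendered unchanged.
        lines.extend([turn.reasoning, turn.action, shown])
    return lines


turns = [Turn(f"r{i}", f"a{i}", f"o{i}") for i in range(1, 6)]
context = render_masked(turns, keep_last=3)
print(context[2] == PLACEHOLDER, context[8])  # True o3
```

Because the function only changes what is rendered, the original observations stay in `turns`, which mirrors the idea that the underlying pages remain reachable after masking.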

How does Observation Masking differ from simply truncating the context to the last $K$ tokens?

Truncation discards both the observation content *and* the structural metadata (tool name, arguments), making the page unrecoverable. Masking replaces the content with $\tilde{o}$ while preserving the metadata and keeping the page in the persistent pool, so the agent can later re‑open the page without a new search.

Render the current trajectory $H_{t-1}$ into context $C_{t-1}$.

Apply Observation Masking to $C_{t-1}$, yielding $C_{t-1}^{\text{masked}}$.

Policy $\pi$ samples the next reasoning chain $r_t$ and action $a_t$ conditioned on $C_{t-1}^{\text{masked}}$.

Execute $a_t$ in the environment, receive observation(s) $o_t$.

Update the page pool $P_t = P_{t-1} \cup \text{NEWPAGES}(a_t, o_t)$.

Append $(r_t, a_t, o_t)$ to the trajectory $H_t$ and re‑render to obtain the next context $C_t$.
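The loop above can be sketched end to end with a toy environment. Everything named here (`run_episode`, `render`, `toy_environment`, `scripted_policy`, the `"<masked>"` string, integer cursors) is a hypothetical stand-in for illustration; error retention is omitted to keep the sketch short.

```python
PLACEHOLDER = "<masked>"  # hypothetical placeholder string


def render(question, history, keep_last):
    """Render H into a masked context: only the last `keep_last` observations stay visible."""
    first_visible = len(history) - keep_last
    parts = [question]
    for index, (reasoning, action, observation) in enumerate(history):
        shown = observation if index >= first_visible else PLACEHOLDER
        parts += [reasoning, repr(action), shown]
    return parts


def run_episode(policy, environment, question, max_turns, keep_last):
    history = []    # H_t: list of (r_t, a_t, o_t)
    page_pool = {}  # P_t: cursor -> full page text, never masked
    for _ in range(max_turns):
        context = render(question, history, keep_last)
        reasoning, action = policy(context)
        if action[0] == "answer":
            return action[1], history, page_pool
        observation, new_pages = environment(action, page_pool)
        page_pool.update(new_pages)
        history.append((reasoning, action, observation))
    return None, history, page_pool


def toy_environment(action, page_pool):
    kind, arg = action
    if kind == "search":
        text = f"result for {arg}"
        return text, {len(page_pool): text}  # new page stored under the next cursor
    if kind == "open":
        return page_pool[arg], {}            # re-open an earlier page from the pool
    raise ValueError(f"unknown action: {kind}")


def scripted_policy(script):
    steps, seen = iter(script), []
    def policy(context):
        seen.append(list(context))  # record what the policy was shown
        return "think", next(steps)
    return policy, seen


policy, seen = scripted_policy(
    [("search", "a"), ("search", "b"), ("open", 0), ("answer", "done")]
)
answer, history, pool = run_episode(policy, toy_environment, "q?", max_turns=10, keep_last=1)
print(answer, history[2][2])  # done result for a
```

In this run the first search result is already masked in the context when the agent issues `("open", 0)`, yet the page pool still returns its full text, which is the property the scaffold relies on.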

**Figure 2.** **Left:** Observation masking at turn $t$. The most recent $K$ observations are retained in the visible context; earlier observations are replaced with a placeholder $\tilde{o}$. Reasoning chains $r_i$ and tool calls $a_i$ are never masked. Crucially, masking only removes an observation from the model's context. The underlying page remains in the page pool (§2.3) and stays reachable by its cursor, id, or URL, so the agent can re-open it even after $\tilde{o}$ replaces the original content. **Right:** Context composition averaged across the Qwen3.5/3.6 family (4B–35B) and three retrievers (BM25, Qwen3-Emb-8B, AgentIR-4B) on BrowseComp-Plus trajectories at termination. Environment observations $o_i$ account for over 85% of the overall content tokens, making them the natural target of compression.

Together, Observation Masking and the page‑pool scaffold let a long‑horizon agent keep its context lean while retaining full access to any previously fetched page.

Experimental Framework

We detail benchmarks, models, retrievers, and baseline settings for the experiments.

We evaluate on four agentic‑search benchmarks covering offline and live‑web retrieval settings and multiple languages.

BrowseComp-Plus: a fixed-corpus benchmark where agents must browse a static knowledge base to answer multi-step queries, stressing long-horizon search.

AgentIR: a dense retriever trained explicitly for agentic search, aligning retrieved passages with the agent's internal reasoning signals.

We test open‑weight tool‑calling agents ranging from 4 B to 284 B parameters across several families.

Retrievers span classic sparse BM25, a mid‑tier dense embedding model, and a dense retriever tuned for agentic search.

Our No‑CM scaffold consistently outperforms the strongest publicly reported baselines, providing a conservative platform for measuring observation‑masking gains.

The Regime Map

Main results quantify observation‑masking gains across regimes and benchmarks.

This section presents the quantitative picture of observation masking, focusing on where it helps, where it hurts, and why those patterns emerge.

A regime map partitions the space of retriever quality versus model capability into three zones that predict whether masking will help or hurt.

How does this regime map differ from simply “more data = better performance”?

Because masking discards context, its effect depends on whether the retained context already contains enough correct evidence. In the bottleneck zone extra evidence is scarce, so masking cannot create it; in the saturated zone the model already extracts the signal, so discarding anything hurts.

The retriever bottleneck regime occurs when the upstream retriever fails to surface enough answer‑supporting passages, limiting any downstream benefit from masking.

Why doesn’t a stronger model simply overcome a weak retriever?

Because the model can only reason over what is present in the context. If the retriever fails to retrieve the correct passage, the model has no signal to act on, regardless of its capacity.

Observation masking delivers a peak accuracy gain of +11.7 pts when a strong retriever (recall 0.88) meets a moderately capable model (No‑CM 62.9 %).

Table 1 (left) shows the Qwen3.5‑35B‑A3B + AgentIR pair achieving the largest $\Delta$Acc.

The three regimes appear clearly in Figure 1: the left panel plots $\Delta$Accuracy versus baseline accuracy, revealing a low‑gain plateau (Retriever bottleneck), a pronounced peak (CM optimum), and a collapse into negative gains (Model‑saturated). The right panel’s scatter plots confirm that masking is most beneficial when input complexity is high and the fitted signal‑to‑noise ratio is low.

**Figure 1.** **Left:** Observation masking exhibits three context management (CM) regimes. CM adds little in the *Retriever bottleneck* plateau because there is little answer-supporting evidence. CM is most effective in the *Middle*, where it strips away noise that the model cannot yet filter from evidence, and collapses when *Model Saturated* because it evicts crucial signals. **Right:** CM helps when the signal is sparse and the input is complex. Each point represents a sampled No-CM input prefix. The x-axis is the first principal component over input trace features, where increasing values capture greater complexity; the y-axis is the normalized fitted SNR. Green and red dots denote CM-rescued and unchanged cases, respectively. *Saturated models* exhibit more separable rescue subsets, whereas the *retriever bottleneck* weakens the baseline signal, sharply suppressing this separability.

Scale alone does not dictate the regime: Qwen3.5‑35B‑A3B and Qwen3.6‑35B‑A3B share architecture and parameter count yet fall in different zones (+11.7 pts vs +3.7 pts), highlighting the role of training dynamics.

Live‑web retrieval further sharpens the collapse. On the GAIA benchmark, GPT‑OSS‑120B’s masking effect flips from a negligible +0.1 pts on BrowseComp‑Plus to a detrimental ‑4.8 pts, reflecting noisier, less controllable context.

**Figure 3.** Our scaffold consistently achieves higher No-CM accuracy on BrowseComp-Plus and GAIA than the best publicly reported results for matched model–retriever pairs (Chen et al., 2025c; Team et al., 2025b; Li et al., 2026a; Chen et al., 2026).

Table 1 (right) repeats the $\Delta$Accuracy pattern on three additional live‑web benchmarks (GAIA, xBench, BrowseComp‑ZH), confirming that the regime behavior is task‑dependent rather than model‑specific.

**Table 1.** The regime map of observation masking. Left: BrowseComp-Plus results across model–retriever pairs, reporting accuracy (Acc.) with and without CM, retriever recall under CM with respect to the gold documents, the additional tool calls per query induced by CM ($\Delta$ calls/q), and the accuracy gain from CM ($\Delta$Acc.). Accuracy values marked with $^\dagger$ are evaluated with stale-observation masking; **bold** indicates $\Delta$Acc. $\ge$ 8.0 and the best value in each remaining column. The largest $\Delta$Acc. values occur when less capable models are paired with strong retrievers, while the gains decay sharply once No-CM accuracy is sufficiently high. Right: The same $\Delta$Acc. trend evaluated on three additional benchmarks: GAIA, xBench, and BrowseComp-ZH.

Ablations and Trade-offs

Ablation analysis quantifies how Observation Masking trades accuracy for token and turn costs and validates scaffold design.

We first refresh the core premise: masking stale observations reclaims context budget, but its net accuracy gain masks heterogeneous query‑level effects.

**Figure 4.** Comparison of per-query change in rolling input tokens and additional turns induced by CM on BrowseComp-Plus. From left to right: Qwen3.5-35B+AgentIR (tokens, turns) and GPT-OSS-120B+AgentIR (tokens, turns).

Comparing the high‑gain Qwen3.5‑35B‑A3B (+11.7 pts) with the saturated GPT‑OSS‑120B (+0.1 pts) reveals two patterns: masking saves tokens on queries that flip from wrong to correct, yet broken queries (correct→wrong) consume many more tokens and turns because the agent must re‑search.
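The query-level bookkeeping behind this comparison can be sketched as follows. The category names "rescued" and "broken" follow the text; the function names and the ten-query toy data are hypothetical and are not the paper's results.

```python
from collections import Counter


def flip_category(correct_without_cm, correct_with_cm):
    """Label one query by how its outcome changes when masking is switched on."""
    if correct_with_cm and not correct_without_cm:
        return "rescued"    # wrong -> correct
    if correct_without_cm and not correct_with_cm:
        return "broken"     # correct -> wrong
    return "unchanged"


def net_accuracy_delta(outcome_pairs):
    """Return per-category counts and the net accuracy change in percentage points."""
    counts = Counter(flip_category(off, on) for off, on in outcome_pairs)
    delta_pts = 100.0 * (counts["rescued"] - counts["broken"]) / len(outcome_pairs)
    return counts, delta_pts


# Toy data: (correct without CM, correct with CM) for ten queries.
pairs = [(False, True)] * 3 + [(True, False)] * 1 + [(True, True)] * 4 + [(False, False)] * 2
counts, delta = net_accuracy_delta(pairs)
print(dict(counts), delta)
```

The net gain is the difference between rescued and broken queries, so a small aggregate delta can hide many flips in both directions.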

The probe learns a low‑dimensional signature of a prefix’s evidence density so that, before masking, we can tell whether masking will rescue the query.

How does this regression probe differ from a standard retrieval‑quality metric?

Standard metrics (e.g., BM25 score) evaluate a single retrieved document, whereas the probe aggregates all pages seen up to a prefix and explicitly balances gold hits against noisy observations, capturing the trajectory‑wide signal that determines whether masking will be helpful.

Removing error retention (that is, masking tool-call errors as well) raises the open-error rate by +4.0 pts for Qwen3.5-4B and +4.2 pts for Qwen3.5-9B.

Table 2 shows error rates increase from 18.60 % to 22.61 % (4B) and from 20.41 % to 24.56 % (9B) when error retention is omitted.

Replacing exact URLs with blurred titles raises the open-error rate by +2.2 pts for Qwen3.5-4B and +5.8 pts for Qwen3.5-9B.

Table 2 reports 20.80 % (4B) and 26.20 % (9B) error rates for the blurred‑title variant, compared with the baseline 18.60 % and 20.41 %.

Attention analysis (Section 5.3.1) shows reasoning tokens attract ~53.7 % of attention while observations attract only ~25.6 %; observation attention is front‑loaded, reasoning attention forms a U‑shape that revisits early reasoning steps.

Re‑open distance analysis (Section 5.3.2) reveals a bimodal pattern—agents mostly re‑open the most recent or the very first page; CM amplifies the first‑page re‑open frequency, which aligns with its larger gains.
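One way to tabulate re-open positions is sketched below. The normalization (target index divided by pool size minus one) and the three-bin split are assumptions made for illustration; the source does not specify how relative position is computed, and the events are toy data.

```python
def relative_position(target_index, pool_size):
    """Map a page's index in the pool to [0, 1]: 0 is the first page, 1 the most recent."""
    if pool_size <= 1:
        return 0.0
    return target_index / (pool_size - 1)


def bucket_counts(open_events, bins=3):
    """Count re-open events per relative-position bin (first, middle, last for bins=3)."""
    counts = [0] * bins
    for target_index, pool_size in open_events:
        rel = relative_position(target_index, pool_size)
        counts[min(int(rel * bins), bins - 1)] += 1
    return counts


# Toy events: (index of the re-opened page, pool size at that moment).
events = [(0, 10), (9, 10), (0, 5), (4, 5), (2, 5), (8, 10)]
print(bucket_counts(events))  # [2, 1, 3]
```

A bimodal pattern of the kind described would show up as high counts in the first and last bins and a low count in the middle bin.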

Regression Probe Analysis

Evaluates how well a simple regression predicts trace‑SNR and separates rescued prefixes.

We build a lightweight probe that predicts the signal‑to‑noise ratio (SNR) of a trace from cheap, hand‑crafted features and then visualises each prefix in a two‑dimensional space.

The probe treats each partial execution trace as a point whose coordinates capture how “noisy” the observation history is, enabling a simple linear separator to tell whether masking stale observations will help.

How does this probe differ from a full‑blown neural SNR estimator?

Instead of feeding hidden states through a deep network, the probe uses only inexpensive, hand‑crafted trace statistics and a linear ridge model. This makes it fast enough to evaluate millions of prefixes while still capturing the monotonic relationship between trace complexity and SNR.
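A linear ridge probe of this kind can be sketched in plain Python. This is a hedged illustration, not the authors' probe: the two features (think of them as, say, observation-token count and distinct-page count) and the target values are made up, and the closed form below also regularizes the bias term for simplicity.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ridge(features, targets, l2):
    """Closed-form ridge, w = (X^T X + l2 I)^{-1} X^T y, with a bias column appended."""
    rows = [list(f) + [1.0] for f in features]
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (l2 if i == j else 0.0) for j in range(d)]
           for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(d)]
    return solve(xtx, xty)


def predict(weights, feature_row):
    return sum(w * v for w, v in zip(weights, list(feature_row) + [1.0]))


# Toy prefixes: two hypothetical trace features and a made-up SNR target.
toy_features = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
toy_snr = [0.5, 2.5, -0.5, 1.5, 3.5]  # generated as 2*x1 - 1*x2 + 0.5
weights = fit_ridge(toy_features, toy_snr, l2=1e-9)
print([round(w, 3) for w in weights])
```

With only a handful of weights to fit, scoring a prefix is a single dot product, which is what makes a probe like this cheap to run over many prefixes.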

**Figure 5.** Trace-SNR regression probe. From left to right: BrowseComp-Plus observed/fitted gold-document signal; BrowseComp-Plus CM-rescue separability for Qwen3.5-9B+Qwen3-Emb-8B; xBench live-web proxy fitting-SNR for Qwen3.5-9B; and xBench live-web proxy fitting-SNR for DeepSeek-V4-Flash-Max. In scatter plots, green and red points denote the CM-rescued and the unchanged. A higher AUC score means a more separable prediction of whether CM rescues. xBench uses final-answer citation lines as a proxy because gold-document qrels are unavailable.

Prefixes that become correct after Observation Masking are moderately separable from unchanged prefixes using the probe's features alone.

Figure 5 reports AUC = 0.63 for Qwen3.5-9B+Qwen3-Emb-8B on BrowseComp-Plus, and 0.70 for Qwen3.5-9B and 0.71 for DeepSeek-V4-Flash-Max on xBench.
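For readers unfamiliar with the metric, AUC can be computed directly as the probability that a randomly chosen rescued prefix receives a higher probe score than a randomly chosen unchanged one. The sketch below uses made-up scores; it illustrates the metric and does not reproduce the paper's numbers.

```python
def auc(scores_positive, scores_negative):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


rescued_scores = [0.9, 0.8, 0.4]  # toy probe scores for CM-rescued prefixes
unchanged_scores = [0.5, 0.3]     # toy probe scores for unchanged prefixes
print(auc(rescued_scores, unchanged_scores))
```

An AUC of 0.5 means the scores carry no ranking information, and 1.0 means perfect separation, so values around 0.6 to 0.7 indicate a real but modest signal.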

Attention Analysis

We dissect how agents distribute attention over their interaction history.

To expose where an agent spends its attention, we record the query and key tensors of every full‑attention layer while it processes an entire browsing trajectory.

We run each trajectory with vLLM, attaching a forward hook to every full‑attention module. The hook logs the inputs to the attention sub‑module (the $Q_{\ell}$ and $K_{\ell}$ tensors) for every token in “all‑tokens” mode, so a single prefilling pass yields the complete attention map for the whole episode.

Tokens are first labeled by role (system, user, reasoning, tool call, observation) using the chat‑template delimiters. Consecutive role‑segments are merged into interaction turns: each turn consists of a reasoning span followed by its tool‑call/observation pair; the initial turn also contains the system and user prompts.
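The aggregation from logged queries and keys to per-role attention shares can be sketched as below. This is a simplified single-head version in plain Python with toy vectors: real models have many heads and layers and may apply positional transforms before the dot product, so treat the function names and shapes as assumptions.

```python
import math


def causal_attention_row(query, visible_keys):
    """Softmax over scaled dot products between one query and the keys it may attend to."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in visible_keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_share_by_role(queries, keys, roles):
    """Average, over query positions, of the attention mass landing on each token role."""
    shares = {}
    for t, query in enumerate(queries):
        weights = causal_attention_row(query, keys[: t + 1])  # causal: positions <= t
        for weight, role in zip(weights, roles):
            shares[role] = shares.get(role, 0.0) + weight
    return {role: mass / len(queries) for role, mass in shares.items()}


# Toy trace: three tokens with zero queries, so attention is uniform over visible keys.
toy_queries = [[0.0, 0.0]] * 3
toy_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_roles = ["reasoning", "observation", "reasoning"]
print(attention_share_by_role(toy_queries, toy_keys, toy_roles))
```

Each query row sums to one, so the per-role shares also sum to one and can be compared across trajectories of different lengths.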

**Figure 6.** Masking stale observations is relatively safe because the model does not attend to them extensively. Left: Attention in a single trajectory, separated into reasoning tokens (blue, left-bottom) and observation tokens (orange, upper-right). Middle: Attention-weight distributions aggregated by relative position across input contexts of different lengths. The cumulative attention share bars show the mean share at each step across the three models.

**Figure 7.** Agents reopen middle pages much less often. Relative positions of open targets in the current page pool; CM sharpens this U-shaped pattern.

Extended Results

Quantitative analysis of Observation Masking across three performance regimes.

We now quantify how Observation Masking, the context-management (CM) intervention studied here, behaves across three empirically observed regimes.

Observation Masking yields a maximum accuracy gain of +11.7 pts in the mismatch regime, where a strong retriever is paired with a moderately capable model.

Table 1 shows Qwen3.5-35B-A3B+AgentIR achieving +11.7 pts while recall climbs to 0.78.

Under the sparse BM25 retriever, gains form a near-constant low plateau: +6.3 pts (Qwen3.5-4B), +6.6 pts (Qwen3.5-9B), and +6.2 pts (GPT-OSS-20B).

When the base No-CM accuracy exceeds ≈70 %, the benefit collapses: +2.6 pts (OpenResearcher-30B-A3B), +3.7 pts (Qwen3.6-35B-A3B), +0.1 pts (GPT-OSS-120B), and a negative −1.1 pts (Tongyi-DeepResearch).
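To make the three regimes easier to compare, the gains quoted above can be averaged per regime. The numbers are the ones reported in this section; the grouping labels and the use of a simple mean are choices made here for illustration.

```python
# Accuracy gains from masking, in percentage points, as quoted in the text.
reported_gain_pts = {
    "retriever bottleneck (BM25)": [6.3, 6.6, 6.2],
    "middle (Qwen3.5-35B-A3B + AgentIR)": [11.7],
    "model saturated (No-CM accuracy above ~70%)": [2.6, 3.7, 0.1, -1.1],
}

mean_gain_pts = {
    regime: sum(gains) / len(gains) for regime, gains in reported_gain_pts.items()
}

for regime, gain in mean_gain_pts.items():
    print(f"{regime}: {gain:+.2f} pts")
```

The averages trace the inverted-U described earlier: a modest plateau under the weak retriever, a peak in the middle, and a drop toward zero once the model is saturated.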

Scale alone does not dictate the regime. Qwen3.5-35B-A3B (mismatch, +11.7 pts) and Qwen3.6-35B-A3B (saturated, +3.7 pts) share architecture and parameter count; the differing training recipe shifts them between regimes.

The regime boundary generalizes across benchmarks. GPT-OSS-120B is saturated on BrowseComp-Plus (+0.1 pts) but gains +8.0 pts on xBench and loses 4.8 pts on GAIA-text, reflecting task-specific base accuracy.

GAIA-text highlights an asymmetry: the 4B model gains only +1.9 pts despite gains of +6 to +11 pts on BrowseComp-Plus, because live-web retrieval introduces noisier context that the weak reader cannot exploit even after masking.

Case Study: Parallel Tool Use

Parallel tool calls let the agent retrieve and answer a complex query in one turn.

Qwen3.5‑9B correctly reported that William Larnach had consolidated **404 ha** of land by 1898.

Answer 404 ha derived from the 2012 archaeological publication after the model performed parallel browser searches.

This case demonstrates that a single assistant turn can orchestrate multiple browser tools to answer a multi‑facet historical query, compressing a long reasoning trace into a concise answer.

Case Study: Long Parallel Search

Observation masking lets DeepResearch‑30B‑A3B answer a complex film query correctly.

DeepResearch‑30B‑A3B with observation masking produces the exact answer “Brazuca” for the complex film query.

The answer was judged CORRECT after the model reasoned over archived tool results and verified director details.

Without CM the model would have seen the full raw tool dump, which includes noisy intermediate results; masking those observations streamlines the reasoning path and avoids distraction.

Case Study: Multi-hop Reasoning

Observation masking yields high confidence on deep search but can flip answers on BrowseComp-Plus tasks.

With Observation Masking enabled, GPT‑OSS‑120B answered the four‑hop Chinese‑history question correctly with 98 % confidence.

Model judged CORRECT, confidence 98 % (qid 38) on the xBench-DeepSearch benchmark.

Observation Masking shines when older observations become stale, as in the deep-search case, but it can backfire on tasks that require consistent citation chaining, exemplified by the BrowseComp-Plus query where enabling CM flips the answer.

Turning Observation Masking on flips the answer on the BrowseComp-Plus query: CM OFF yields a correct answer, while CM ON yields an incorrect answer.

CM OFF (qid 171) judged CORRECT after 19 rounds; CM ON judged WRONG on the same question.

Case Study: Fact Retrieval

The interview confirms the relationship ended in June 2023 with high confidence.

Brown Mauzo’s relationship with Vera ended in June 2023.

The NTV interview directly states that Brown Mauzo’s relationship with Vera ended in June 2023 and was announced in August 2023.

Although the breakup was publicly announced in August 2023, the interview clarifies that the relationship actually ended two months earlier, in June 2023.

Case Study: Context Management Failure

This section examines how Observation Masking rescues a search agent that otherwise stalls on stale context.

The central premise is that agents waste context on stale observations; masking those observations can restore efficiency, but only when the retriever and task evidence lie in a favorable regime.

In the CM‑off baseline the agent fixates on a plausible‑but‑wrong band (Nass el Ghiwane) and never recovers within its allotted context budget.

When Observation Masking (CM) is enabled, the working context is automatically archived—raw observations are replaced by placeholders—allowing the agent to keep exploring non‑English sources and eventually surface the correct Sankomota / Frank Leepa answer.

The query is drawn from BrowseComp-Plus, the fixed-corpus benchmark used for the offline experiments.

Question: “A guitarist formed a band in high school in the 1970s, died in 2003, the same year a lead singer from another band died; the band’s music was conscious and multilingual. What is the guitarist’s name?” The gold answer is Frank Leepa.

Running the same query identifier (qid 951) without masking (CM‑off) the agent was judged WRONG after 24 reasoning rounds. Turn 1 launched two parallel web searches targeting “guitarist died 2003” and “band sang in several languages”.

Turns 2–8 consisted of repeated query reformulations that kept returning 2023 deaths instead of the target 2003 deaths, causing the agent to oscillate without locking onto the correct clue.

Turns 9–23 followed the Nass el Ghiwane thread: the model gathered details about the Moroccan folk‑pop group, its 1971 formation, and its socially conscious repertoire, then mistakenly concluded its guitarist Ali Benfarha was the answer.

In Turn 24 the CM-off agent committed to Ali Benfarha and was judged WRONG. With Observation Masking enabled on the same query, raw observations were replaced by placeholders, and the agent kept exploring until it surfaced the gold answer, Frank Leepa (Sankomota), illustrating the regime-dependent benefit of masking.

Related Work and Conclusion

We situate our findings within prior work and outline their broader implications, limitations, and ethical considerations.

Long‑horizon agentic search has shifted from single‑step retrieval to multi‑turn, tool‑augmented exploration, prompting benchmarks such as BrowseComp, BrowseComp‑Plus, GAIA, xBench‑DeepSearch, HLE, and BrowseComp‑ZH that require agents to retain state across dozens of tool uses.

These agents rely on retrievers tuned for agentic reasoning, which dramatically expand the observation context and exacerbate long‑context degradation and “lost‑in‑the‑middle” effects.

Prior work on context management draws from working‑memory ideas (e.g., MemGPT) and proposes heuristics such as static truncation, heuristic eviction, and online compression to keep histories tractable.

Adaptive approaches like AgentFold and ReSum add flexibility but introduce extra computation or reduce KV‑cache reuse, motivating our minimal diagnostic setup to isolate the impact of observation dynamics.

Our analysis shows that the benefit of observation masking is non-monotonic: it peaks in a middle regime where retriever recall outpaces the model's intrinsic noise-filtering capacity. Outside this sweet spot, gains are either capped by missing evidence (weak retrievers) or erased because masking discards information that capable models could exploit (saturated models).
