SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen, Zhiqiang Zhang, Jun Zhou

SearchSwarm trains LLMs to manage context by delegating subtasks to independent sub-agent instances.

How can we enable LLMs to handle long-horizon research tasks that exceed their context windows by training a "main agent" to intelligently delegate subtasks to subagents running in independent contexts?

Large language models struggle with long-horizon research because their finite context windows fill up with raw, low-value search results before the model can synthesize a final answer. SearchSwarm addresses this by training the model to act as an orchestrator that decomposes complex tasks and dispatches sub-questions to independent sub-agent instances, which return only condensed, citation-backed reports. This delegation intelligence allows a 30B-parameter model to outperform significantly larger frontier systems on deep research benchmarks.

Paper Primer

The core mechanism is a harness-guided data-collection and fine-tuning pipeline that teaches the model to treat delegation as a form of content-aware context management. Instead of passive truncation, the model learns to write a "brief" for a sub-agent (containing the research rationale and current progress) and receives back a summarized report, effectively compressing the research trajectory into a high-density update.

SearchSwarm-30B-A3B achieves state-of-the-art performance among models of its scale.

The model scored 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, outperforming previous best-in-class lightweight models like MiroThinker-1.7-mini. The training yields a 24.7-point absolute improvement over the base model without delegation intelligence.

The model's delegation capability generalizes: even when the sub-agent tool is disabled, the model performs better than the base model, suggesting that training internalized structured problem decomposition and methodical evidence verification in the model itself.

Why is this approach superior to simply summarizing the entire history when the context window fills up?

Passive summarization is reactive and often discards information indiscriminately. SearchSwarm’s delegation is proactive: the model plans the decomposition in advance and forces sub-agents to return only evidence-grounded, citation-backed reports, keeping the main agent's context focused on high-level coordination rather than raw retrieval.

Does the model require a multi-agent system architecture to function?

No. The sub-agents are simply the same model invoked in a fresh, independent context. The "multi-agent" behavior is an emergent property of the model managing its own context through tool-based delegation.

The model's effectiveness relies on the "briefing" requirement: sub-agents are given not just a task but also the relevant research context (what has been tried, what remains uncertain), which prevents redundant exploration and aligns the sub-agent with the main research goal.

For researchers building agentic systems, this paper demonstrates that delegation intelligence can be internalized into model weights via supervised fine-tuning, turning context management from a fixed-rule heuristic into a learned, model-driven capability.

The Challenge of Long-Horizon Research

We expose the context‑window bottleneck and introduce delegation intelligence.

Large language models (LLMs) are increasingly used for complex, long‑horizon real‑world tasks, but their context windows are inherently finite, creating a hard bottleneck when the required context grows without bound.

Existing approaches react only after the window is exceeded—summarizing history or discarding older tool outputs—so they lack foresight and often discard useful information.

A more active strategy lets a main agent plan ahead, decompose a task, and dispatch bounded subtasks to independent subagents, receiving only concise results that fit within the main agent’s context budget.

The main agent must know *what* to split, *when* to split, and *how* to stitch the subagents’ answers back into its own reasoning.

How does delegation intelligence differ from a simple “split‑and‑run” strategy?

Simple splitting treats every subtask as independent and always delegates, ignoring the main agent’s remaining context. Delegation intelligence, by contrast, predicts *whether* a split is worthwhile and *how* to phrase the subtask so the subagent can produce a concise, citation‑rich answer that the main agent can absorb without exceeding its window.

Training data for this capability is scarce because natural text rarely contains explicit multi‑agent coordination, so we construct a harness that elicits high‑quality delegation behavior at inference time.

The harness records the main agent’s calls to a `call_sub_agent` tool, including task briefs and rationales. Trajectories are then filtered, keeping only those that reach a correct final answer and are free of undesirable behavior patterns; the retained traces become supervised fine-tuning data.

Fine‑tuning on this data yields SearchSwarm‑30B‑A3B, which reaches 68.1 on BrowseComp and 73.3 on BrowseComp‑ZH—state‑of‑the‑art performance among models of comparable scale.

The context‑window bottleneck in long‑horizon research drives the need for delegation intelligence.

The SearchSwarm Architecture

2 Method

SearchSwarm follows a main-distributes, sub-executes paradigm: the main agent plans and delegates bounded subtasks to independent subagents, then integrates their condensed reports. We first formalize the setting (Section 2.1), then describe the harness that elicits high-quality delegation (Section 2.2), and finally explain how its trajectories are internalized into model weights via supervised fine-tuning (Section 2.3).

2.1 Formulation

We model the deep research task as a multi‑turn interaction between an agent and a tool‑equipped environment. Given a user question q, the agent issues tool calls over multiple steps to gather information and produces an evidence‑grounded answer y. We adopt the ReAct (Yao et al., 2022) framework to organize the interaction. Each step t consists of three components:

• Thought ($\tau_t$): The agent’s internal reasoning, including analyzing available evidence, identifying information gaps, assessing the plausibility of current hypotheses, and planning the next action. $\tau_t$ serves as a compact representation of the interaction history that guides action selection.

• Action ($a_t$): A tool call executed by the agent. The action space includes standard information retrieval tools and `call_sub_agent`.

• Observation ($o_t$): The result returned by the environment after executing $a_t$.

A complete trajectory is recorded as:

$H_T = \bigl(q, (\tau_0, a_0, o_0), \ldots, (\tau_T, a_T, o_T), y\bigr)$.

At each step, thought and action are sampled from the policy:

$\tau_t, a_t \sim \pi(\cdot \mid q, H_{t-1})$. (1)

The final answer is generated from the accumulated evidence: $y = g(q, H_T)$. When evidence is incomplete or contradictory, $y$ should explicitly reflect uncertainty.

Delegation. When $a_t$ = `call_sub_agent`($b$), the agent delegates a subtask for execution. The brief $b$ contains a subtask description and relevant context extracted from the agent’s current reasoning. It triggers an independent sub-trajectory:

$H_{\text{sub}} = \bigl(b, (\tau^{s}_{0}, a^{s}_{0}, o^{s}_{0}), \ldots, (\tau^{s}_{S}, a^{s}_{S}, o^{s}_{S}), r\bigr)$, (3)

which executes in a separate context conditioned solely on $b$, with no visibility into the main agent’s history $H_{t-1}$. Upon completion, the sub-trajectory produces a report $r$, and the main agent receives:

$o_t = r$. (4)

The main agent observes only the final report; the intermediate steps of $H_{\text{sub}}$ are not visible.
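The delegation step above can be sketched as a small ReAct-style loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Context` class, the `Policy` signature, the `report` and `answer` pseudo-actions, and the scripted toy policies are all hypothetical stand-ins for a real model and real tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)


@dataclass
class Context:
    """One independent context: a seed (question q or brief b) plus its steps."""
    seed: str
    steps: List[Step] = field(default_factory=list)


# Hypothetical policy interface: context -> (thought, action name, argument).
Policy = Callable[[Context], Tuple[str, str, str]]
Tools = Dict[str, Callable[[str], str]]


def run_sub_agent(brief: str, policy: Policy, tools: Tools, max_steps: int = 8) -> str:
    """Run a sub-trajectory conditioned only on the brief and return the report r."""
    ctx = Context(seed=brief)  # fresh context: no access to the main history
    for _ in range(max_steps):
        thought, name, arg = policy(ctx)
        if name == "report":
            return arg
        ctx.steps.append((thought, f"{name}({arg})", tools[name](arg)))
    return "No report produced within the step budget."


def run_main_agent(question: str, main_policy: Policy, sub_policy: Policy,
                   tools: Tools, max_steps: int = 8) -> Tuple[str, Context]:
    """Main loop: a call_sub_agent action yields only the report as observation."""
    ctx = Context(seed=question)
    for _ in range(max_steps):
        thought, name, arg = main_policy(ctx)
        if name == "answer":
            return arg, ctx
        if name == "call_sub_agent":
            obs = run_sub_agent(arg, sub_policy, tools)  # o_t = r
        else:
            obs = tools[name](arg)
        ctx.steps.append((thought, f"{name}({arg})", obs))
    return "unresolved", ctx


# Scripted toy policies standing in for the model.
def toy_sub_policy(ctx: Context) -> Tuple[str, str, str]:
    if not ctx.steps:
        return ("look it up", "search", "capital of France")
    return ("done", "report", f"Paris [source: {ctx.steps[-1][2]}]")


def toy_main_policy(ctx: Context) -> Tuple[str, str, str]:
    if not ctx.steps:
        return ("delegate the lookup", "call_sub_agent", "Find the capital of France.")
    return ("enough evidence", "answer", ctx.steps[-1][2])


tools: Tools = {"search": lambda query: "https://example.org/france"}
answer, main_ctx = run_main_agent(
    "What is the capital of France?", toy_main_policy, toy_sub_policy, tools
)
```

After the run, `main_ctx.steps` holds a single step whose observation is the report; the subagent's search step lives only in its own discarded context.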

Delegation as context management. In long‑horizon tasks, the agent’s context grows continuously as tool calls accumulate, necessitating management strategies. Existing approaches address this through various mechanisms: discarding history beyond a threshold, retaining only the most recent few rounds of tool calls, or compressing the trajectory into a summary (Liu et al., 2025; Zeng et al., 2026; MiroMind Team, 2026).

Although our method dispatches work to sub‑agents, it involves only a single model: the sub‑agents are the same model invoked in independent, fresh contexts, not separate or additional models. When `call_sub_agent` is invoked, the next reasoning step is conditioned only on the brief b, not the full history $H_{t-1}$, retaining only the information the agent deems essential for the subtask; after execution completes, what re‑enters the main context is the report r, a compressed summary of the entire sub‑trajectory. Both the brief and the report are generated by the model, rather than determined by fixed rules. Our approach can thus be considered as single‑agent context management rather than a multi‑agent system: the only difference from prior context‑management methods is that the model manages its own context more intelligently, using the model‑generated brief and report as a content‑aware compression in place of fixed‑rule truncation or summarization. Comparisons with such methods are therefore made on equal footing.
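To make the compression argument concrete, the following back-of-the-envelope sketch compares how many tokens enter the main context when a subtask is delegated versus when the same steps are run inline. All token counts are hypothetical illustration values, not measurements from the paper, and both function names are assumptions.

```python
from typing import List


def delegated_growth(brief_tokens: int, report_tokens: int) -> int:
    """Tokens added to the main context by one call_sub_agent step (brief + report)."""
    return brief_tokens + report_tokens


def inline_growth(step_tokens: List[int]) -> int:
    """Tokens added if the main agent ran every gathering step in its own context."""
    return sum(step_tokens)


# Hypothetical numbers: a 12-step sub-trajectory at 3,000 tokens per step,
# versus a 400-token brief and an 800-token report.
sub_steps = [3000] * 12
inline = inline_growth(sub_steps)
delegated = delegated_growth(brief_tokens=400, report_tokens=800)
ratio = inline / delegated
```

Under these assumed numbers the main context grows by 1,200 tokens instead of 36,000, a 30x reduction; the actual ratio depends on how long the sub-trajectory runs and how concise the model-written brief and report are.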

2.2 Harness Design

We design a harness comprising a tool set and system prompts for the main agent and subagents that guides an LLM toward high‑quality delegation behavior. This section describes the tool interface and core design principles. Full system prompts are provided in Appendix B.

Tools. The agent is equipped with the following tools: search submits queries to a search engine and returns ranked results with titles, URLs, and snippets; visit accesses a specified URL and extracts page content; `google_scholar` retrieves academic literature; python provides a code execution environment for numerical computation and data processing. These form the base information‑retrieval capabilities. On top of them, we introduce `call_sub_agent` as the core delegation tool: the main agent submits a brief, and the subagent executes in an independent context and returns a report. Subagents are equipped with the same standard tools but do not have access to `call_sub_agent`, limiting delegation to a single level.
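The single-level delegation constraint can be expressed as a role-dependent tool set. The tool names below come from the text; the `tools_for` and `is_allowed` helpers and the role labels are hypothetical, offered only as a sketch of how such a restriction might be enforced.

```python
from typing import Tuple

# Standard tools named in the text, available to both roles.
BASE_TOOLS: Tuple[str, ...] = ("search", "visit", "google_scholar", "python")


def tools_for(role: str) -> Tuple[str, ...]:
    """Return the tool names exposed to a role ("main" or "sub")."""
    if role == "main":
        return BASE_TOOLS + ("call_sub_agent",)
    if role == "sub":
        return BASE_TOOLS  # no call_sub_agent: delegation stays one level deep
    raise ValueError(f"unknown role: {role!r}")


def is_allowed(role: str, tool: str) -> bool:
    """Check whether a role may invoke a given tool."""
    return tool in tools_for(role)
```

For example, `is_allowed("main", "call_sub_agent")` is true while `is_allowed("sub", "call_sub_agent")` is false, so a subagent cannot spawn further subagents.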

Encouraging delegation. Because the main agent’s context is finite, every token it spends on raw search and visit outputs competes directly with the planning, verification, and synthesis that only it can perform. Multi‑step information gathering is precisely the kind of work that is token‑expensive but cognitively shallow: it can take many turns to surface a single fact. The harness therefore directs the main agent to hand such gathering to subagents, which pay the exploration cost in their own contexts and return only a condensed result, keeping the main agent’s limited attention on high‑level coordination. The main agent gathers information itself only when a subtask is shallow enough that the overhead of delegating would outweigh the context it saves.

Comprehensive briefing. Subagents start in a fresh context with no knowledge of prior investigation progress. The brief is the sole channel through which a subagent receives context, and its quality directly determines subagent effectiveness. When a brief contains only a simple task instruction, subagents tend to search aimlessly or re‑investigate facts the main agent has already confirmed, producing results that fail to advance the overall investigation. We therefore require the main agent to write each brief as if addressing a new collaborator joining the investigation: beyond the subtask description, the brief includes why this subtask matters to the overall question, what has been established so far, what remains uncertain, and which directions have been tried or ruled out. This aligns the subagent with the main agent’s research progress, ensuring its output contributes maximally to the overall investigation.
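One way to picture the briefing requirement is as a structured record with one field per element listed above. In the paper the brief is free text written by the model; the `Brief` class, its field names, and the rendering format here are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Brief:
    """Hypothetical structured view of a brief handed to a subagent."""
    subtask: str                      # what the subagent should do
    rationale: str                    # why the subtask matters to the overall question
    established: List[str] = field(default_factory=list)        # confirmed so far
    uncertain: List[str] = field(default_factory=list)          # still open
    tried_or_ruled_out: List[str] = field(default_factory=list)  # dead ends

    def empty_sections(self) -> List[str]:
        """Names of sections left empty, e.g. to flag a bare task instruction."""
        sections = {
            "subtask": self.subtask,
            "rationale": self.rationale,
            "established": self.established,
            "uncertain": self.uncertain,
            "tried_or_ruled_out": self.tried_or_ruled_out,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Serialize the brief as plain text for the subagent's fresh context."""
        def bullets(items: List[str]) -> str:
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        return (
            f"Subtask: {self.subtask}\n"
            f"Why it matters: {self.rationale}\n"
            f"Established so far:\n{bullets(self.established)}\n"
            f"Still uncertain:\n{bullets(self.uncertain)}\n"
            f"Tried or ruled out:\n{bullets(self.tried_or_ruled_out)}"
        )


bare = Brief(subtask="Find the architect of the bridge.", rationale="")
full = Brief(
    subtask="Find the architect of the bridge.",
    rationale="The architect's birthplace is the final answer.",
    established=["The bridge opened in 1932."],
    uncertain=["Whether the designer and the chief engineer are the same person."],
    tried_or_ruled_out=["The city archive page lists no designer."],
)
```

Here `bare.empty_sections()` flags everything except the subtask, which corresponds to the "simple task instruction" failure mode described above. An empty section is not always an error: early in an investigation nothing may have been ruled out yet.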

Main agent retains core judgment. The main agent is the only entity with a complete view across all subtasks, and only it can judge whether a subagent’s findings are consistent with other findings and the overall evidence landscape. If subagent reports are trusted without scrutiny, errors propagate and accumulate, undermining the coherence of the overall reasoning. The harness therefore requires subagents to focus on gathering evidence and testing specific hypotheses, while all directional decisions are made independently by the main agent, including which hypothesis to pursue, when to terminate the investigation, and how to adjudicate between conflicting reports.

Citation‑grounded reporting. Under the delegation architecture, the main agent cannot observe a subagent’s intermediate execution. If a subagent’s report does not cite its sources, the main agent cannot distinguish well‑supported conclusions from hallucinations or misinterpretations. We therefore require subagent reports to attach inline citations to every important conclusion, pointing to specific source URLs, enabling the main agent to verify the reliability of reported findings. The main agent’s final response likewise includes inline citations, providing end‑to‑end traceability from sources to conclusions for the user.
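A simple consumer-side check for the citation requirement might look like the sketch below. It assumes, purely for illustration, that each non-empty line of a report is one conclusion and that a citation is any inline `http(s)` URL; the paper does not specify a report format, and the `uncited_claims` helper is hypothetical.

```python
import re
from typing import List

# Assumption: an inline citation is any http(s) URL appearing on the claim's line.
URL_RE = re.compile(r"https?://\S+")


def uncited_claims(report: str) -> List[str]:
    """Return the report lines that carry no inline source URL."""
    claims = [line.strip() for line in report.splitlines() if line.strip()]
    return [claim for claim in claims if not URL_RE.search(claim)]


report = (
    "- The bridge opened in 1932 [https://example.org/bridge-history].\n"
    "- It was designed by a local engineer.\n"
)
flagged = uncited_claims(report)
```

In this example the second line is flagged because it has no source. A check like this only detects missing citations; confirming that a cited page actually supports the claim still requires the main agent to visit it.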

2.3 Supervised Fine‑tuning

Data Collection. To train a model that can both delegate effectively and execute delegated tasks, we require trajectories exhibiting both behaviors. We source queries from the open‑source RedSearcher (Chu et al., 2026) and OpenSeeker (Du et al., 2026) datasets. The model executes deep research tasks on these queries under harness guidance, and we record the complete execution trajectories, including thinking, tool calls, and environment returns, as training data. We use two configurations for trajectory collection. In the first, a single model serves as both main agent and subagent, and trajectories from both roles are retained. In the second, a stronger model serves as the main agent paired with a weaker subagent, and only main‑agent trajectories are retained.

The rationale for the second configuration is that less reliable subagent results force the main agent to exercise tighter control over the research mainline, producing trajectories with more deliberate task decomposition and more rigorous result verification. Data from both configurations are mixed to form the training set. The main‑agent context window is set to 128 K tokens and the subagent to 64 K. When a trajectory approaches the context limit, the model is prompted to produce a final answer immediately. We retain these forced‑answer trajectories rather than discarding them, so that the model learns to deliver high‑quality responses when the same forced‑answer mechanism is triggered at test time.
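The forced-answer mechanism can be sketched as a budget check run before each step. The 128 K and 64 K limits come from the text; reading "K" as 1,000 tokens, the `reserve` margin that defines "approaching the limit", and the function name are assumptions.

```python
# Context limits from the text; "K" is read as 1,000 tokens (an assumption).
CONTEXT_LIMITS = {"main": 128_000, "sub": 64_000}


def must_force_answer(role: str, used_tokens: int, reserve: int = 4_000) -> bool:
    """True when the context is within `reserve` tokens of the role's limit.

    `reserve` is a hypothetical margin left for writing the final answer.
    """
    return used_tokens >= CONTEXT_LIMITS[role] - reserve
```

For instance, with the default margin a main agent at 125,000 tokens would be prompted to answer immediately, while a subagent at 50,000 tokens would continue.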

Filtering. We retain only main‑agent trajectories with correct final answers. Subagent trajectories are kept only when the corresponding main trajectory is correct, and overly short subagent trajectories are downsampled. Samples containing undesirable behavior patterns are removed, including repeated identical tool calls, hallucinated citations to nonexistent sources, and tool misuse such as web‑access attempts through the python interpreter.
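The filtering rules above can be sketched as follows. The `Trajectory` record and its fields are hypothetical, and the detection heuristics are stand-ins: "repeated identical tool calls" is read as consecutive duplicates, a hallucinated citation as a cited URL the trajectory never visited, and python web access as code mentioning `requests` or `urllib`. The length threshold and keep probability for downsampling are also assumed values.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    """Hypothetical record of one recorded trajectory."""
    traj_id: str
    role: str                              # "main" or "sub"
    tool_calls: List[Tuple[str, str]]      # (tool name, argument)
    correct: bool = False                  # final-answer correctness (main only)
    parent_id: Optional[str] = None        # main trajectory that spawned a sub one
    cited_urls: Tuple[str, ...] = ()
    visited_urls: Tuple[str, ...] = ()


def has_bad_pattern(t: Trajectory) -> bool:
    """Heuristic stand-ins for the undesirable behaviors named in the text."""
    repeated = any(a == b for a, b in zip(t.tool_calls, t.tool_calls[1:]))
    hallucinated = any(url not in t.visited_urls for url in t.cited_urls)
    python_web = any(
        name == "python" and ("requests" in arg or "urllib" in arg)
        for name, arg in t.tool_calls
    )
    return repeated or hallucinated or python_web


def filter_trajectories(trajs: List[Trajectory], min_sub_calls: int = 3,
                        short_keep_prob: float = 0.3, seed: int = 0) -> List[Trajectory]:
    """Keep correct main trajectories and subagent trajectories of correct parents."""
    rng = random.Random(seed)
    correct_main = {t.traj_id for t in trajs if t.role == "main" and t.correct}
    kept = []
    for t in trajs:
        if has_bad_pattern(t):
            continue
        if t.role == "main":
            if t.traj_id in correct_main:
                kept.append(t)
        elif t.parent_id in correct_main:
            # Downsample overly short subagent trajectories.
            if len(t.tool_calls) < min_sub_calls and rng.random() >= short_keep_prob:
                continue
            kept.append(t)
    return kept


calls = [("search", "x"), ("visit", "u"), ("search", "y")]
data = [
    Trajectory("m1", "main", [("call_sub_agent", "b1"), ("visit", "u")], correct=True),
    Trajectory("m2", "main", [("call_sub_agent", "b2")], correct=False),
    Trajectory("s1", "sub", calls, parent_id="m1"),
    Trajectory("s2", "sub", calls, parent_id="m2"),
    Trajectory("s3", "sub", [("search", "x"), ("search", "x"), ("visit", "u")], parent_id="m1"),
]
kept_ids = [t.traj_id for t in filter_trajectories(data)]
```

In this toy set only `m1` and `s1` survive: `m2` is incorrect, `s2` belongs to an incorrect parent, and `s3` repeats an identical call.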

Training Objective. Let a trajectory $\tau = (a_1, o_1, a_2, o_2, \ldots, a_T, o_T)$ consist of alternating model outputs $a_t$ (thinking and tool calls) and environment returns $o_t$ (tool results, including subagent reports). We fine-tune the base model via next-token prediction with environment masking:

$\mathcal{L} = -\sum_{t=1}^{T} \sum_{j=1}^{|a_t|} \log p_\theta\bigl(a_t^{(j)} \mid a_t^{(<j)}, \tau_{<t}\bigr)$, (5)

where $a_t^{(j)}$ is the $j$-th token of the model output at step $t$, and $\tau_{<t} = (a_1, o_1, \ldots, a_{t-1}, o_{t-1})$ is the preceding context. The loss is computed only over model outputs $a_t$; all environment returns $o_t$ are masked. This applies uniformly to both main-agent and subagent trajectories, training the model to produce appropriate reasoning and tool invocations given the observed context without memorizing environment content.
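Environment masking in the objective (Equation 5) amounts to a per-token loss mask. The sketch below uses plain Python with precomputed per-token log-probabilities rather than a real model; the `Segment` representation and both function names are assumptions, and a training framework would normally apply the same mask to its cross-entropy loss.

```python
from typing import List, Tuple

# A trajectory as alternating segments: ("model", token ids) or ("env", token ids).
Segment = Tuple[str, List[int]]


def build_loss_mask(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """Flatten segments into tokens and a mask: 1 for model outputs, 0 for env returns."""
    tokens: List[int] = []
    mask: List[int] = []
    for source, ids in segments:
        tokens.extend(ids)
        mask.extend([1 if source == "model" else 0] * len(ids))
    return tokens, mask


def masked_nll(token_logprobs: List[float], mask: List[int]) -> float:
    """Negative log-likelihood summed over unmasked (model-output) tokens only."""
    return -sum(lp for lp, m in zip(token_logprobs, mask) if m)


segments = [
    ("model", [11, 12]),      # a_1: thinking plus tool call
    ("env", [21, 22, 23]),    # o_1: tool result, masked out
    ("model", [31]),          # a_2
]
tokens, mask = build_loss_mask(segments)
logprobs = [-0.5, -0.25, -9.0, -9.0, -9.0, -1.0]  # hypothetical log p_theta per token
loss = masked_nll(logprobs, mask)
```

The environment tokens still appear in `tokens`, so they condition later predictions, but their (here deliberately large) log-probability terms contribute nothing to `loss`.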

Empirical Evaluation

SearchSwarm outperforms peers at the same scale and rivals much larger models.

SearchSwarm delivers state‑of‑the‑art results among all 30B‑A3B‑scale models while staying competitive with systems more than ten times larger.

Table 1 shows SearchSwarm scoring 68.1 (BrowseComp), 73.3 (BrowseComp‑ZH), 82.5 (GAIA), and 80.8 (xbench‑DeepSearch‑2505), surpassing every lightweight peer and matching or exceeding larger closed‑source baselines.

BrowseComp measures a model’s ability to locate, retrieve, and synthesize information from the web to answer a multi‑step research question.

How does BrowseComp differ from a standard retrieval‑augmented generation benchmark?

BrowseComp requires the model to orchestrate multiple search‑and‑cite cycles within a single answer, not just retrieve a single passage and generate.

GAIA evaluates general-purpose assistant ability on real-world questions that combine retrieval, tool use such as calculation, and multi-step reasoning.

Why does GAIA stress context length more than BrowseComp?

GAIA interleaves retrieval with multi‑step calculations, so the model must retain intermediate numeric values while still accessing new evidence.

The harness ablation (Section 3.3) shows that merely adding the `call_sub_agent` tool yields a modest gain (+2.3 pts), whereas the full harness, which combines delegation prompts, citation grounding, and reporting conventions, adds +10 pts.

Training Qwen3-30B-A3B-Thinking on the same delegation data (Section 3.4) reaches 66.5 pts on BrowseComp, indicating that the data can endow a different base model with strong deep-research ability.

When the `call_sub_agent` tool is disabled (Section 3.5), SearchSwarm still outperforms the unmodified base model by 8.5 pts on BrowseComp, indicating that the learned decomposition skills transfer to a single-agent regime.

Open-ended benchmarks (Section 3.6) reveal a 14.2-point average lift over the base model, with especially large gains on ScholarQA-v2 (+32.7), where thorough multi-source synthesis is required.

Tool-usage analysis (Section 3.7) shows the main agent invokes `call_sub_agent` in more than 70% of steps on BrowseComp, while sub-agents dominate search calls (up to 76%). This suggests that the model has internalized the intended delegation pattern.

**Figure 1.** Performance comparison of SearchSwarm against lightweight models of comparable scale and larger closed-source/open-source models on four benchmarks. SearchSwarm achieves the best results among all models at the same scale and remains competitive with models over 10× larger.

SearchSwarm’s delegation‑driven training lets a 30B‑parameter model rival systems that are ten times larger.

Contextualizing Delegation

SearchSwarm extends LLMs by delegating bounded subtasks to subagents, preserving context.

Delegation Intelligence treats the LLM context window as a limited cognitive resource, mirroring how humans offload work when a task exceeds individual capacity. Prior multi-agent systems, such as those of Anthropic (2025a), Agent Swarm (Kimi Team, 2026), Huang et al. (2026a), and Ruan et al. (2026), propose high-level architectures but stop short of delivering a complete harness, data pipeline, and training recipe. Our contribution fills that gap by openly releasing the full delegation pipeline and model weights.

Recent LLM agents such as Claude 4.7, GPT 5.5, Gemini 3.1, DeepSeek V4, Qwen 3.7, GLM 5.1, Kimi 2.6, and Ring 2.6 demonstrate tool use and multi‑turn reasoning. As tasks grow and demand unbounded context, delegating subtasks to subagents offers a principled way to preserve the main agent’s context budget. Our work is among the first open‑source systems to operationalize this delegation at scale.

Because a model’s parameters are a lossy, static snapshot of world knowledge, accessing up‑to‑date information requires external search. Existing search agents—Tongyi DeepResearch, RedSearcher, MiroThinker, and OpenSeeker—focus on query formulation and tool design, but they expose raw tool outputs to the main agent. By treating subagents as callable tools that return concise summaries, our approach shields the main agent’s context while enabling iterative retrieval.

Behavioral Analysis and Tool Usage

Behavioral ablations expose how delegation components shape tool use and answer quality.

This appendix quantifies the behavioral consequences of the delegation mechanisms introduced in the main paper. By ablating specific components we observe how tool‑selection patterns and the number of sub‑agent invocations correlate with answer correctness.

**Figure 3.** Tool usage distribution on four benchmarks. (a) The main agent delegates extensively via `call_sub_agent`; its direct tool use is dominated by visit for verification. (b) Subagents focus on search and visit for information gathering.

**Figure 4.** Distribution of `call_sub_agent` invocation counts per question.

Main Agent System Prompt

Full system prompts and accuracy tables for the main agent and subagents.

Appendix B lists the accuracy of the main agent and its subagents across call counts and turn numbers for several benchmarks. Accuracy rises with more calls, peaking at 82.5 % on GAIA.

Operating principles for the main agent (excerpt)

**Table 3.** Main agent system prompt with tool definitions. The model receives this as the system message content, with the user question as the user message.

The table compares two joint ventures: INLink JV (Inland Rail P2N) and FHHMJV (Coomera Connector Stage 1 Central). It lists the Lead Company, Other Partner(s), Contract Value, Early Works Award, Main Contract Award, Groundbreaking, and Project Status for each.

Subagent System Prompt

Provides the prompt used for subagents in the system.
