Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou

Arbor organizes autonomous research as a persistent hypothesis tree, turning iterative experiments into cumulative, verified progress.

How can an autonomous agent effectively manage the iterative, non-linear process of scientific research through structured hypothesis exploration?

Autonomous research agents often treat each experiment as an isolated attempt, losing the structure of the research process as they cycle through trials. This leads to fragmented progress where agents fail to learn from past failures or build upon successful insights. Arbor addresses this by maintaining a persistent hypothesis tree that binds hypotheses, artifacts, and distilled insights across time. A long-lived coordinator manages this tree, while short-lived executors test individual branches in isolated worktrees. On six real-world research tasks, Arbor achieves more than 2.5× the average relative held-out gain of standard coding agents, consistently outperforming them by organizing exploration into a cumulative, auditable process.

Paper Primer

Arbor functions as a two-level research system: a persistent coordinator owns the global research state, while short-lived executors perform grounded engineering work. The core mechanism is Hypothesis Tree Refinement (HTR): a tree structure where each node links a hypothesis to its implementation, experimental evidence, and a distilled insight that informs future search.

Arbor delivers superior held-out performance across diverse research domains.

In six real-world research tasks (model training, harness engineering, and data synthesis), Arbor achieved the best held-out result in every instance. It attained more than 2.5× the average relative held-out gain compared to Codex and Claude Code.

The hypothesis tree structure is the primary driver of performance gains.

Ablation studies on MLE-Bench Lite show that removing the tree structure or disabling insight feedback significantly degrades medal rates, even while maintaining valid submission rates. The full Arbor system achieved an 81.82% "Any Medal" rate, compared to 63.64% without the tree structure.

Why does Arbor use a tree structure instead of just keeping a long history of past attempts?

A tree preserves the branching nature of research, allowing the agent to maintain competing hypotheses simultaneously. It also enables the system to propagate distilled insights upward, turning local experimental outcomes into direction-level lessons that constrain future exploration.

How does Arbor prevent the agent from simply "gaming" the development evaluator?

Arbor enforces a held-out merge gate: while development feedback guides the search, a candidate artifact is only promoted to the "current best" if it demonstrates improvement on a separate, held-out test evaluator.

The Challenge of Autonomous Research

Arbor reframes autonomous research from linear trial loops to a persistent hypothesis tree.

Current autonomous agents treat research as a sequence of isolated attempts, which discards the knowledge gained from earlier trials. Without a mechanism to retain and reuse this information, long‑horizon progress stalls despite extended execution time.

Researchers need a cumulative research state that preserves hypotheses, artifacts, and evidence, rather than discarding each trial after it finishes.

**Figure 1.** Arbor at a glance. (a) Hypothesis tree and (b) development score from one Math-Reasoning Data Synthesis run. (c) Normalized held-out gains across all tasks.

The key shift is moving from linear agentic loops to a tree‑based research state that accumulates and reuses knowledge.

Prior Approaches to Autonomous Research

Survey of prior autonomous research systems and their design choices.

Prior work on autonomous research agents spans idea generation, code search, and hierarchical hypothesis management. These systems differ in what they search, how they organize state, and which roles they expose.

The AI Scientist is an LLM‑driven pipeline that cycles through idea generation, implementation, execution, result interpretation, and paper drafting.

Claude Code is an agentic coding tool that turns natural‑language specifications into executable programs, serving as the implementation engine for many research agents.

Arbor’s hypothesis‑tree design unifies these trends by externalizing research state, enabling systematic backtracking, evidence preservation, and auditable merge decisions.

The Arbor Framework and HTR

Arbor turns autonomous research into a structured hypothesis tree that branches, refines, and merges ideas.

Autonomous research must turn many fleeting trials into lasting progress. The three design requirements—branching with coherence, a global strategy that does not drown in low‑level traces, and a held‑out gate that admits only transferable improvements—capture the core tension.

The AO interface treats a research task as a triplet $(\mathcal{M}_0,\mathcal{O},\mathcal{E})$—an initial artifact, an objective, and evaluators—so the agent must propose hypotheses, materialize them, and use feedback to steer future proposals.

How does AO differ from a standard tool‑use prompt?

In AO the agent must manage a sequence of hypotheses and evaluate them against two distinct evaluators (development and held‑out test), whereas a normal tool call only produces one immediate result without a persistent research state.

Arbor externalizes the research state into a persistent hypothesis tree, letting a long‑lived coordinator orchestrate high‑level strategy while short‑lived executors perform isolated implementations.

Why is a persistent tree better than a flat log of experiments?

A tree preserves the hierarchical relationships between hypotheses, so the coordinator can propagate insights upward and prune entire sub‑directions rather than discarding isolated trials.

HTR treats research as a mutable tree: each node bundles a hypothesis, its distilled insight, and a reference to the concrete artifact that implements it.

How does HTR keep the tree from degenerating into a raw log of tool calls?

By abstracting each executor’s raw output into a distilled insight $\iota_n$ and propagating that insight upward, the tree stores only the semantic lessons, not the full execution trace.

A toy run with budget $B=5$ illustrates one refinement cycle. Iteration 1: the coordinator observes the root $n_0$ and ideates two children, $n_1$ (“add 16 units”) and $n_2$ (“add 32 units”). Both are pending.

Iteration 2: Select $n_1$ and $n_2$, dispatch executors. Executor for $n_1$ yields development score $s_1=0.62$, insight $\iota_1$ = “wider net improves training loss”. Executor for $n_2$ yields $s_2=0.58$, insight $\iota_2$ = “larger width overfits early”.

Back‑propagation writes $(s_1,\iota_1)$ into $n_1$ and $(s_2,\iota_2)$ into $n_2$, then updates the root insight to $\iota_0$ = “moderate widening helps, excessive widening harms”.

Decide step picks $n_1$ (higher $s$) and evaluates it on $\mathcal{E}_{\text{test}}$; it is merged because its held‑out score (0.62 in this toy example, taken equal to its development score) exceeds the current best of 0.60.

Budget $B$ is reduced by 2 (two executor calls); remaining budget $=3$.

Even with a tiny budget, the tree captures both quantitative scores and qualitative lessons, enabling the coordinator to prune the over‑wide branch without re‑running it.
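The toy run above can be replayed in a few lines of Python. This is a minimal sketch, not Arbor's implementation: the `Node` class, the `abstract` helper, and the `backpropagate` function are hypothetical names introduced for illustration. In Arbor the abstraction step is performed by the coordinator model; here a deterministic stand-in simply joins the child insights so the upward flow is visible.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical tree node: a hypothesis plus its distilled insight."""
    hypothesis: str
    insight: str = ""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> "Node":
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child


def abstract(child_insights: List[str]) -> str:
    """Stand-in for the model-driven ABSTRACT step (assumption):
    joins the non-empty child insights into one summary string."""
    return "; ".join(i for i in child_insights if i)


def backpropagate(leaf: Node, insight: str) -> None:
    """Store a distilled insight on the leaf, then refresh every ancestor."""
    leaf.insight = insight
    node = leaf.parent
    while node is not None:
        node.insight = abstract([c.insight for c in node.children])
        node = node.parent


root = Node("baseline")
narrow = root.add_child("add 16 units")
wide = root.add_child("add 32 units")
backpropagate(narrow, "wider net improves training loss")
backpropagate(wide, "larger width overfits early")
print(root.insight)
```

The root ends up holding both lessons in one string. A model-backed `abstract` would instead compress them into a direction-level statement such as the $\iota_0$ shown in the example.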

Observe: project the current tree into a structured view of pending leaves, recent scores, and accumulated insights.

Ideate: pick a parent node and generate $k$ new pending child hypotheses conditioned on ancestor insights.

Select: choose a subset of pending leaves to evaluate, balancing expected utility and diversity.

Dispatch: launch parallel executors for each selected leaf, each working in an isolated worktree.

Back‑propagate: write returned scores and distilled insights into leaf nodes and update all ancestors’ insights.

Decide: run the held‑out merge gate; if a leaf improves $\mathcal{O}(\mathcal{E}_{\text{test}})$, merge its artifact into $\mathcal{M}_{\text{best}}$ and prune falsified subtrees.

**Figure 2.** Overall framework of Arbor. A persistent coordinator maintains the research state as a hypothesis tree, iteratively exploring ideas, dispatching executors to implement them, and using evaluation feedback to refine the tree and update the current best artifact.

**Algorithm 1: Hypothesis Tree Refinement (HTR).** Coordinator owns $Tree$; Executor owns one worktree.

**Input**: $\mathcal{P} = (\mathcal{M}_0, \mathcal{O}, \mathcal{E}_{\text{dev}}, \mathcal{E}_{\text{test}})$, budget $B$, branching $k$

**Output**: best artifact $\mathcal{M}^\star$ and hypothesis tree $Tree$

1. init $Tree = (\{n_0\}, \emptyset)$, $b_{n_0} \leftarrow \mathcal{M}_0$, $\mathcal{M}_{\text{best}} \leftarrow \mathcal{M}_0$
2. **while** $B$ left $\land$ pending leaves exist **do**
3. $\quad \mathcal{V} \leftarrow \text{OBSERVE}(Tree, \mathcal{M}_{\text{best}})$ $\quad$ // Observe: shape, root insight, pruned/validated lessons
4. $\quad p \leftarrow \text{choose parent under } \mathcal{V}$; attach $k$ pending children $\{n^{(i)} : h^{(i)}\} \leftarrow \text{IDEATE}(p, \mathcal{V})$ $\quad$ // Ideate
5. $\quad L \leftarrow$ pending leaves under $\text{SELECT}(\mathcal{V})$ $\quad$ // Select: frontier control
6. $\quad \{(s_n, r_n, \iota_n, b_n)\}_{n \in L} \leftarrow \text{parallel Executor}(h_n, \iota_{\text{anc}(n)}, \mathcal{M}_{\text{best}})$ $\quad$ // Dispatch
7. $\quad$ **foreach** $n \in L, a \in \text{path}(n_0 \rightarrow n)$ **do**
8. $\quad \quad$ write back $(s_n, r_n, \iota_n, b_n)$; $\iota_a \leftarrow \text{ABSTRACT}(\{\iota_c\}_{c \in \text{ch}(a)})$ $\quad$ // Backpropagate
9. $\quad$ **end foreach**
10. $\quad n^\dagger \leftarrow \arg \max_{n \in L} s_n$ $\quad$ // Decide: held-out merge gate, then prune
11. $\quad$ **if** $\mathcal{O}(\mathcal{E}_{\text{test}}(b_{n^\dagger})) > \mathcal{O}(\mathcal{E}_{\text{test}}(\mathcal{M}_{\text{best}}))$ **then** $\mathcal{M}_{\text{best}} \leftarrow \text{merge}(b_{n^\dagger})$
12. $\quad$ prune subtrees falsified by $\{\iota_n\}_{n \in L}$; persist $Tree$
13. **end while**
14. **return** $\mathcal{M}^\star \leftarrow \mathcal{M}_{\text{best}}, Tree$
15. **Procedure** $\text{Executor}(h_n, \iota_{\text{anc}(n)}, \mathcal{M}_{\text{best}})$:
16. $\quad$ fresh worktree $W_n \leftarrow \mathcal{M}_{\text{best}}$
17. $\quad$ **repeat**
18. $\quad \quad \Delta \leftarrow \text{Implement}(h_n, \iota_{\text{anc}(n)}, W_n)$; $(s_n, r_n) \leftarrow \mathcal{E}_{\text{dev}}(\text{apply}(\Delta, W_n))$ $\quad$ // repair $\Delta$ only; $h_n$ is fixed
19. $\quad$ **until** run ok $\land$ $h_n$-path exercised, or cap reached
20. $\quad$ **return** $(s_n, r_n, \text{DISTILL}(h_n, \Delta, r_n), \text{commit}(W_n))$
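The control flow of Algorithm 1 can be sketched as a short Python loop. This is an illustrative toy under stated assumptions, not the authors' code: the artifact is a single number, hypotheses are step sizes written as strings, executors run sequentially instead of in parallel, and pruning is omitted. The names `htr`, `ideate`, `execute`, and `eval_test` are hypothetical stand-ins for the model-driven components.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    hypothesis: str
    parent: Optional["Node"] = None
    status: str = "pending"
    score: Optional[float] = None      # development score s_n
    insight: str = ""                  # distilled insight iota_n
    artifact: Optional[float] = None   # toy stand-in for branch b_n
    children: List["Node"] = field(default_factory=list)


def pending_leaves(node: Node) -> List[Node]:
    """Return childless nodes that are still pending, left to right."""
    if not node.children:
        return [node] if node.status == "pending" else []
    return [leaf for child in node.children for leaf in pending_leaves(child)]


def htr(m0, ideate, execute, eval_test, budget: int, k: int):
    root = Node("root", status="done", artifact=m0)
    best, parent = m0, root
    while budget > 0:
        # Ideate: attach k pending children under the chosen parent.
        for h in ideate(parent, k):
            parent.children.append(Node(h, parent=parent))
        # Select: take as many pending leaves as the budget allows.
        leaves = pending_leaves(root)[:budget]
        if not leaves:
            break
        # Dispatch: one executor call per leaf, each starting from `best`.
        for leaf in leaves:
            leaf.score, leaf.insight, leaf.artifact = execute(leaf.hypothesis, best)
            leaf.status = "done"
            budget -= 1
        # Back-propagate: refresh the insight of every ancestor.
        for leaf in leaves:
            anc = leaf.parent
            while anc is not None:
                anc.insight = "; ".join(c.insight for c in anc.children if c.insight)
                anc = anc.parent
        # Decide: the held-out gate admits only a strict test improvement.
        top = max(leaves, key=lambda n: n.score)
        if eval_test(top.artifact) > eval_test(best):
            best, parent = top.artifact, top
    return best, root


def ideate(parent, k):
    return ["+1", "-1"][:k]            # toy hypotheses: step sizes

def execute(hypothesis, base):
    candidate = base + float(hypothesis)
    dev = -(candidate - 3.0) ** 2      # development optimum at 3.0
    return dev, f"{hypothesis} from {base} gives dev {dev}", candidate

def eval_test(artifact):
    return -(artifact - 2.5) ** 2      # held-out optimum at 2.5

best, tree = htr(0.0, ideate, execute, eval_test, budget=6, k=2)
print(best)  # 2.0
```

In the final cycle the development evaluator prefers the candidate 3.0, but its held-out score only ties the current best, so the gate leaves 2.0 in place. That is the behavior the merge gate is meant to provide.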

By maintaining a durable hypothesis tree and a clear coordinator‑executor contract, Arbor converts noisy trial‑and‑error into a disciplined, evidence‑driven research process.

Experimental Task Suite

Arbor is benchmarked on a diverse suite of six autonomous‑research tasks.

The AO Task Suite assembles six real‑world research tasks, each with an initial material $\mathcal{M}_0$, a natural‑language objective $\mathcal{O}$, development and test evaluators, and a task‑native metric (Table 1).

Arbor is evaluated on six AO tasks spanning three research domains.

Table 1 lists the six tasks: Optimizer Design, Architecture Design, Terminal‑Bench 2.0, BrowseComp, Search‑Agent Data Synthesis, and Math‑Reasoning Data Synthesis.

The experimental setup uses two benchmark families: the AO Task Suite described above and MLE‑Bench Lite, a long‑horizon ML‑engineering benchmark derived from MLE‑Bench.

For the AO tasks we compare Arbor against two strong coding‑agent baselines: Codex (GPT‑5.5) and Claude Code (Claude Opus 4.6). Both receive identical initial material, objectives, evaluators, and the 48‑hour budget.

MLE‑Bench Lite results are taken from the official leaderboard, which includes systems such as AIDE, ML‑Master 2.0, AIRA‑dojo, InternAgent, R&D‑Agent, Famou‑Agent 2.0, MARS, Leeroo, AIBuildAI, LoongFlow, and AI‑Scientist‑style agents.

Metrics are reported natively per task (direction indicated in Table 1). For cross‑task averages we compute a normalized held‑out improvement $\Delta_{\text{test}}$ that orients all metrics so larger is better, using absolute change for percentage metrics and relative improvement for steps/loss.
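The normalization rule can be made concrete with a small helper. The source does not give the exact formula, so the function below is a sketch under assumptions: the name `normalized_gain`, the two metric kinds, and the choice to express relative improvement in percent are all introduced here for illustration.

```python
def normalized_gain(initial: float, final: float, kind: str) -> float:
    """Orient a held-out change so that larger is always better.

    kind="percentage": higher-is-better metric already in percent;
        returns the absolute change in points.
    kind="cost": lower-is-better metric such as steps or loss;
        returns the relative reduction in percent.
    """
    if kind == "percentage":
        return final - initial
    if kind == "cost":
        if initial == 0:
            raise ValueError("relative improvement is undefined for initial == 0")
        return (initial - final) / abs(initial) * 100.0
    raise ValueError(f"unknown metric kind: {kind}")


# BrowseComp held-out accuracy reported in the text: 45.33% -> 67.67%.
print(round(normalized_gain(45.33, 67.67, "percentage"), 2))  # 22.34
# Hypothetical loss that drops from 2.0 to 1.5 (illustrative numbers only).
print(round(normalized_gain(2.0, 1.5, "cost"), 2))            # 25.0
```

Flipping the sign for lower-is-better metrics is what lets accuracy-style and loss-style tasks be averaged on one axis.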

Implementation details: both the Coordinator and Executors use Claude Opus 4.6 as the backbone model. All runs (Arbor, Codex, Claude Code) are launched with a 48‑hour wall‑clock limit. Arbor’s default budget is 20 Coordinator cycles with a maximum tree depth of 2; Executor parallelism follows available evaluator resources.

The diversity of tasks validates Arbor’s generality across model training, harness engineering, and data synthesis.

Main Empirical Results

Arbor delivers consistent gains, achieving the best held‑out result on every real research task.

Arbor outperforms all baselines on the Data Synthesis test metric, achieving a 5.92 % improvement over the initial material.

Table 2 shows Arbor’s test score 1.083 versus the initial 1.096, a relative gain of +5.92 %.

Arbor consistently attains the top score across all categories except the Terminal‑Bench 2.0 development metric, where Claude Code marginally exceeds it. The shaded $\Delta$ rows highlight the relative improvements over the initial material for each task.

**Table 2.** Main results on real research tasks. Each task reports native development and held-out test metrics for the initial material, single-agent baselines, and Arbor; the task label shows the native metric direction. Shaded $\Delta$ rows report relative improvements over the initial material for Model Training tasks, and absolute changes for all other tasks.

Analysis of Research Traces

Arbor’s hierarchical search yields consistent held‑out gains across six autonomous research tasks.

Recall that Arbor replaces linear trial sequences with a hypothesis tree that can backtrack and branch.

Arbor improves held‑out performance by up to +22.34 points over the initial material on the six AO tasks.

Table 2 shows Arbor’s held‑out scores surpass Codex and Claude Code on all tasks, with the largest gain on BrowseComp (+22.34 points, from 45.33% to 67.67%).

Across the six tasks—optimizer design, architecture design, terminal‑bench, BrowseComp, search‑agent, and math‑reasoning—Arbor consistently reaches the highest held‑out scores, often earlier in the run. The pattern holds while using identical controller depth and only varying initial material and evaluators.

**Figure 5.** Exploration efficiency on the six AO tasks (one panel per task). Curves show the best-so-far development gain over the run, normalized to Arbor’s final gain, so Arbor (solid) ends at 100% and the Claude Code baseline (dashed) at its own relative ceiling. Stars mark each method’s held-out test maximum, annotated with the test score from Table 2.

**Figure 6.** Evolution of task understanding across the BrowseComp run. Each upper-tier box states the current problem framing, the fix attempted, and the mechanistic finding that drove the next shift. The lower tier shows the experimental nodes behind each transition.

The dev/test split reveals that optimizing solely on development feedback can overfit, as seen with Claude Code’s 75.00% dev score but only 71.70% held‑out. Arbor mitigates this by promoting candidates only when they improve the held‑out metric, decoupling exploration from verification.
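The promotion rule can be isolated in a few lines. The sketch below is illustrative only: `Candidate` and `held_out_gate` are hypothetical names, the scores are made up rather than taken from the paper's tables, and higher is assumed to be better for both evaluators.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    dev: float    # development score, guides the search
    test: float   # held-out score, consulted only at the gate


def held_out_gate(best: Candidate, candidates: List[Candidate]) -> Candidate:
    """Pick the top candidate by dev score and promote it only if it
    strictly improves on the held-out score of the current best."""
    if not candidates:
        return best
    top = max(candidates, key=lambda c: c.dev)
    return top if top.test > best.test else best


# Illustrative numbers only (not taken from the paper's tables).
best = Candidate("trunk", dev=70.0, test=70.0)
overfit = Candidate("overfit", dev=75.0, test=69.0)
solid = Candidate("solid", dev=73.0, test=72.0)

print(held_out_gate(best, [overfit]).name)         # trunk
print(held_out_gate(best, [solid]).name)           # solid
print(held_out_gate(best, [overfit, solid]).name)  # trunk
```

The last call shows a consequence of selecting by development score first: when the overfit candidate tops the cycle, the gate rejects it and the trunk is kept, even though a sibling would have passed.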

Arbor’s hierarchical state evolution directly drives its superior held‑out performance, showing that structured exploration outweighs raw compute.

Extended Performance Benchmarks

Arbor reaches perfect success, beating all baselines by a large margin.

Arbor attains a perfect 100% success rate on the benchmark, surpassing the runner‑up by 13.64 percentage points.

All competing systems (Gemini‑3‑Flash, GPT‑5.5, Claude‑Opus‑4.6, etc.) achieve at most 86.36%.

Transfer and Generality

Arbor transfers its hypothesis‑tree gains to new backbones and tasks, achieving record medal rates.

Arbor with a Gemini‑3‑Flash backbone attains an 86.36% above‑median rate on MLE‑Bench Lite, surpassing prior agents.

Table 3 shows 86.36% above‑median for Arbor (Gemini‑3‑Flash) versus lower rates for AI‑Scientist and LoongFlow.

MLE‑Bench Lite is a compact benchmark that evaluates autonomous research agents on a fixed set of ML tasks, awarding medals based on percentile performance.

**Table 3.** MLE-Bench Lite results under the official evaluation protocol. All entries are percentages.

**Figure 3.** Backbone generality and cross-task transfer. (a) Arbor is rerun with different backbone models while keeping the controller, evaluator budget, and task adapters fixed. (b) A BrowseComp-evolved search harness is frozen and evaluated on held-out search-agent tasks without further task-specific optimization.

Backbone effects are task‑dependent: Claude Opus 4.6 shines on the BrowseComp search benchmark, where broad reasoning matters, whereas GPT‑5.5 excels on MLE‑Bench Lite, where ML‑engineering knowledge is critical.

Transfer evaluation freezes the harness learned on BrowseComp and applies it to two unseen search‑agent tasks. The frozen harness raises BrowseComp held‑out accuracy from 45.33% to 67.67%, HLE from 25.50% to 31.50%, and DeepSearchQA from 61.00 ± 6.76% to 69.00 ± 6.41%.

Component Ablations

We isolate the impact of the tree and insight‑feedback components on MLE‑Bench Lite.

We evaluate the two core HTR components by ablating them on MLE‑Bench Lite. The “w/o tree” variant replaces the hierarchical hypothesis tree with a flat experiment queue, while “w/o insight feedback” keeps the tree but disables upward propagation of distilled lessons.

All variants obtain perfect validity, so the ablation gap is not caused by execution failures. The quality drop—especially in the Any Medal metric—shows that the tree chiefly improves later‑stage refinement, while insight feedback supplies the semantic memory that makes the hierarchy effective.

Full Arbor improves the Any Medal rate by 27.28 points over the w/o insight feedback ablation.

Full Arbor achieves 81.82% Any Medal versus 54.54% when insight feedback is removed.

**Table 4.** Component ablations on MLE-Bench Lite (Claude Opus 4.6 backbone). Entries are percentages.

**Table 5.** Arbor’s node statistics. Dev+ means nodes that improve over the baseline on the dev set.

**Figure 4.** Token budget and relative held-out gain across completed AO cost logs. Token totals sum input and output tokens; for Arbor, the total further sums coordinator and executor usage. The y-axis reports percent improvement over each task’s initial held-out score.

Limitations and Scope

Arbor externalizes research state into a hypothesis tree, letting agents backtrack, branch, and refine ideas.

Our experiments probe autonomous research but only a narrow slice of possible scientific tasks. The current AO task suite includes model training, harness engineering, and data synthesis, leaving out domains such as low‑level kernel optimization, pretraining data‑mixture design, and open‑ended system design. Extending the benchmark to biology, mathematics, and physics would require evaluating not just metric gains but also scientific meaning, reproducibility, and transferability.

The AO interface optimizes a single scalar metric supplied by a task‑specific evaluator, which simplifies real research where objectives are multi‑dimensional. Future systems should support multi‑objective search, explicit constraints, Pareto comparisons, and adaptive scheduling among criteria such as performance, resource use, robustness, interpretability, novelty, and safety. This would discourage agents from overfitting to a narrow benchmark while ignoring broader research goals.

Agents can read feedback and propose local refinements, yet they rarely discover genuinely new mechanisms or persist with promising branches after early setbacks. They often reverse‑engineer solutions from observed scores instead of reasoning from first principles. Research into better uncertainty tracking, reuse of negative evidence, branch revival, and prompting for causal hypothesis generation could close this gap.

Long‑horizon autonomous research is throttled by systems engineering details such as prompt caching, evaluator scheduling, isolated environment startup, parallel worktree execution, and inter‑agent coordination. The sheer number of model and evaluator calls makes a successful search expensive even when each step is cheap. Developing cost‑aware tree policies, adaptive evaluator allocation, stronger caching, checkpointing, and more robust execution infrastructure will be essential for scaling.

Current foundation models can code, summarize, and suggest plausible local hypotheses, but they lack deep domain expertise, long causal chains, and truly creative problem reformulation. Scaling models alone may not suffice; integrating domain knowledge bases, specialized tools, simulators, formal checkers, and training signals aimed at scientific hypothesis generation is likely required. Arbor provides a structure for accumulating and testing ideas, yet the quality of those ideas remains an open frontier.

Arbor’s architecture separates a persistent Coordinator, which maintains the hypothesis tree, from short‑lived Executors that implement individual hypotheses. The Coordinator decides which branches to explore, merges successful results, and records evidence, while Executors focus solely on executing a single idea in an isolated worktree.

The system uses two prompt templates: one for the Coordinator (B.1.1) and one for the Executor (B.1.2). These prompts encode the roles, responsibilities, and constraints that each component must obey during the research loop.

The Coordinator prompt instructs the agent to keep the hypothesis tree as the shared research state, select hypotheses for execution, and translate experimental feedback into reusable evidence, without directly editing the target artifact.

The Executor prompt defines a Research Agent that implements a given hypothesis, runs experiments in an isolated git worktree, and reports results honestly, while respecting constraints such as non‑destructive actions and fixed hypothesis direction.

Algorithm 2 expands the HTR pseudocode with implementation‑level detail, defining symbols such as insight $\iota_n$, branch reference $b_n$, dev score $s_n$, factual result $r_n$, and concatenated insights $\iota_{\text{anc}(n)}$.

Arbor’s engineering choices are grouped into four areas: hypothesis‑tree management, experiment management, long‑horizon operation, and extensibility via skills and plugins. These design decisions differentiate Arbor from general‑purpose coding agents like Codex or Claude Code, enabling flexible adaptation to diverse research needs.

Discussion and Future Directions

Discussion of how Arbor’s hypothesis tree shapes research progress, timing, and idea quality.

Early nodes test broad mechanisms, confirming whether a coarse hypothesis holds; later nodes probe the boundaries of those mechanisms, identifying where they stop working. Ancestor insights then compress positive and negative findings into constraints that shape the final design, which preserves evidence dossiers across independent rollouts while keeping search trajectories independent.

Across tasks, the best candidates tend to appear in the middle or later stages of a run, because earlier experiments reduce arbitrariness and provide a higher‑information baseline. Later proposals are conditioned on accumulated evidence, turning successful mechanisms into priors and failed variants into negative constraints, which makes the search less repetitive and more memory‑aware.

Ideas are local and executable, targeting specific optimizer components, retrieval steps, or data‑synthesis modules, and they directly respond to earlier observations, turning “half‑right” results into new design constraints. However, progress that requires a high‑level reformulation of the problem still depends on human‑provided task design, highlighting the complementary role of human insight.

Implementation Details

Implementation details of Arbor’s hypothesis‑tree state, coordinator/executor loops, and toolkits.

Arbor externalizes research state into an in‑memory hypothesis tree that records every experiment as a node with address, hypothesis, status, score, result, insight, and a git branch reference.
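The node fields listed above map naturally onto a record type that can be written to disk. The following is a minimal sketch, assuming a dataclass layout and a dotted address format that are not specified in the source; `NodeRecord`, `to_json`, and `from_dict` are hypothetical names, and the branch string is a placeholder.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class NodeRecord:
    address: str                    # position in the tree, e.g. "0.0" (format assumed)
    hypothesis: str
    status: str = "pending"
    score: Optional[float] = None   # development score
    result: str = ""                # factual result of the experiment
    insight: str = ""               # distilled lesson
    branch: Optional[str] = None    # git branch reference
    children: List["NodeRecord"] = field(default_factory=list)


def to_json(root: NodeRecord) -> str:
    """Serialize the whole tree; asdict recurses into child records."""
    return json.dumps(asdict(root), indent=2)


def from_dict(data: dict) -> NodeRecord:
    """Rebuild a tree from the dictionary form produced by to_json."""
    children = [from_dict(c) for c in data.get("children", [])]
    fields = {k: v for k, v in data.items() if k != "children"}
    return NodeRecord(children=children, **fields)


root = NodeRecord(address="0", hypothesis="baseline", status="done")
root.children.append(
    NodeRecord(
        address="0.0",
        hypothesis="add 16 units",
        status="done",
        score=0.62,
        insight="wider net improves training loss",
        branch="example-branch",    # placeholder, not a real Arbor branch name
    )
)
restored = from_dict(json.loads(to_json(root)))
print(restored == root)  # True
```

Keeping the record free of parent pointers makes it acyclic, so it serializes directly and a reloaded tree compares equal to the original.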

Core coordinator loop (Algorithm 2) – high‑level pseudo‑code.

**Figure 8.** Agent-level implementation details of Arbor.

**Algorithm 2.** Hypothesis Tree Refinement (HTR) with expanded implementation detail. The coordinator owns $Tree$, and each Executor owns one worktree.

**Input**: $\mathcal{P} = (\mathcal{M}_0, \mathcal{O}, \mathcal{E}_{\text{dev}}, \mathcal{E}_{\text{test}})$, budget $B$, branching $k$, parallelism $P$

**Output**: best artifact $\mathcal{M}^\star$, annotated hypothesis tree $Tree$, and run summary report

1. // Initialization
2. init $Tree = (\{n_0\}, \emptyset)$, $b_{n_0} \leftarrow \mathcal{M}_0$, $\mathcal{M}_{\text{best}} \leftarrow \mathcal{M}_0$
3. run $\mathcal{E}_{\text{dev}}(\mathcal{M}_0)$ and record baseline score and eval command in $Tree$.meta via TreeSetMeta
4. // Main coordinator loop
5. **while** $B$ left $\wedge$ pending leaves exist $\wedge$ no stop signal **do**
6. $\quad$ // Step 1: OBSERVE, build constraint view
7. $\quad \mathcal{V} \leftarrow \text{OBSERVE}(Tree, \mathcal{M}_{\text{best}})$ // shape, root insight, pruned lessons, validated findings
8. $\quad$ // Step 2: IDEATE, skill-gated hypothesis proposal
9. $\quad \mathcal{V}_c \leftarrow \text{TreeView}(\text{format}=\text{constraints})$ // hard constraints: no re-tread of pruned directions
10. $\quad$ load `idea_drafting` skill via LoadSkill // mandatory gate: must precede any candidate proposal
11. $\quad p \leftarrow \text{SELECT}_{\text{parent}}(\mathcal{V})$
12. $\quad$ **foreach** surviving candidate after Fatal-Flaw Scan **do**
13. $\quad \quad$ attach pending child $n^{(i)}$ with hypothesis $h^{(i)} = (\text{Mechanism, Hypothesis, Observable, Conflicts})$
14. $\quad \quad \iota_{\text{anc}(n^{(i)})} \leftarrow$ insights on $\text{path}(n_0 \rightarrow p)$ // injected into Executor prompt
15. $\quad \quad$ prune `idea_drafting` scratch from coordinator context // elide skill body + reasoning post-TreeAddNode
16. $\quad$ **end foreach**
17. $\quad$ // Step 3: SELECT frontier for parallel dispatch
18. $\quad L \leftarrow$ up to $P$ pending leaves under $\text{SELECT}_{\text{frontier}}(\mathcal{V})$
19. $\quad$ // Step 4: DISPATCH, parallel Executor dispatch
20. $\quad \{(s_n, r_n, \iota_n, b_n)\}_{n \in L} \leftarrow \text{parallel Executor}(h_n, \iota_{\text{anc}(n)}, \mathcal{M}_{\text{best}})$
21. $\quad$ // Step 5: UPDATE, write back and propagate
22. $\quad$ **foreach** $n \in L, a \in \text{path}(n_0 \rightarrow n)$ **do**
23. $\quad \quad$ write back $(s_n, r_n, \iota_n, b_n)$ to node $n$ in $Tree$ and set $n.\text{status} \leftarrow \text{done}$
24. $\quad \quad \iota_a \leftarrow \text{ABSTRACT}(\{\iota_c\}_{c \in \text{ch}(a)})$ // propagate insights upward to root
25. $\quad$ **end foreach**
26. $\quad$ check convergence: inject intervention if $\ge w$ consecutive non-improving experiments
27. $\quad$ // Step 6: DECIDE, merge gate or prune
28. $\quad n^\dagger \leftarrow \arg \max_{n \in L} s_n$
29. $\quad$ **if** $s_{n^\dagger}$ exceeds best score by $\ge \theta$ **then**
30. $\quad \quad$ create detached worktree at $b_{n^\dagger}$ and run $\mathcal{E}_{\text{test}}$ with template substitution
31. $\quad \quad$ **if** $S_{\text{test}}(b_{n^\dagger}) > S_{\text{test}}(\mathcal{M}_{\text{best}})$ **then** $\mathcal{M}_{\text{best}} \leftarrow \text{merge}(b_{n^\dagger})$ and update $Tree$.meta.`trunk_score`
32. $\quad$ **end if**
33. $\quad$ prune subtrees falsified by $\{\iota_n\}_{n \in L}$ and persist $Tree$ to JSON + Markdown
34. **end while**
35. run $\mathcal{E}_{\text{test}}(\mathcal{M}_{\text{best}})$ and record `test_trunk_score` and `test_baseline_score`
36. **return** $\mathcal{M}^\star \leftarrow \mathcal{M}_{\text{best}}, Tree$, run summary
37. **Procedure** $\text{Executor}(h_n, \iota_{\text{anc}(n)}, \mathcal{M}_{\text{best}})$:
38. $\quad$ branch name $\leftarrow \text{slug}(\text{node\_id}) + \text{slug}(h_n) + \text{SHA1}(h_n)_{[:8]}$
39. $\quad$ create worktree $W_n$ in /tmp/ from current best branch HEAD on a fresh branch
40. $\quad$ inject eval command (with $\{\text{cwd}\} \rightarrow W_n$, $\{\text{node\_id}\} \rightarrow n$) and $\iota_{\text{anc}(n)}$ into prompt
41. $\quad$ **repeat**
42. $\quad \quad \Delta \leftarrow \text{Implement}(h_n, \iota_{\text{anc}(n)}, W_n)$
43. $\quad \quad (s_n, r_n) \leftarrow \mathcal{E}_{\text{dev}}(\text{apply}(\Delta, W_n))$ // repair $\Delta$ only, direction $h_n$ is fixed
44. $\quad$ **until** run ok $\wedge$ $h_n$-path exercised, or turn cap reached
45. $\quad$ filter $\Delta$: commit implementation files only, skip logs/checkpoints/caches
46. $\quad$ remove worktree directory and retain branch $b_n$ for later merge gate
47. $\quad$ **return** $(s_n, r_n, \text{DISTILL}(h_n, \Delta, r_n), b_n)$
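The Executor's branch-naming and worktree steps can be sketched with the standard library. The listing only says the name combines a slug of the node id, a slug of the hypothesis, and the first eight hex characters of its SHA-1, so the hyphen separator, the slug rule, and the length cap below are assumptions. `slug`, `branch_name`, and `worktree_command` are hypothetical helpers. The command list uses `git worktree add -b <branch> <path> <commit-ish>` and is built here without being executed.

```python
import hashlib
import re
from typing import List


def slug(text: str, max_len: int = 24) -> str:
    """Lowercase, collapse runs of non-alphanumerics to '-', truncate (assumed rule)."""
    cleaned = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return cleaned[:max_len].rstrip("-")


def branch_name(node_id: str, hypothesis: str) -> str:
    """slug(node_id) + slug(hypothesis) + first 8 hex chars of SHA-1(hypothesis)."""
    digest = hashlib.sha1(hypothesis.encode("utf-8")).hexdigest()[:8]
    return f"{slug(node_id)}-{slug(hypothesis)}-{digest}"


def worktree_command(branch: str, base: str, root: str = "/tmp") -> List[str]:
    """Build (but do not run) the git command for a fresh worktree on a new branch."""
    return ["git", "worktree", "add", "-b", branch, f"{root}/{branch}", base]


name = branch_name("n1", "Add 16 units")
print(name)
print(" ".join(worktree_command(name, "HEAD")))
```

Hashing the hypothesis text keeps branch names unique when two hypotheses slug to the same prefix, and the readable prefix makes branches easy to match to tree nodes during an audit.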

Questions & answers

What is Arbor and what is its main contribution?

Arbor is a two-level autonomous research agent framework that introduces Hypothesis Tree Refinement (HTR), a mechanism that organizes experiments into a persistent hypothesis tree linking hypotheses, implementations, experimental evidence, and distilled insights across time. Its main contribution is replacing isolated, linear trial-and-error loops with a cumulative, auditable, tree-based research process.

What problem does Arbor address?

Arbor addresses the problem that current autonomous research agents treat each experiment as an isolated attempt, discarding knowledge gained from earlier trials and leading to fragmented progress that stalls over long horizons. Without a mechanism to retain and reuse information, agents fail to learn from past failures or build on successful insights.

How does Hypothesis Tree Refinement (HTR) work?

HTR organizes research into a tree where each node links a hypothesis to its implementation, experimental evidence, and a distilled insight (ι_n) that informs future search. A long-lived Coordinator manages the global tree and propagates distilled insights upward, while short-lived Executors test individual branches in isolated git worktrees, reporting results that are abstracted into semantic lessons rather than raw execution traces.

Why does Arbor use a tree structure instead of a flat log or linear history?

A tree preserves the hierarchical, branching nature of research, allowing the agent to maintain competing hypotheses simultaneously and propagate distilled insights upward to turn local outcomes into direction-level lessons. This also enables the Coordinator to prune entire sub-directions rather than discarding only isolated trials.

How does Arbor prevent overfitting to the development evaluator?

Arbor enforces a held-out merge gate: a candidate artifact is only promoted to 'current best' if it demonstrates improvement on a separate, held-out test evaluator (E_test), which is accessed exclusively through the GitMergeBranch tool. This decouples exploration from verification, as illustrated by Claude Code achieving a 75.00% dev score but only 71.70% held-out score on one task.

What is the coordinator-executor architecture in Arbor?

The Coordinator is a persistent agent that owns the global hypothesis tree, selects branches for exploration, merges successful results, and records evidence without directly editing target artifacts. Executors are short-lived agents that implement a single hypothesis in an isolated git worktree, run experiments using only local tools, and cannot read the trunk or sibling branches, ensuring each reported score is attributable to its generating hypothesis.

What datasets and benchmarks were used to evaluate Arbor?

Arbor was evaluated on two benchmark families: the AO Task Suite, comprising six real-world research tasks covering model training, harness engineering, and data synthesis (including optimizer design, architecture design, Terminal-Bench 2.0, BrowseComp, search-agent, and math-reasoning), and MLE-Bench Lite, a long-horizon ML-engineering benchmark derived from MLE-Bench.

What are the key quantitative results for Arbor?

Arbor achieves more than 2.5× the average relative held-out gain of standard coding agents across the six AO tasks. It attains the top score in every category except the Terminal-Bench 2.0 development metric, where Claude Code marginally exceeds it. In the transfer evaluation, a harness learned on BrowseComp and then frozen raised held-out accuracy from 45.33% to 67.67% on BrowseComp, from 25.50% to 31.50% on HLE, and from 61.00% to 69.00% on DeepSearchQA.
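The transfer numbers above translate into the following absolute gains in percentage points. The snippet only restates the reported before/after accuracies; the variable names are illustrative.

```python
# (before, after) held-out accuracy in percent, as reported for the transfer evaluation.
transfer = {
    "BrowseComp": (45.33, 67.67),
    "HLE": (25.50, 31.50),
    "DeepSearchQA": (61.00, 69.00),
}

# Absolute gain in percentage points, rounded to avoid floating-point noise.
gains = {name: round(after - before, 2) for name, (before, after) in transfer.items()}
print(gains)  # {'BrowseComp': 22.34, 'HLE': 6.0, 'DeepSearchQA': 8.0}
```

The gain is largest on the task the harness was learned on and smaller, though still positive, on the two unseen search tasks.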

What baselines does Arbor compete against?

On the AO Task Suite, Arbor is compared against Codex (GPT-5.5) and Claude Code (Claude Opus 4.6), both given identical initial material, objectives, evaluators, and a 48-hour budget. On MLE-Bench Lite, results are compared against leaderboard systems including AIDE, ML-Master 2.0, AIRA-dojo, InternAgent, R&D-Agent, Famou-Agent 2.0, MARS, Leeroo, AIBuildAI, LoongFlow, and AI-Scientist-style agents.

What do the component ablations reveal about HTR?

Ablations on MLE-Bench Lite show that removing the hierarchical tree ('w/o tree') and replacing it with a flat experiment queue, or keeping the tree but disabling upward propagation of distilled lessons ('w/o insight feedback'), both reduce performance, particularly on the Any Medal metric. The tree chiefly improves later-stage refinement, while insight feedback supplies the semantic memory that makes the hierarchy effective.

What are the limitations of Arbor as acknowledged by the paper?

The paper identifies several limitations: the AO task suite covers only a narrow slice of scientific tasks (model training, harness engineering, data synthesis), excluding domains like biology, mathematics, physics, and kernel optimization; the system optimizes a single scalar metric rather than supporting multi-objective search; agents rarely discover genuinely new mechanisms and often reverse-engineer solutions from scores rather than reasoning from first principles; long-horizon runs are expensive due to systems engineering overhead; and current foundation models lack deep domain expertise and truly creative problem reformulation.

How does Arbor compare to prior autonomous research approaches?

Prior work spans idea generation, code search, and hierarchical hypothesis management, but these systems differ in what they search, how they organize state, and which roles they expose. Arbor's hypothesis-tree design unifies these trends by externalizing research state, enabling systematic backtracking, evidence preservation, and auditable merge decisions, distinguishing it from general-purpose coding agents like Codex or Claude Code.

How does backbone model choice affect Arbor's performance?

Backbone effects are task-dependent: Claude Opus 4.6 performs better on the BrowseComp search benchmark where broad reasoning matters, whereas GPT-5.5 excels on MLE-Bench Lite where ML-engineering knowledge is critical. Both the Coordinator and Executors in Arbor's default configuration use Claude Opus 4.6.

How is Arbor implemented and what are its key engineering details?

Arbor externalizes research state into an in-memory hypothesis tree that is persisted as JSON after every mutation and rendered to Markdown for human dashboards, so the state survives crashes and context-window compression. The Coordinator works through tree and orchestration tools (TreeView, TreeAddNode, TreeUpdateNode, TreePrune, TreeSetMeta, TreePropagate, RunSubagent, GitMergeBranch) and has only read-only access to code, while Executors use only local tools (Bash, RunTraining, FileRead/Edit/Write, Grep, Glob, SubAgent). The default configuration uses 20 Coordinator cycles, a maximum tree depth of 2, and a 48-hour wall-clock limit.
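A minimal sketch of the persist-on-every-mutation pattern, assuming a flat dictionary of nodes keyed by id. The class `PersistentTree`, its methods, and the JSON layout are hypothetical and do not reproduce Arbor's actual file format.

```python
import json
import os
import tempfile


class PersistentTree:
    """In-memory tree that is rewritten to disk as JSON after each mutation."""

    def __init__(self, path):
        self.path = path
        self.nodes = {}  # node_id -> {"hypothesis", "parent", "insight"}
        if os.path.exists(path):  # recover state after a crash or restart
            with open(path, encoding="utf-8") as f:
                self.nodes = json.load(f)

    def _save(self):
        # Write to a temporary file, then replace, so a crash cannot leave a half-written tree.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.nodes, f, indent=2)
        os.replace(tmp, self.path)

    def add_node(self, node_id, hypothesis, parent=None):
        self.nodes[node_id] = {"hypothesis": hypothesis, "parent": parent, "insight": None}
        self._save()

    def update_node(self, node_id, **fields):
        self.nodes[node_id].update(fields)
        self._save()

    def _lines(self, parent, depth):
        for node_id, node in self.nodes.items():
            if node["parent"] == parent:
                yield f"{'  ' * depth}- {node_id}: {node['hypothesis']}"
                yield from self._lines(node_id, depth + 1)

    def to_markdown(self):
        """Render the tree as a nested Markdown bullet list."""
        return "\n".join(self._lines(None, 0))


# Toy usage: mutate, then reload from disk as if the process had restarted.
workdir = tempfile.mkdtemp()
tree = PersistentTree(os.path.join(workdir, "tree.json"))
tree.add_node("root", "Improve held-out accuracy")
tree.add_node("a", "Rerank search results", parent="root")
tree.update_node("a", insight="Helps only on long queries")
reloaded = PersistentTree(tree.path)
print(reloaded.to_markdown())
```

Because every mutation ends in a save, the reloaded tree matches the in-memory one, which is the property that lets research state outlive any single agent context.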

How are results normalized for cross-task comparison?

For cross-task averages, the paper computes a normalized held-out improvement (Δ_test) that orients all metrics so larger is better, using absolute change for percentage metrics and relative improvement for steps/loss metrics.
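One way to read this normalization is sketched below. The exact formula is not given in this summary, so the sign convention for lower-is-better metrics, the `kind` labels, and the function name `delta_test` are assumptions; the example values are made up.

```python
def delta_test(baseline: float, score: float, kind: str) -> float:
    """Normalized held-out improvement, oriented so that larger is always better."""
    if kind == "percentage":
        # Higher-is-better percentage metric: absolute change in points.
        return score - baseline
    if kind == "lower_is_better":
        # Steps or loss: relative reduction with respect to the initial material.
        return (baseline - score) / baseline
    raise ValueError(f"unknown metric kind: {kind}")


# Toy values: accuracy rising from 50% to 60%, and a loss falling from 2.0 to 1.5.
acc_gain = delta_test(50.0, 60.0, "percentage")
loss_gain = delta_test(2.0, 1.5, "lower_is_better")
```

Under this reading, an accuracy gain and a loss reduction both yield positive values, so they can be averaged across tasks without sign flips.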

Who authored Arbor and where was it published?

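Arbor was authored by Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, and Zhicheng Dou. It is available on arXiv at https://arxiv.org/abs/2606.11926; no venue or publication date beyond the arXiv submission is specified.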

Key terms

Hypothesis Tree Refinement (HTR): Arbor's core mechanism that organizes research experiments into a persistent tree where each node links a hypothesis to its implementation, experimental evidence, and a distilled insight that guides future exploration.
Coordinator: The long-lived, persistent agent in Arbor that owns and manages the global hypothesis tree, selects branches for exploration, and merges successful results without directly editing target artifacts.
Executor: A short-lived agent in Arbor that implements a single hypothesis in an isolated git worktree, runs experiments, and reports results back to the Coordinator.
Distilled insight (ι_n): A semantic summary of an experiment's outcome that is abstracted from raw execution traces and propagated upward through the hypothesis tree to inform future search directions.
Held-out merge gate: A gating mechanism in Arbor that only promotes a candidate artifact to 'current best' if it strictly improves performance on a separate held-out test evaluator, preventing overfitting to development feedback.
AO Task Suite: Arbor's custom benchmark of six real-world research tasks covering model training, harness engineering, and data synthesis, each with initial material, a natural-language objective, and development and test evaluators.
MLE-Bench Lite: A long-horizon ML-engineering benchmark derived from MLE-Bench, used to evaluate Arbor against other autonomous research systems on a public leaderboard.
Development evaluator (E_dev): An evaluator accessible to Executors during experimentation that provides feedback to guide the search, but whose scores alone are insufficient to promote a candidate to best.
Test evaluator (E_test): A held-out evaluator accessible only through the GitMergeBranch tool that determines whether a candidate artifact is promoted to 'current best,' ensuring improvements are genuinely transferable.
GitMergeBranch: A Coordinator tool in Arbor that creates a detached worktree, runs the held-out test evaluator, and merges a branch into the main trunk only if the test score strictly exceeds the current best.
Worktree: An isolated git working directory in which an Executor implements and tests a single hypothesis, preventing interference with the main trunk or sibling branches.
Ancestor insights (ι_anc(n)): The concatenated distilled insights from all ancestor nodes of a given tree node, used to condition new hypotheses on accumulated evidence from prior experiments.
Δ_test (normalized held-out improvement): A cross-task metric that measures improvement over the initial material on the held-out evaluator, oriented so larger values are always better regardless of the native task metric.
BrowseComp: A search-agent benchmark task included in the AO Task Suite that tests broad reasoning and web-search capabilities, used also for transfer evaluation to unseen search tasks.
Flat experiment queue (w/o tree ablation): An ablation variant of Arbor that replaces the hierarchical hypothesis tree with a flat list of experiments, used to measure the contribution of tree structure to overall performance.
Insight feedback (w/o insight feedback ablation): An ablation variant of Arbor that retains the tree structure but disables upward propagation of distilled lessons, used to measure the contribution of semantic memory to performance.
