Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

Retrospective Harness Optimization (RHO) improves agent performance by self-analyzing past trajectories without external labels.

How can an AI agent improve its own tool-use and workflow harness using only its past experience, without needing external ground-truth labels?

AI agents rely on a "harness"—a collection of tools, prompts, and skills—to solve complex tasks, but improving this harness typically requires expensive, labeled validation sets that are unavailable in real-world deployments. RHO automates this evolution by retrospectively analyzing an agent's own past trajectories: it selects a diverse coreset of challenging tasks, generates parallel rollouts, and uses self-validation and self-consistency signals to propose and select harness updates. On the SWE-Bench Pro benchmark, a single round of RHO improves the pass rate from 59% to 78% without any external grading, outperforming validation-feedback-driven methods at comparable compute budgets.

Paper Primer

The core move is a self-supervised loop that treats the agent's own history as a diagnostic dataset. By re-solving a diverse coreset of past tasks, the agent identifies its own failure modes—such as incorrect tool usage or inconsistent reasoning—and uses these insights to rewrite its own instructions and tools.

RHO is a three-stage pipeline: first, a Determinantal Point Process (DPP) selects a coreset of tasks that are both difficult and diverse; second, the agent performs group rollouts to extract self-validation and self-consistency signals; third, it generates multiple candidate harnesses and uses pairwise self-preference to select the most effective update.

RHO achieves significant performance gains on long-horizon tasks without ground-truth labels.

Pass rate on SWE-Bench Pro increased from 59% to 78%.

RHO outperforms validation-feedback-based optimization at a comparable compute budget.

At a single-round budget, RHO (78% pass rate) significantly exceeds Meta-Harness (62% pass rate).

Why is this approach better than simply accumulating more experience or using a standard memory bank?

Prior methods like Dynamic Cheatsheet or ReasoningBank only curate memory or text-based skills, leaving the executable harness (tools and workflows) untouched. RHO optimizes the full harness, allowing the agent to develop new tools and structural workflows that directly address its specific failure modes.

What is the primary limitation of this method?

RHO assumes the environment resets cleanly and tolerates repeated attempts, making it unsuitable for one-shot or irreversible tasks. Additionally, it relies on the agent's own judgment, which could potentially entrench biases or unsafe procedures if the agent's self-preference is flawed.

RHO demonstrates that agents can effectively self-optimize their own operational infrastructure using only past experience, shifting the burden of improvement from external human-labeled validation to internal retrospective analysis.

The method's reliance on self-preference means that if an agent consistently prefers incorrect or unsafe procedures, those behaviors may be reinforced; human audit logs and safety checks remain necessary for high-impact deployments.

Introduction

The paper frames the missing-label problem in harness optimization and proposes a self-supervised, retrospective alternative.

Improving an agent's harness usually means evaluating candidate updates against a labeled validation set. Such data are scarce in deployment, where the agent has only its own past trajectories to draw on. This motivates an approach that optimizes the harness from those trajectories without external labels.

Validation-feedback harness optimization is the traditional loop: it refines the agent's harness by repeatedly evaluating candidate updates on a held-out labeled validation set.

**Figure 1.** RHO versus validation-feedback harness optimization. Validation-feedback methods iterate against a labeled validation set, whereas RHO optimizes from past trajectories in a single retrospective pass with no ground-truth labels.

The shift from labeled validation to self‑supervised retrospective improvement enables continual harness evolution without external data.

Related Work

Survey of prior harness and self‑improvement approaches, positioning RHO.

Harness optimization improves an agent by editing the prompts, program parameters, or workflow code that surround a fixed model.

RHO scans past unlabeled trajectories once, extracts useful patterns, and rewrites the entire harness without any validation metric.

Instead of external labels, the agent judges its own past actions and uses those self‑evaluations to refine its harness.

Searches prompt and pipeline parameters using a labeled validation metric to maximize downstream performance.

Compiles high‑level pipeline specifications into executable code, then tunes parameters against a validation loss.

Applies gradient‑based updates directly to textual prompts, using a differentiable surrogate of the validation score.

Iteratively rewrites prompts by reflecting on past successes and failures, guided by a validation metric.

Meta‑agent rewrites the agent’s own code, searching over system designs with a validation metric.

Searches over harness code using execution traces and scores of prior candidates, still requiring a validation metric.

Maintains a self‑curated memory of reusable strategies and code snippets at test time.

Distills generalizable reasoning strategies from self‑judged successes and failures.

Coordinates a multi‑agent memory cycle and repairs the memory bank against self‑generated probe questions.

Precomputes useful context offline before queries arrive, reducing online latency.

Evolves the memory system itself as an executable program, discovering task‑specific memory harnesses.

Trains a skill curator with reinforcement learning from outcome and judge rewards, updating a skill repository from accumulated experience.

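The harness-optimization methods above steer their search with a labeled validation metric, while the memory and skill-curation methods leave the executable harness untouched. RHO departs from both: it requires no labels and improves the full harness in a single retrospective pass.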

Defining the Agent Harness

Defines the target harness optimization and why its utility is hard to measure.

Estimating the true quality of a harness requires a representative validation set of future tasks, but the utility $U(t,\tau)$ is latent and cannot be observed directly, making direct optimization infeasible.

A harness $h$ is a persistent toolbox of prompts, external tools, and learned skills that an agent draws on whenever it attempts a new task.

A trajectory $\tau$ records the full execution trace of an agent solving a task: observations, chain‑of‑thought steps, tool invocations, and the final answer.

Because $U$ cannot be measured, the paper replaces it with a self‑preference estimator: the agent compares several trajectories for the same task and produces a ranking together with a textual rationale.

The ranking function $\text{rank}(t,\tau_1,\dots,\tau_m) = (\text{order},\text{rationale})$ returns a preference ordering over the $m$ candidate trajectories and an explanation of why the chosen ordering is preferred.

Algorithm 1 describes a single RHO round: (1) select a coreset of informative trajectories, (2) generate $G$ parallel rollouts per core task, and (3) propose $N$ candidate harnesses, evaluate them with the ranking function, and adopt the best‑scoring harness if it improves the aggregate rank.

The RHO Pipeline

RHO learns a better harness from past runs by focusing on a small, hard, and diverse task set.

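Improving a harness normally requires a labeled validation set, which a deployed agent rarely has. RHO instead learns from the agent's past trajectories through a three-stage pipeline: coreset selection, group rollout with self-diagnosis, and candidate harness proposal with self-preference selection.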

Instead of fine‑tuning on every past run, RHO picks a small, challenging subset of tasks, re‑rolls them in parallel to surface failure signals, and then proposes several candidate harnesses, keeping the one that best improves performance on that subset.

How does RHO differ from standard fine‑tuning on the full trajectory set?

Standard fine-tuning treats every past run equally and updates the model weights by gradient descent. RHO keeps the model fixed and edits only the harness: it first concentrates on the most informative failures (via the coreset), extracts structured improvement signals from multiple rollouts, and then evaluates several stochastically generated harness updates before committing to the best one.

Like picking exam questions that are hard yet drawn from different topics, Coreset Selection chooses a handful of past tasks that are simultaneously difficult and varied, ensuring the later optimization focuses on the most informative failures.

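Suppose four candidate tasks have difficulty scores $r = [0.9, 0.4, 0.8, 0.3]$. Normalize by the maximum ($0.9$) and raise to the power $\alpha$: $\tilde r = [(0.9/0.9)^{\alpha}, (0.4/0.9)^{\alpha}, (0.8/0.9)^{\alpha}, (0.3/0.9)^{\alpha}]$. With $\alpha\approx0.875$ this gives $\tilde r\approx[1.0, 0.49, 0.90, 0.38]$.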

Form kernel $K = \operatorname{diag}(\tilde r)\,S\,\operatorname{diag}(\tilde r)$, yielding a $4\times4$ matrix whose diagonal entries are $\tilde r_i^2$ and off‑diagonals are $\tilde r_i S_{i,j}\tilde r_j$.

Run DPP sampling with $k=2$; the subset with highest determinant is $\{\,\tau_1, \tau_3\,\}$ (the two hardest yet not too similar).

The DPP balances raw difficulty against pairwise similarity, preventing the coreset from collapsing onto a cluster of equally hard but redundant tasks.
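The kernel construction and subset choice in this worked example can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the similarity matrix `S` below is hypothetical (the example does not list its entries), and the exhaustive maximum-determinant search stands in for DPP sampling, which is stochastic and would not enumerate every subset.

```python
from itertools import combinations


def normalize_difficulty(r, alpha):
    """Scale difficulties by their maximum and raise them to the power alpha."""
    r_max = max(r)
    return [(x / r_max) ** alpha for x in r]


def build_kernel(r_tilde, S):
    """K = diag(r_tilde) S diag(r_tilde), so K[i][j] = r_i * S[i][j] * r_j."""
    n = len(r_tilde)
    return [[r_tilde[i] * S[i][j] * r_tilde[j] for j in range(n)] for i in range(n)]


def det(M):
    """Determinant via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n = len(A)
    result = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(A[i][c]))
        if abs(A[p][c]) < 1e-12:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            result = -result  # a row swap flips the sign
        result *= A[c][c]
        for i in range(c + 1, n):
            f = A[i][c] / A[c][c]
            for j in range(c, n):
                A[i][j] -= f * A[c][j]
    return result


def best_subset(K, k):
    """Size-k index set whose principal minor of K is largest (exhaustive search)."""
    def minor(idx):
        return det([[K[i][j] for j in idx] for i in idx])
    return max(combinations(range(len(K)), k), key=minor)


# Hypothetical pairwise similarities between the four task fingerprints.
S = [
    [1.0, 0.2, 0.3, 0.1],
    [0.2, 1.0, 0.1, 0.4],
    [0.3, 0.1, 1.0, 0.2],
    [0.1, 0.4, 0.2, 1.0],
]
r_tilde = normalize_difficulty([0.9, 0.4, 0.8, 0.3], alpha=0.875)
K = build_kernel(r_tilde, S)
chosen = best_subset(K, k=2)  # zero-based indices of the selected tasks
```

With this `S`, the two hardest tasks are only weakly similar, so they are selected together. Raising their mutual similarity toward 1 shrinks the determinant of that pair and lets a less difficult but more distinct task take the second slot.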

Why not simply pick the top‑$k$ hardest trajectories?

Choosing only the hardest ignores redundancy; many hard trajectories may share the same failure mode. The DPP’s determinant term penalizes similarity, ensuring the selected set spans distinct failure patterns, which yields richer improvement signals later.

For each selected task we run several parallel solves, then let the agent compare its own attempts to spot where it fails (self‑validation) and where its behavior is inconsistent across runs (self‑consistency).

Self‑validation flags $\tau^{(1)}_t$ for premature stopping (rank$_{val}$ failure).

Self‑consistency detects that the tool order in $\tau^{(2)}_t$ contradicts $\tau^{(1)}_t$ (rank$_{con}$ disagreement).

Both flags are merged into $I_t$, e.g. “extend horizon by 2 steps” and “prefer tool A before tool B”.

Using both dimensions surfaces complementary weaknesses that a single rollout would miss.

How is this different from simply taking a majority vote over the $G$ rollouts?

Majority voting would only keep the most frequent outcome, discarding the diagnostic information about *why* the minority runs differ. Self‑validation and self‑consistency explicitly extract failure modes and contradictions, turning them into actionable improvement instructions.

Because the improvement signal is noisy, RHO samples many candidate harnesses and lets the agent rank them by how much their rollouts beat the original on the coreset, keeping the best.

Compute $S_1 = (3/2 + 2/2)/2 = (1.5 + 1)/2 = 1.25$.

Compute $S_2 = (2/2 + 4/2)/2 = (1 + 2)/2 = 1.5$.

Both scores are $>0$, so we pick $h_2$ (higher $S_j$).

Sampling multiple candidates protects against a single noisy optimization step that might accidentally degrade performance.

Why not just keep the first candidate harness generated?

The generation process is stochastic; a single sample can miss the improvement direction indicated by $I_t$. By evaluating several candidates we increase the chance of finding a harness that truly benefits the coreset, as reflected in a positive $S_j$.
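The selection arithmetic above can be reproduced with a short sketch. The source shows only the numbers, so reading the numerators as per-task preference tallies and the divisor as a shared normalizer is an assumption, and the names `candidate_score` and `select_harness` are hypothetical.

```python
from statistics import mean


def candidate_score(per_task_values, normalizer):
    """Average the normalized per-task values of one candidate harness."""
    return mean(v / normalizer for v in per_task_values)


def select_harness(scores):
    """Return the best-scoring candidate, or None when no score is positive."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Values taken from the worked example: two coreset tasks per candidate.
scores = {
    "h1": candidate_score([3, 2], normalizer=2),  # (3/2 + 2/2) / 2
    "h2": candidate_score([2, 4], normalizer=2),  # (2/2 + 4/2) / 2
}
selected = select_harness(scores)
```

Returning `None` when every score is non-positive mirrors the requirement that a candidate must beat the current harness before it is adopted.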

Select a coreset $D_{\text{core}}$ of $k$ trajectories using DPP with parameter $\theta$.

For each task in $D_{\text{core}}$, run $G$ parallel solves and compute self‑validation ($rank_{val}$) and self‑consistency ($rank_{con}$) analyses.

Merge the two analyses into improvement instructions $I_t$ and aggregate across tasks.

Sample $N$ candidate harnesses $h_1,\dots,h_N$ conditioned on the aggregated instructions.

Run each candidate on the coreset, compute pairwise preference ranks, and calculate scores $S_j$.

Select the candidate with maximal $S_j$ (require $S_j>0$) as the new harness.
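The six steps above can be read as one function. The following is a minimal control-flow sketch, assuming each operator is supplied as a callable; the parameter names (`select_coreset`, `solve`, `diagnose`, `propose`, `score`) are illustrative stand-ins rather than the paper's API, and the rollouts run sequentially here although the method describes them as parallel.

```python
from typing import Callable, List, Sequence, TypeVar

H = TypeVar("H")  # harness representation (hypothetical)
T = TypeVar("T")  # task representation (hypothetical)
R = TypeVar("R")  # rollout / trajectory
I = TypeVar("I")  # improvement instruction


def rho_round(
    harness: H,
    tasks: Sequence[T],
    select_coreset: Callable[[Sequence[T]], Sequence[T]],
    solve: Callable[[H, T], R],
    diagnose: Callable[[T, List[R]], I],
    propose: Callable[[H, List[I]], H],
    score: Callable[[H, H, Sequence[T]], float],
    G: int,
    N: int,
) -> H:
    """Run one retrospective round and return the harness to use next."""
    core = select_coreset(tasks)

    # Group rollout and diagnosis: G solves per coreset task, one instruction each.
    instructions: List[I] = []
    for task in core:
        rollouts = [solve(harness, task) for _ in range(G)]
        instructions.append(diagnose(task, rollouts))

    # Propose N candidates conditioned on the aggregated instructions.
    candidates = [propose(harness, instructions) for _ in range(N)]
    if not candidates:
        return harness

    # Score each candidate against the current harness on the coreset.
    scored = [(score(c, harness, core), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])

    # Adopt the best candidate only if its score is positive.
    return best if best_score > 0 else harness
```

Keeping the current harness when no candidate scores above zero makes a round a no-op in the worst case.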

Experimental Results

RHO delivers a 19-percentage-point gain on SWE-Bench Pro while staying within the same compute budget.

RHO improves the SWE-Bench Pro pass rate by 19 percentage points (absolute) over the vanilla Codex harness, from 59% to 78%.

Table 1 reports a $\Delta$ of 19 percentage points for SWE-Bench Pro, the largest gain among the three benchmarks.

A task‑driven agent that runs code defined in a harness folder, using a large language model (GPT‑5.5) to reason about which scripts, skills, and instructions to invoke.

**Table 1.** Held-out pass rate after harness optimization. The Architecture column indicates which harness surface each method edits. $\Delta$ is the absolute change over Vanilla Codex on the same held-out split.

**Figure 3.** Highest-scoring harness produced by RHO on each benchmark. *Instructions* are task-agnostic procedural rules, *Skills* record grader or environment idiosyncrasies that previously caused failures, and *Tools* are executable scripts. Items shown are representative, and the full verbatim contents of each harness are in Appendix H.

Behavioral Analysis

RHO reshapes how the agent works, and ablations show that coreset selection, best-of-N proposal, and structured diagnosis each contribute to the gain.

The central premise, that an agent's own past trajectories carry enough signal to improve its harness, underlies every ablation we discuss.

**Figure 4.** Behavior shift after RHO. RHO sustains longer working sessions and shifts the agent's per-step action mix toward verification on SWE-Bench Pro, and toward execution on Terminal-Bench 2 and GAIA-2.

These behavior changes imply that RHO does not merely add new tools; it reshapes the agent’s workflow, allocating more steps to quality‑control actions where they matter most.

**Figure 5.** Coreset selection on SWE-Bench Pro. (a) Where each selector's picks land on the task embedding, with coverage spreading out, difficulty clustering, and RHO's DPP balancing both. (b) Held-out pass rate of the harness optimized from each coreset. Difficulty or diversity alone trails even random sampling, and only the DPP's combination reaches the top gain.

Table 3 reports the best‑of‑N harness proposal versus a single sampled candidate, highlighting modest gains on the two harder benchmarks.

RHO’s best‑of‑N selection improves Terminal‑Bench 2 pass rate by +0.02 over the mean candidate.

Table 3: mean 0.74 → chosen 0.76.

RHO’s best‑of‑N selection improves GAIA‑2 pass rate by +0.03 over the mean candidate.

Table 3: mean 0.34 → chosen 0.37.

Removing the self-consistency signal lowers the SWE-Bench Pro pass rate by 0.22.

Table 4: full diagnosis 0.78 → without self-consistency 0.56.

Removing the self-validation signal lowers the SWE-Bench Pro pass rate by 0.08.

Table 4: full diagnosis 0.78 → without self-validation 0.70.

Skipping diagnosis and feeding raw trajectories lowers the SWE-Bench Pro pass rate by 0.18.

Table 4: full diagnosis 0.78 → raw trajectory 0.60.

Conclusion and Limitations

Conclusion summarizing RHO’s gains, limits, and ethical safeguards.

We introduced RHO, a retrospective self‑supervised loop that improves an agent’s harness by replaying its own past trajectories.

Across software engineering, technical work, and knowledge work, this loop consistently yields held‑out performance gains, demonstrating that an agent’s own experience can replace external ground‑truth feedback.

One limitation is that group rollout replays each coreset task multiple times, which presumes environments that reset cleanly and tolerate repeated attempts; one‑shot or irreversible tasks fall outside RHO’s scope.

Another limitation is the reliance on an editable harness of prompts, skills, and tools; extending RHO to domains with different harness surfaces or rollout budgets remains future work.

Finally, trusting past trajectories as the sole optimization input opens a vulnerability: adversarial content injected mid‑task could be distilled into the harness, reinforcing undesirable behavior.

Ethically, RHO’s ability to modify persistent agent behavior can amplify mistaken preferences or unsafe procedures; deployments should retain full audit logs, require human approval for sensitive edits, and enforce domain‑specific safety checks.

Reproducibility

Details for reproducing experiments and a table comparing RHO to prior methods.

Every experiment logs the full prompt, model completions, trajectory data, diagnostic outputs, candidate harness diffs, configuration files, scores, run metadata, and held‑out reports; the numeric results reported in the paper are read directly from these logs.

**Table 5.** Comparison of RHO with prior methods. The right block marks whether each method meets RHO's setting along three axes. Label-free: uses no ground-truth metric or validation set. Full harness: edits executable tools and skills, not memory or prompt text alone. Single pass: a one-shot retrospective pass, rather than an online stream, an iterative validation-scored search, or a weight-training loop. ● satisfied, ◐ partial, ○ not satisfied.

Agent Operators

B Prompts

This appendix collects the prompts that instantiate the five agent operators of RHO (§3, Algorithm 1, Figure 2), namely solve, the difficulty judge used in Coreset Selection, the diagnosis analysis, optimize, and rank. The prompts are reproduced verbatim. Placeholders of the form {name} are filled at call time with the values described under each block.

B.1 Solve

Every solve(h, t) call materializes the harness and the task into a fresh workspace at harness/ and task/ and hands the agent the wrapper instructions below. The wrapper is task-agnostic: the harness directory carries all task-shaping guidance, and the agent is told only how to read the workspace and how to deliver a final answer. The same wrapper is reused for the baseline rollout and for every candidate-harness rollout, so that the harness $h$ is the only varying input.

Listing 1 The solve wrapper prompt.

Solve the task defined in task/prompt.md, using the information and tools available under harness/.

The harness (harness/) is a toolkit of resources and guidance that helps the agent solve tasks. It can contain any type of file – helper scripts, artifacts, environment setup, documentation with relevant context, and workflows to follow.

Workspace layout:

harness/ - read and invoke anything here, but do not modify it
task/ - files for this task, including prompt.md

Steps:
1. Familiarize yourself with the information and available tools in harness/.
2. Read and analyze task/prompt.md.
3. Complete the task. For code repair tasks, modify files directly under task/repo/.
4. Present your final answer in your last message, in the format prompt.md specifies (or plain prose if unspecified).
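The workspace materialization that precedes this wrapper prompt can be sketched as follows. This is an illustrative assumption about the mechanics, since the source states only that the harness and task are placed in a fresh workspace at harness/ and task/; the function name `materialize_workspace` and the use of a temporary directory are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def materialize_workspace(harness_src: Path, task_src: Path) -> Path:
    """Copy a harness and a task into a new, empty workspace directory.

    Each call creates a separate root, so rollouts cannot see each
    other's edits to task/ (for example, changes under task/repo/).
    """
    root = Path(tempfile.mkdtemp(prefix="solve_"))
    shutil.copytree(harness_src, root / "harness")
    shutil.copytree(task_src, root / "task")
    return root
```

Copying rather than linking means a rollout that ignores the "do not modify" instruction cannot corrupt the shared harness for later rollouts.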

B.2 Coreset Selection (Difficulty Judge)

The difficulty judge produces the score $r_i \in [0, 10]$ and the abstract fingerprint $\phi_i$ that drive the DPP coreset selector of §4.1. It sees the task description together with a length-bounded digest of one short prior trajectory under the current harness. The digest is truncated head/tail to a fixed token budget, and any commands that read the task's expected-answer files are scrubbed before the digest is shown to the judge. The judge is asked to keep the fingerprint in task-agnostic structural vocabulary so that fingerprints from different codebases remain comparable under cosine similarity.
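A rough sketch of the digest preparation described above is given below. The source does not specify the tokenizer, the budget, or how answer-reading commands are detected, so the whitespace-style token list, the truncation marker, and the substring match on answer paths are all assumptions, and `task/expected.json` in the usage lines is a hypothetical path.

```python
def scrub_answer_reads(command_lines, answer_paths):
    """Drop command lines that mention any expected-answer file path."""
    return [
        line for line in command_lines
        if not any(path in line for path in answer_paths)
    ]


def head_tail_truncate(tokens, budget, marker="[...truncated...]"):
    """Keep the start and end of a token list within a fixed budget.

    One slot of the budget is reserved for the marker that replaces
    the dropped middle section.
    """
    if len(tokens) <= budget:
        return list(tokens)
    if budget < 3:
        raise ValueError("budget must allow a head, a marker, and a tail")
    keep = budget - 1
    head = (keep + 1) // 2
    tail = keep - head
    return list(tokens[:head]) + [marker] + list(tokens[len(tokens) - tail:])


# Usage with hypothetical inputs.
lines = scrub_answer_reads(
    ["ls task/", "cat task/expected.json", "pytest -q"],
    answer_paths=["task/expected.json"],
)
digest = head_tail_truncate("a b c d e f g h i j".split(), budget=5)
```

Keeping both ends preserves the agent's initial exploration and its final answer, which are likely the most informative parts of a long run.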

Listing 2 The difficulty judge prompt used in Coreset Selection.

Rate the difficulty of the following software engineering task and write an abstract structural fingerprint of it. You may also use the observed agent run below to inform your judgment.

Output a JSON object (no markdown fences, no extra text):
{
  "difficulty": <float in [0.0, 10.0]>,
  "abstract_fingerprint": "<see guide below>"
}

Difficulty scale:
- 0-2: trivial (obvious one-line fix, cosmetic change).
- 3-5: moderate (localized change, well-defined spec).
- 6-8: hard (multi-file, non-obvious design, subtle bugs).
- 9-10: very hard (cross-cutting refactor, deep reasoning required).

Abstract fingerprint guide – 3-5 sentences (~60-120 words) describing the *shape* of the problem in vocabulary that would apply equally to any software project. Cover:
- Failure mode: what typically goes wrong (partial propagation, missed boundary case, silent precedence regression, broken invariant under a new branch, etc.).
- Source of difficulty: what makes it hard or easy (scattered invariants, ambiguous spec, tight coupling, purely localized arithmetic, etc.).
- Technical complexity: scope (single-function / multi-file / cross-module / architectural), reasoning depth (local / contextual / global invariant tracking), type of change (bug fix / feature / refactor / rollback).

Do NOT mention: repository, product, company, framework, or library names; file paths; function, class, config, or variable names; domain-specific nouns tied to a particular codebase. Use only abstract structural programming vocabulary (invariant, precedence, boundary, state reconciliation, propagation, ordering, contract, etc.).

The observed agent run contains concrete file paths, library names, and tool output. Abstract these out the same way – fingerprints describe shapes, not the specific codebase the agent happened to run in.

The observed run is a single noisy sample, not ground truth. Do not lower difficulty just because the agent appeared to succeed in one attempt, and do not raise difficulty just because one attempt thrashed on bootstrap issues. Treat the trajectory as evidence that adjusts your prior on task difficulty and failure mode; weight it relative to what the task description itself implies.

Example of a well‑abstracted fingerprint:

"A multi-file refactor whose difficulty comes from keeping a single shared invariant consistent across several independently‑evolving modules; the typical failure mode is partial propagation, where one call site adopts the new contract while another silently keeps the old, producing a latent bug that only surfaces under a specific input ordering. Spec ambiguity is low but reasoning must be global – the change is mechanically small per site but requires tracing a contract through the call graph."

Task: --- {query} ---

Observed agent run (under the current harness): --- {trajectory_digest} ---

{query} is the natural-language task description. {trajectory_digest} is the scrubbed, head/tail-truncated digest of one prior trajectory for the same task. This difficulty value is the score $r_i \in [0, 10]$ that enters the DPP kernel, where it is normalized as $r_i/10$ as in §4.1, and the fingerprint is embedded to a unit vector $x_i$ that defines the similarity matrix $S = XX^\top$.
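The post-processing of the judge's output can be sketched as below. The embedding model is not named in the source, so the sketch takes already-computed embedding vectors as input; the function names are hypothetical, and the validation rules simply restate the prompt's output contract.

```python
import json
import math


def parse_judge_output(raw):
    """Parse the judge's JSON and return (normalized difficulty, fingerprint)."""
    obj = json.loads(raw)
    difficulty = float(obj["difficulty"])
    if not 0.0 <= difficulty <= 10.0:
        raise ValueError("difficulty must lie in [0, 10]")
    fingerprint = obj["abstract_fingerprint"]
    if not isinstance(fingerprint, str) or not fingerprint.strip():
        raise ValueError("abstract_fingerprint must be a non-empty string")
    return difficulty / 10.0, fingerprint  # r_i / 10


def to_unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def similarity_matrix(embeddings):
    """S = X X^T over unit-normalized rows, i.e. pairwise cosine similarity."""
    X = [to_unit(v) for v in embeddings]
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
```

Because every row of X has unit length, the diagonal of S is 1 and each off-diagonal entry is the cosine similarity between two fingerprints.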

B.3 Diagnosis

The diagnosis prompt implements $I_t = \text{rank}_{val}(t, \{\tau^{(g)}_t\}_{g=1}^{G})$ (§4.2). The workspace presents the agent with the original task, the shared harness used by every rollout, and $G$ rollout directories. The agent executes a five-step workflow, comprising per-trajectory inspection, failure-mode analysis (self-validation), cross-trajectory disagreement analysis (self-consistency), a single high-level harness improvement direction, and a severity score that doubles as a soft attention weight downstream. The structured JSON output binds each field to a fixed slot so optimize can attend by severity.

Listing 3 The diagnosis prompt.

Analyze three solve trajectories for the same task.

- task/ – the original task. Read task/prompt.md to understand the task/question.
- harness/ – the shared harness used by all three trajectories.
- trajectory_0/ – first solve attempt. Contains events.jsonl, final_message.txt, and workspace_diff/.
- trajectory_1/ – second solve attempt. Contains events.jsonl, final_message.txt, and workspace_diff/.
- trajectory_2/ – third solve attempt. Contains events.jsonl, final_message.txt, and workspace_diff/.

Your job is to analyze and evaluate the three trajectories. Follow this workflow:

## Step 1: Inspect each trajectory

For each of trajectory_0, trajectory_1, and trajectory_2:
1. Inspect final_message.txt and events.jsonl to understand the action and decision process.
2. Evaluate whether the trajectory accurately and efficiently completed the task.
3. Set successful to 1 if the trajectory accurately completed the task, otherwise set it to 0.
4. In quality_analysis, note what evidence, files, tools, or reasoning steps the trajectory relied on, and whether there was unnecessary work, missed information, misleading evidence, or an incorrect decision.

## Step 2: Analyze failure modes

If all three trajectories accurately and efficiently completed the task, this section can be brief. Otherwise, analyze why one or more trajectories failed or performed poorly. Make this analysis faithful and actionable. Ground it in what the trajectories actually did.

## Step 3: Analyze inconsistency

Compare the three event sequences and final answers. Identify whether there are inconsistencies among them: where and why the trajectories diverged, and how those differences affected the behavior.

## Step 4: Summarize harness improvement direction
