Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-Style Agent Harnesses on Coding Tasks

Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang

Claw-SWE-Bench standardizes agent harnesses as a controlled variable to measure coding accuracy and API cost.

How do different "claw" agent harnesses and LLMs compare in performance and cost when solving real-world coding tasks on SWE-bench?

Coding agents are currently evaluated as monolithic systems, where the model, the harness, and the task set are bundled together. This conflation makes it impossible to determine whether a system's performance comes from the underlying model or the agent's specific tool-use and workspace management. Claw-SWE-Bench introduces a standardized adapter protocol that decouples the agent harness from the evaluation environment. This allows heterogeneous agents to run on a fixed set of 350 GitHub issue-resolution tasks under a unified budget and patch-submission contract. The benchmark reveals that harness choice is a first-order factor in performance, with accuracy spreads between harnesses often exceeding those between model tiers.

Paper Primer

The core mechanism is a shared adapter layer that forces diverse agent harnesses to interact with a standardized Docker workspace. Instead of parsing an agent's final natural-language response, the benchmark runner collects patches directly from the repository's Git state, ensuring that all systems are scored by the same objective criteria regardless of their internal logic.

Harness design is a primary driver of coding performance, independent of the underlying model.

In a five-claw sweep using the same Qwen 3.6-flash model, Pass@1 scores varied by 27.4 percentage points.

The adapter protocol is necessary for reliable scoring of general-purpose agents.

Using a minimal "bare" adapter resulted in a 19.1% Pass@1, while the full adapter protocol achieved 73.4% with the same GLM 5.1 backbone.

Why is it insufficient to just report the resolved rate of a coding agent?

Resolved rate ignores the resource cost of exploration. Because agents can differ by orders of magnitude in API cost and wall-clock time for the same task, cost-aware reporting is required to identify systems that are genuinely efficient versus those that simply rely on higher budgets.

Does this benchmark measure the model's intelligence or the harness's capability?

It measures both by treating them as separate experimental variables. By fixing the harness and sweeping models, or fixing the model and sweeping harnesses, the benchmark allows researchers to isolate the contribution of each component to the final Pass@1 score.

Researchers should stop treating agent harnesses as opaque implementation details. Future coding-agent evaluations must report total API cost and cache hit rates alongside accuracy to provide a complete picture of system utility.


Introducing a cost‑aware benchmark that isolates coding‑agent harness performance.

Standard SWE‑bench evaluations conflate model quality, harness design, and task difficulty, making it impossible to attribute performance differences.

Claw‑SWE‑Bench introduces a unified adapter protocol, fixed prompt, runtime budget, workspace contract, patch‑extraction procedure, and evaluator so that any Claw (agent harness) can be compared on equal footing.

The full benchmark contains 350 multilingual GitHub issue‑resolution instances; a Lite subset of 80 instances is selected by a cost‑aware, rank‑aware procedure over 17 calibration columns.

Using this setup, OpenClaw with a minimal direct‑diff adapter attains 19.1 % Pass@1, while the full adapter reaches 73.4 % with the same GLM 5.1 backbone, demonstrating that adapter design is the dominant factor.

The paper introduces a cost‑aware benchmark for coding agents.

The Cost of Coding Agents

Benchmark exposes hidden cost dimension and introduces Claw‑SWE‑Bench with a lite variant.

Reporting only Pass@1 conflates accuracy with resource consumption, because a real coding agent performs many model calls, file edits, and command executions.

Consequently, two agents with identical Pass@1 can differ dramatically in token usage, wall‑clock time, and API cost, making the resolved‑rate metric misleading for cost‑sensitive deployments.

Figure 1 visualizes this gap: each claw‑model combination from the five‑claw × two‑model sweep, evaluated on the full 350 instances, is plotted with Pass@1 on the vertical axis against total API cost on the horizontal axis (log scale). The black curve marks the Pareto frontier, the set of points for which no other combination is both cheaper and more accurate.
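The frontier in Figure 1 can be reproduced from any list of (cost, Pass@1) pairs. The sketch below is a minimal illustration, not the paper's plotting script: the function name `pareto_frontier` and the four operating points are hypothetical, and ties are resolved by keeping only the cheaper of two equally accurate points.

```python
from typing import List, Tuple

# (label, total_cost_usd, pass_at_1_percent); all values below are made up.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no cheaper point matches or beats on Pass@1."""
    frontier: List[Point] = []
    best_pass = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try the more
    # accurate point first so that it is the one retained.
    for label, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_pass:
            frontier.append((label, cost, acc))
            best_pass = acc
    return frontier


# Hypothetical operating points, not results from the paper.
points = [
    ("A", 10.0, 40.0),
    ("B", 25.0, 55.0),
    ("C", 30.0, 50.0),  # dominated by B: more expensive and less accurate
    ("D", 120.0, 70.0),
]
frontier_labels = [p[0] for p in pareto_frontier(points)]
print(frontier_labels)
```

Sorting by cost first means a single pass is enough: a point belongs on the frontier exactly when it improves on the best Pass@1 seen among all cheaper points.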

To address the gap we introduce Claw‑SWE‑Bench, a benchmark that treats the agent harness as a controlled experimental variable and fixes the surrounding evaluation stack (prompt template, task set, execution container, timeout, patch extraction, evaluator).

The benchmark’s workload consists of 350 real GitHub issue‑resolution instances spanning eight programming languages and 43 repositories, drawn from SWE‑bench‑Multilingual and SWE‑bench‑Verified‑Mini, and evaluated with the upstream SWE‑bench evaluator.

Every system reports Pass@1 together with total API cost, average wall‑clock duration, and cache‑hit rate under a shared outer budget, enabling direct placement on the same accuracy‑cost Pareto plane.

For rapid prototyping we also provide Claw‑SWE‑Bench Lite, an 80‑instance subset that preserves the full benchmark’s language distribution, key rankings, and cost structure while cutting total cost to roughly 22.9 % of the full run.

The Lite subset is selected by a cost‑aware, rank‑aware method (§3.2) that optimizes resolve‑rate parity, pairwise ranking stability, and cost parity across 17 calibration columns; the resulting mean Pass@1 values differ by only 0.4 percentage points from the full set.

A K‑sweep shows the minimum acceptable per‑language instance count lies in K* ∈ [8, 10]; we release the conservative K = 10 point as the stable configuration.

Using this protocol we conduct two complementary studies: a model sweep (fixed OpenClaw harness, nine LLMs) and a claw sweep (fixed GLM 5.1 and Qwen 3.6‑flash models, five harnesses).

The model sweep demonstrates that a general‑purpose OpenClaw harness attains competitive Pass@1, confirming that a generic harness can be integrated into SWE‑bench‑style coding evaluation via the adapter protocol.

The claw sweep reveals harness choice as a first‑order factor: under GLM 5.1 the spread reaches 12.5 pp, and under Qwen 3.6‑flash it reaches 27.4 pp, enough to reorder leaderboard conclusions if the harness is omitted.

Overall, accuracy and cost do not move in lockstep; explicit control and disclosure of harness, budget, cost metric, and cache accounting are essential for fair SWE‑style benchmarking.

Defining the Evaluation Task

Defines the CLAW‑SWE‑BENCH benchmark and its workload composition.

Existing coding harnesses cannot be evaluated directly with SWE‑bench because the evaluator expects a patch file, while agents like OpenClaw produce interactive sessions with logs and auxiliary files. This mismatch prevents a fair comparison of coding ability across different harnesses.

CLAW‑SWE‑BENCH addresses this with a unified benchmark that treats the agent harness as a variable, standardizing the patch‑prediction interface while preserving the full SWE‑bench task set.

The CLAW‑SWE‑BENCH workload draws 300 non‑Python tasks from the SWE‑bench‑Multilingual suite and 50 human‑validated Python tasks from SWE‑bench‑Verified‑Mini, yielding 350 real GitHub issue‑resolution problems across eight languages and 43 repositories.

The Adapter Protocol

Standardizes how diverse agent harnesses interact with the SWE‑bench benchmark.

The method standardizes how heterogeneous agent harnesses are evaluated against SWE‑bench. By fixing the surrounding evaluation stack, differences in reported metrics reflect harness design rather than incidental resource variations.

The protocol defines a thin, uniform interface that any agent harness must implement to plug into the benchmark lifecycle.

How does this differ from simply calling a harness’s native API directly?

Direct calls would bypass the benchmark’s fixed Docker image, prompt template, and patch‑extraction logic, so metrics would mix harness behavior with environment differences. The Adapter Protocol forces every harness to operate inside the same container and use the same task representation, isolating the harness’s algorithmic choices.

Claw is the wrapper that runs an agent inside a container, exposing the five adapter methods so the benchmark can control its lifecycle.

Why does Claw collect patches from the repository instead of parsing the agent’s final message?

Parsing free‑form text is brittle and would require each harness to conform to a common output schema. By requiring the agent to edit files, the benchmark can always compute a diff, guaranteeing a uniform patch format regardless of the harness’s native response style.
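As a rough illustration of collecting a patch from repository state rather than from the agent's reply, the sketch below shells out to `git diff` against a base commit. The function names and the injectable `run` parameter are assumptions made for this example; the benchmark's actual patch-extraction and cleaning code is not shown in this summary.

```python
import subprocess
from typing import Callable, List


def build_diff_command(base_commit: str) -> List[str]:
    # `git diff <commit>` compares the working tree against that commit.
    # Caveat: plain `git diff` does not list untracked files, so a real
    # runner would need extra handling for newly created files.
    return ["git", "diff", base_commit]


def collect_patch(
    repo_dir: str,
    base_commit: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Return the working-tree diff against base_commit as patch text."""
    result = run(
        build_diff_command(base_commit),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git exits with a non-zero status
    )
    return result.stdout
```

Because the patch is derived from the files on disk, the same function works for any harness, whatever its final message looks like. The `run` parameter exists only so the function can be exercised without a real repository.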

Load instance metadata (

Start the benchmark Docker image and mount the repository at

Reset the repository to

Instantiate the fixed task prompt using the template that includes the problem statement and base commit.

Invoke the harness via the Adapter Protocol’s

After the harness terminates (or times out), compute the diff against

Write the cleaned diff as a SWE‑bench‑compatible prediction and run the official evaluator.

Record Resolve‑Rate, wall‑clock duration (under the default 3600 s timeout), and any available token statistics.

Command‑line entry point used for all harness runs.

Step 1 loads the metadata:

Step 2 starts the Docker image and mounts

Step 3 removes any future commits, ensuring the agent cannot peek ahead.

Step 4 builds the prompt and sends it to OpenClaw via the adapter.

Step 5 the agent edits

Step 6 the runner computes the diff, yielding a

Step 7 the SWE‑bench evaluator runs the test suite; the patch passes, giving a Resolve‑Rate of 1.0 and a wall‑clock time of 842 s.

This concrete run shows how the adapter enforces a uniform environment while allowing the harness to express its solution through ordinary file edits.

Lite Benchmark Calibration

Lite‑80 provides a low‑cost, representative subset of the full benchmark for rapid iteration.

Lite‑80 is a low‑cost subset of the full‑350 benchmark that preserves Pass@1 performance, per‑language distribution, cross‑claw behavior, and cost structure.

How does Lite‑80 differ from a random sample of 80 instances?

Lite‑80 enforces language balance, a fixed difficulty‑quartile quota, and matches the full benchmark on 17 calibration columns, whereas a random sample would likely skew toward certain languages or difficulty levels and would not preserve the resolve‑rate and cost parity guarantees.
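One way to read "resolve-rate parity" is as the gap between a column's Pass@1 on the subset and on the full set, averaged over calibration columns. The sketch below computes that quantity on toy data. It is an assumption about how such a check could look, not the paper's selection procedure, which also optimizes ranking stability and cost parity; the column names, instance IDs, and outcomes are invented.

```python
from typing import Dict, Set

# results[column][instance_id] is True when that (claw, model) column
# resolved the instance. All data below is hypothetical.
Results = Dict[str, Dict[str, bool]]


def pass_at_1(outcomes: Dict[str, bool], ids: Set[str]) -> float:
    """Percentage of the given instances that were resolved."""
    return 100.0 * sum(outcomes[i] for i in ids) / len(ids)


def mean_parity_gap(results: Results, subset: Set[str]) -> float:
    """Mean absolute Pass@1 gap (in pp) between the subset and the full set."""
    gaps = []
    for outcomes in results.values():
        full = pass_at_1(outcomes, set(outcomes))
        lite = pass_at_1(outcomes, subset)
        gaps.append(abs(full - lite))
    return sum(gaps) / len(gaps)


# Toy data: two calibration columns over four instances.
results = {
    "col_a": {"i1": True, "i2": False, "i3": True, "i4": False},
    "col_b": {"i1": True, "i2": True, "i3": True, "i4": False},
}
gap = mean_parity_gap(results, {"i1", "i2"})
print(gap)
```

A subset selector would search for the set of instance IDs that keeps this gap small while also respecting the language and difficulty quotas.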

Experimental Setup

We detail the experimental design, variables, and measurement protocol.

We evaluate the interaction of two sources of variation—LLM choice and claw implementation—using the CLAW‑SWE‑BENCH benchmark.

Select a reference claw (OPENCLAW) and run the full model sweep across nine LLMs, recording PASS@1, TOTAL COST, and CACHE HIT RATE for each.

Fix two representative models (GLM 5.1 and QWEN 3.6‑FLASH) and run the claw sweep across the five claw variants, again measuring the three metrics.

Validate that the Lite‑80 subset reproduces the trends of the full 350‑instance set by comparing OPENCLAW’s PASS@1 on both workloads.

All runs share a common runtime configuration: 3600 s wall‑clock timeout per instance, concurrency = 3, executed on a 16‑core CPU server with 61 GiB RAM, no GPU.

Run a bare‑vs‑full adapter diagnostic with GLM 5.1 on the full‑350 workload to isolate the effect of the full adapter protocol.

Apply the leak‑fix protocol for multilingual tasks, removing future‑commit visibility before evaluation.

PASS@1 measures the fraction of coding tasks where the agent’s submitted patch is accepted as a correct solution.

Why is PASS@1 reported instead of a traditional accuracy score?

Coding tasks have a binary outcome: either the generated patch solves the problem (RESOLVED) or it does not, so the fraction of resolved instances directly reflects practical success.

TOTAL COST captures the actual monetary expense incurred by invoking the LLM APIs for the entire benchmark run.

CACHE HIT RATE measures the share of prompt tokens served from the provider's cache rather than processed as fresh input, computed as cache‑read tokens divided by the sum of cache‑read and input tokens.

The evaluator returns RESOLVED → contributes 1 to the numerator of PASS@1.

The API billing log records 0.12 USD → added to TOTAL COST.

Cache read tokens = 150, input tokens = 50 → CACHE HIT RATE = 150 / (150 + 50) = 0.75.

This concrete trial shows how a single instance simultaneously influences all three metrics.
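The same bookkeeping can be written as a small aggregation over per-instance records. This is a sketch under stated assumptions: the `InstanceRecord` fields and the `summarize` function are hypothetical names, and the cache hit rate is aggregated by summing tokens across instances, which is one reasonable choice rather than a confirmed detail of the benchmark.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceRecord:
    resolved: bool          # evaluator verdict for this instance
    cost_usd: float         # billed API cost for this instance
    cache_read_tokens: int  # prompt tokens served from the cache
    input_tokens: int       # uncached input tokens


def summarize(records: List[InstanceRecord]) -> Dict[str, float]:
    """Aggregate Pass@1, total cost, and token-weighted cache hit rate."""
    cache_read = sum(r.cache_read_tokens for r in records)
    fresh = sum(r.input_tokens for r in records)
    return {
        "pass_at_1": sum(r.resolved for r in records) / len(records),
        "total_cost_usd": sum(r.cost_usd for r in records),
        "cache_hit_rate": cache_read / (cache_read + fresh),
    }


# The single trial from the text: resolved, 0.12 USD, 150 cached / 50 fresh.
summary = summarize([InstanceRecord(True, 0.12, 150, 50)])
print(summary)
```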

Performance and Cost Results

Key performance gains and cost trade‑offs across adapters and LLMs.

The benchmark measures both coding success (Resolve‑rate/Pass@1) and the practical cost of running agents, extending the original SWE‑bench premise with token usage and API expense.

Full adapter raises resolved rate from 19.1 % to 73.4 % and cuts apply‑failure to under 1.5 %.

Table 1 shows 67/350 vs 257/350 resolved and 69.1 % vs < 1.5 % apply failures.

**Figure 1.** **Resolve-rate–cost Pareto frontier.** Data are from the five-claw × two-model sweep in Table 3. Each point is one claw–model combination on the full 350-instance evaluation; the vertical axis is Pass@1 / resolved rate, and the horizontal axis is full-run total API cost (USD, log scale). The black line connects non-dominated operating points.

The table compares two configurations, "Bare adapter" and "Full adapter," across three metrics: "Resolved," "Pass@1," and "Apply Failed."

The table presents performance and cost metrics for various AI models, categorized into "Flagship" and "Flash" types. Columns include Model, Type, Resolved, Pass@1, Cost, Dur (Duration), In(M) (Input tokens in millions), Out(M) (Output tokens in millions), Turns, and Cache (hit rate).

Detailed Cost Analysis

Cost varies dramatically across models while cache hit rates stay near 97%.

Across the evaluated models, total cost spans more than \$1,300 while cache hit rates remain clustered around 97%.

Table 2 shows costs from \$8.20 (DeepSeek-V4 Flash) up to \$1,399.10 (GPT 5.5) with cache hit rates between 92.1% and 98.5%.

All experiments share a fixed evaluation pipeline: the same OpenClaw adapter, identical prompt templates, a uniform token budget, and the same evaluator implementation.

Future‑Commit Cleanup consistently reduces Pass@1: Claude Opus 4.7 drops 8.0 pp (84.7% → 76.7%), Kimi 2.6 drops 5.0 pp, and Qwen 3.6‑flash drops 2.0 pp; other models change by ≤1 pp.

Varying only the claw (harness) yields a 12.5 pp Pass@1 spread for GLM 5.1 (60.9%–73.4%) and a 27.4 pp spread for Qwen 3.6‑flash (38.6%–66.0%), confirming that the harness alone can dominate performance differences.

Claw Axis Variation

We compare five Claw agents across two models, highlighting cost‑aware trade‑offs and Pareto‑optimal points.

**Table 3.** Cost-accuracy analysis of Figure 1. Cost further changes how claw rankings should be interpreted. Figure 1 is a two-dimensional projection of Table 3: for each claw-model combination, the horizontal axis uses the full 350-instance TOTAL COST, and the vertical axis uses Pass@1 from the same row; OpenClaw × GLM 5.1 follows the leak-fix cost accounting in Table 2. The Pareto frontier consists of points that are not dominated by another combination with both lower cost and higher Pass@1.

Related Work and Limitations

We wrap up the benchmark contributions, highlight key findings, and outline current limits.

We introduced CLAW‑SWE‑BENCH and its lightweight 80‑instance LITE variant, and showed that a general‑purpose agent such as OpenClaw can be reliably evaluated under a fixed protocol. The experiments demonstrate that harness design is a first‑order factor that can reorder model rankings and alter accuracy‑cost trade‑offs.

SWE‑bench is a collection of real‑world GitHub issues each paired with a repository‑level test suite; it measures whether an agent can produce a patch that makes the suite pass, thereby evaluating end‑to‑end coding‑agent capability.

**Figure 4.** Effect of future-commit cleanup on the OpenClaw model sweep. After cleanup, Pass@1 does not increase for any of the nine models; drops range from 0.6 to 8.0 pp.

Our study has several limitations: results are based on single‑run aggregates, so small percentage‑point differences may not reflect stable superiority; the claw sweep covers only five harnesses and two models, leaving many interaction effects unexplored; and cost analysis relies on provider‑side pricing and cache metrics, which can change and lack raw token traces for independent verification.

Reproducibility Statement

All code, data, and steps needed to reproduce the benchmark are provided.

This appendix gathers every artifact and protocol required to reproduce the benchmark results reported in §4.

Five harness‑adapter packages (`openclaw_swebench`, `hermes_swebench`, `zeroclaw_swebench`, `nanobot_swebench`, and a generic baseline) are released, each containing `run_infer.py`, `run_eval.py`, the orchestrator, agent adapter, and workspace modules. The same registry also provides the minimal bare adapter used for the diagnostic in §5.1.

A Node.js toolkit implements the cost‑aware, rank‑aware Lite selection, K‑sweep, sensitivity checks, quartile‑stratification logic, and final‑report generators. Additionally, matplotlib scripts (`generate_pareto_figure.py`, `generate_leak_fix_figure.py`, `generate_lite_figures.py`) recreate all published figures from the released result workbooks.

Both Claw‑SWE‑Bench (350 instance IDs) and Claw‑SWE‑Bench‑Lite (80 instance IDs) are released as JSON files, together with the curated metadata. The underlying issues and repositories are drawn from the upstream SWE‑bench‑Multilingual and SWE‑bench‑Verified‑Mini sources, with only the selected ID sets redistributed.

Reproduction follows a two‑step CLI workflow. Inference is launched with `python3 run_infer.py`, specifying the harness, dataset language, model identifier, run label, timeout, and worker count. Evaluation then runs `python3 run_eval.py` on the generated predictions, pointing to the SWE‑bench dataset and allowing up to eight parallel workers.

Each (instance, harness, model) cell is executed once with three‑thread concurrency; this single‑run choice trades statistical robustness for cost efficiency. A multi‑seed validation on a 50‑instance slice is left to future work.

Hardware requirements are modest: a 16‑core CPU server with 61 GiB RAM running Linux suffices, and no GPU is needed because all model inference is performed via external APIs.

Compute and Environment

Details of hardware, software, parameters, and compute cost for the experiments.

All five claws (OPENCLAW, HERMES‑AGENT, ZEROCLAW, NANOBOT, and the GENERIC baseline) were run on a single host with identical run‑time settings; the harness implementation is the sole experimental variable.

The host runs an `x86_64` Linux 6.8.0‑106‑generic system, offering 16 CPU cores, 61 GiB of RAM (≈44 GiB usable), and a 99 GiB disk with about 75 GiB free; no GPU is present because model inference is performed via remote APIs.

**Table 4.** Host hardware. Model inference runs on remote provider APIs, so no local GPU is required; the host is used only for harness orchestration, Docker containers, and patch evaluation.

The software stack binds the SWE‑bench harness and per‑claw virtual environments into the evaluation container, using Docker images for isolation and CPython 3.12.13 (uv‑installed) for the Python agents.

**Table 5.** Software stack per claw. Standalone Python and the harness virtualenvs are bind-mounted into the SWE-bench evaluation container so that the agent loop runs inside the same container as the patched code.

Uniform run‑time parameters were enforced via CLI flags: a per‑instance timeout of 3600 seconds, a single repeat per instance, three concurrent threads, high reasoning effort for hermes, and a 180‑second shell‑tool timeout.

**Table 6.** Run-time parameters. These are overridden via CLI flags so that all five claws see the same per-instance budget; the prompt template (Appendix C) is also identical across claws.

Four API providers supplied the remote model calls: OpenRouter (default routing), DashScope (Alibaba), Infini‑AI, and DeepSeek (including the V4 Pro/Flash variant).

**Table 7.** API providers used during experiments. Only base URLs and model identifiers are released as part of the artifact; API keys are NOT included in any released artifact (see Appendix F).

The full experimental sweep covered 17 (claw, model) columns across 350 instances each, amounting to roughly 1,148 hours of wall‑clock time (≈47.8 days single‑thread, ≈15.9 days on the three‑thread schedule); the model sweep contributed about 671 hours and the remaining claws about 477 hours.

These durations include remote API latency and represent end‑to‑end operating cost; they exclude adapter diagnostics, pre‑cleanup runs, and Lite‑80 validation runs.
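The wall-clock figures above follow from simple arithmetic on the two reported sweep totals. The snippet below only re-derives the day counts; the constant names are illustrative.

```python
# Hours reported for the two parts of the sweep.
MODEL_SWEEP_HOURS = 671
OTHER_CLAWS_HOURS = 477
CONCURRENCY = 3  # instances run in parallel on the schedule

total_hours = MODEL_SWEEP_HOURS + OTHER_CLAWS_HOURS
single_thread_days = total_hours / 24
scheduled_days = single_thread_days / CONCURRENCY

print(total_hours, round(single_thread_days, 1), round(scheduled_days, 1))
```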

Harness Configurations

Standardized prompts, runtime, and orchestration while varying inner harness implementations.

All five claws share the same prompt template (D.0), the same run‑time limits (3600 s timeout, concurrency 3, one repeat), and the same outer orchestration: build the prompt, execute via Docker, then collect the git diff.

The only difference among claws lies in their inner harness implementation (D.1–D.5), which changes the CLI surface, the agent loop, the tool set, and the model adapter.

This uniformity forms the methodological basis for the claw sweep in §5.4, keeping prompt and run‑time budget fixed while treating the harness as the experimental variable.

The shared prompt template (D.0) is summarized as follows.

The environment rules prohibit git add/commit and any modifications to test files.

Typical commands used by the harness include listing files, reading files, grepping, editing with sed, running tests, checking diffs, and writing temporary scripts.

The harness expects a block containing the problem statement, which it then uses to guide the three‑phase resolution process.

Phase 1 reads and rephrases the problem, highlighting relevant details; Phase 2 builds and runs the tests; Phase 3 explores the repository to locate and fix the relevant files.

Adapter Protocol Details

Details of the openclaw adapter wrapper, its tool deny‑list, and per‑instance execution flow.

Phases 4–8 define a disciplined workflow for reproducing, fixing, and verifying a bug in a SWE‑bench instance.

Phase 4 creates a minimal reproduction script: first inspect existing tests (4.1), then write a script that isolates the failure (4.2), run it to confirm the bug (4.3), and iterate on the script until it reliably reproduces the issue (4.4).

Phase 5 articulates the problem and its fix: state the exact symptom (5.1), locate the offending code (5.2), describe how the test triggers the bug (5.3), list best‑practice constraints for the fix (5.4), and finally prescribe the concrete change (5.5).

Phase 6 applies a minimal, focused edit to the source (6.1).

Phase 7 validates the change: rerun the reproduction script (7.1), extend it with edge‑case scenarios (7.2), and execute the full test suite to ensure no regressions (7.3).

Phase 8 performs a final audit against the original problem description and the base commit, confirming coverage of the targeted issue (8.1), re‑running relevant tests (8.2), and iterating until all tests pass (8.3).

The openclaw adapter wrapper is a stateful Node.js harness that spawns a temporary per‑instance openclaw agent with its own workspace, session directory, and a tool deny‑list.

Implementation provides a Node.js entry point (openclaw.mjs) that manages full agent lifecycle—creation, deletion, tool‑deny configuration, and session backup—ensuring each SWE‑bench instance runs in isolation.

The wrapper configures a deny‑list that disables memory, web, session‑spawning, sub‑agent, cron, and image tools, leaving only file I/O and process execution available.
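A deny-list of this kind amounts to filtering the tool registry by category. The sketch below illustrates the idea only: the tool identifiers and category labels are hypothetical placeholders, since the real openclaw tool names and its JSON config schema are not given in this summary.

```python
from typing import Dict, List, Set

# Categories the text says are disabled (labels are illustrative).
DENIED_CATEGORIES: Set[str] = {
    "memory", "web", "session_spawn", "sub_agent", "cron", "image",
}

# Hypothetical tool registry mapping tool name -> category.
TOOL_CATEGORIES: Dict[str, str] = {
    "read_file": "file_io",
    "write_file": "file_io",
    "exec": "process",
    "web_search": "web",
    "memory_store": "memory",
    "spawn_agent": "sub_agent",
}


def allowed_tools(tool_categories: Dict[str, str], denied: Set[str]) -> List[str]:
    """Return the tool names whose category is not on the deny-list."""
    return sorted(
        name for name, category in tool_categories.items()
        if category not in denied
    )


tools = allowed_tools(TOOL_CATEGORIES, DENIED_CATEGORIES)
print(tools)
```

With this toy registry only the file I/O and process-execution tools survive, matching the restriction described above.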

The status table enumerates each tool with a concise note indicating whether it is allowed or denied for the openclaw harness.

The scaffolding loop assigns a unique agent ID per SWE‑bench instance, creates a dedicated workspace, applies the deny‑list via the JSON config, launches the container, and executes the task‑solving loop inside the isolated agent.

Claw Implementations

Implementation details for each Claw harness variant.

openclaw is a Node.js harness that executes ReAct‑style reasoning, returning structured JSON either via stdout (gateway mode) or embedded in the output.

openclaw invocation – runs a single agent with a timeout and JSON result.

openclaw enforces a 3600 s wall‑clock limit per instance, classifies finish reasons (stop, error, empty, timeout), and permits a single retry.
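The finish-reason handling can be pictured as a small classifier plus a retry wrapper. The sketch below is an assumption-laden illustration: the classification rules, the choice to retry only on `error` and `empty`, and all function names are hypothetical, because the summary states the four reasons and the single retry but not the exact conditions.

```python
from typing import Callable, Optional, Tuple

TIMEOUT_S = 3600  # per-instance wall-clock limit from the text

# One attempt yields (exit_ok, output, elapsed_seconds).
Attempt = Callable[[], Tuple[bool, Optional[str], float]]


def classify_finish(exit_ok: bool, output: Optional[str], elapsed_s: float) -> str:
    """Map one run outcome to stop / error / empty / timeout (assumed rules)."""
    if elapsed_s >= TIMEOUT_S:
        return "timeout"
    if not exit_ok:
        return "error"
    if not output or not output.strip():
        return "empty"
    return "stop"


def run_with_single_retry(attempt: Attempt) -> str:
    """Run once, and retry at most once on error or empty (assumed policy)."""
    reason = classify_finish(*attempt())
    if reason in ("error", "empty"):
        reason = classify_finish(*attempt())
    return reason
```

A timeout is not retried in this sketch, since a second attempt would exceed the shared per-instance budget.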

hermes‑agent is a Python harness that runs statelessly via a CPython runtime; create/delete/backup calls are no‑ops because the adapter invokes hermes chat directly.

hermes‑agent invocation – stateless call with terminal and file toolsets.
