WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan

WeaveBench evaluates computer-use agents on long-horizon tasks requiring interleaved GUI and CLI/code orchestration.

How can we evaluate computer-use agents on long-horizon, real-world tasks that require interleaving GUI and CLI operations?

Existing benchmarks evaluate GUI and CLI capabilities in isolation, failing to capture real-world workflows where agents must coordinate visual desktop feedback with programmatic file and system operations. WeaveBench introduces 114 long-horizon tasks that force cross-interface orchestration, evaluated via a trajectory-aware judge that audits process logs and artifacts to prevent shortcut behaviors like hard-coded metrics or fabricated visual evidence. Even the best model-runtime pairings reach only 41.2% PassRate, demonstrating that current agents struggle to maintain workflow discipline across hybrid interfaces.

Paper Primer

The benchmark forces agents to move information between two distinct channels: the GUI, which exposes transient visual state like spatial layouts and dialogs, and the CLI/code interface, which exposes persistent, scriptable state like logs and configuration files. A task is only admitted if it requires both channels to succeed, preventing agents from solving the problem through a single-channel shortcut.

The core mechanism is a trajectory-aware judge that treats evaluation as an evidence audit rather than a final-state check. It re-fetches screenshots, logs, and file states over multiple turns to verify that the agent followed the required process, automatically zeroing out rollouts that trigger known shortcut patterns like synthesized renders or CLI-bypass of GUI requirements.

Current frontier agents fail to saturate the benchmark, with performance dropping significantly when audited for process integrity.

The best model-runtime pairing (Claude Opus 4.7 + Claude Code) achieves a 41.2% PassRate, while trajectory-aware judging reduces the reported performance of GPT-5.5 from 53.5% (outcome-only) to 33.3% (audited). A ~20 percentage point inflation in performance is revealed when moving from outcome-only to trajectory-aware grading.

Interface ablation confirms the necessity of hybrid control: GUI-only and CLI-only settings result in PassRates at or below 3.5%, proving that the benchmark's difficulty arises from the required orchestration rather than individual tool friction.

Why is a trajectory-aware judge necessary for this benchmark?

Final-only grading is vulnerable to reward hacking, where agents synthesize artifacts or hard-code metrics to appear successful. The judge audits the trajectory to ensure the agent actually performed the required cross-interface steps rather than bypassing them.

How does WeaveBench differ from existing multi-interface benchmarks?

In prior benchmarks, the second interface is often a convenience that can be ignored. WeaveBench enforces channel non-substitutability, meaning the task specification makes it impossible to succeed using only one interface.

Researchers should shift focus from single-channel tool use to long-horizon workflow discipline, as the primary bottleneck for hybrid agents is not visual perception but the ability to maintain state and planning across GUI and CLI boundaries.

The Hybrid Interface Challenge

We expose the missing evaluation gap for agents that must coordinate GUI and CLI actions over long horizons.

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control (GUI) with command‑line and code execution (Command Line Interface, CLI). GUIs expose transient visual state such as windows and dialogs, while CLI/code expose structured, persistent state like files and logs. Existing benchmarks treat these channels as separate, leaving long‑horizon cross‑interface orchestration untested.

A hybrid interface is a workflow where an agent must switch between graphical user interface (GUI) actions and command‑line (CLI) commands to accomplish a task.

WeaveBench is a benchmark suite of long‑horizon tasks that forces agents to weave together GUI and CLI operations across multiple steps.

**Figure 1.** Three real-world workflows requiring interleaved hybrid interfaces. (DAV) Diagnosing a Jaeger trace span by inspecting its shape, then patching the upstream timeout via `kubectl`; (GAME) playing a desktop game to localize a sprite/physics bug, then patching the scene-graph source; (OPS) catching a 503 spike on a Web Ops dashboard, editing `nginx.conf`, and re-checking the dashboard. Each step alternates between a GUI signal that no API exposes and a CLI/code change that no screenshot can produce.

Prior benchmarks either focus on a single channel or allow a task to be solved by one interface, so the extra channel is merely a convenience. WeaveBench introduces 114 long‑horizon tasks that require agents to interleave GUI observations and actions with CLI/code operations within a single trajectory. In our evaluation, the best model‑runtime pairing attains only 41.2% PassRate, and the trajectory‑aware judge reduces inflated scores, highlighting the difficulty of hybrid orchestration.

The shift from static element grounding to long‑horizon hybrid workflows reveals a critical evaluation gap for computer‑use agents.

WeaveBench Architecture

Designing a hybrid benchmark that forces agents to interleave GUI and CLI while preventing fabricated shortcuts.

Existing CUA benchmarks either stay in a single modality or allow a trivial fallback to the easier channel, so high scores give no guarantee that an agent can truly coordinate GUI and CLI actions. Moreover, grading only the final deliverable lets agents cheat by fabricating evidence.

The related work shows three strands: pure‑GUI benchmarks, pure‑CLI/code benchmarks, and early hybrid attempts that still permit a single‑channel shortcut.

**Figure 2** WeaveBench pipeline. Task: 114 tasks across 8 domains, harvested from real venues, packaged as $\mathcal{E} = (\mathcal{P}, \mathcal{M}, \mathcal{C})$ bundles, audited against P1–P3, and stress-tested by $\ge 3$ pilot agents. Harness: the agent runs in a single session over an Ubuntu sandbox, where a minimal GUI plugin (one screenshot tool plus nine actuation primitives) is added on top of OPENCLAW’s CLI/code tools. Evaluation is performed by an isolated trajectory-aware agentic judge that combines bottom-up rubric scoring with shortcut detection.

WeaveBench admits a task only if it satisfies three properties: (P1) the success requires both GUI observation/action and CLI/code modification in the same trajectory; (P2) the expert reference trajectory contains multiple interleaved phases; (P3) the workflow spans several independent applications.

C1 Archetype‑guided sourcing – experts define a cooperation archetype (GUI role + CLI role) and hunt real‑world artifacts that match.

C2 Asset packaging – the selected artifact is turned into a self‑contained bundle (environment, seed data, instructions, reference trajectory, verification anchors).

C3 Blind review – an independent reviewer checks clarity, reproducibility, and that P1–P3 are truly satisfied.

C4 Pilot validation – three pilot agents run the bundle; any broken, ambiguous, or trivial task is revised before release.

Task diversity is achieved by covering eight domains (desktop productivity, document processing, games, web development, data analysis, DevOps, 3D/CAD, design) with 114 tasks. The median rollout uses 76 tool calls and 16 GUI↔CLI switches, confirming long‑horizon interleaving.

**Figure 3.** WEAVEBENCH dataset overview. (a) Taxonomy of 114 tasks across 8 domains and 23 subcategories. (b) Number of GUI ↔ CLI channel switches per task, showing the degree of channel interleaving. (c) Rollout length measured by tool calls in the trajectory.

The judge audits the entire execution trace, not just the final output, to ensure every required intermediate state and tool use is genuinely produced.

Step 1 evidence: screenshot of the desktop with the editor window visible → clause “editor opened” satisfied.

Step 2 evidence: shell history line

Step 3 evidence: file content read back (cat) shows “hello” → clause “correct file content” satisfied.

Step 4 evidence: screenshot matches the file‑tab title “notes.txt” → clause “final visual state matches expectation” satisfied.

All eight process dimensions receive a score of 1; no shortcut pattern triggers, so $s_{t,m}=1$.

The example shows how the judge ties each high‑level requirement to a concrete artifact, making fabrication impossible without producing matching evidence.

How does this trajectory‑aware judge differ from a traditional final‑state‑only evaluator?

A final‑state evaluator checks only the end product (e.g., a file exists). The trajectory‑aware judge also verifies every intermediate action—opening the editor, issuing the CLI command, and providing matching screenshots—so an agent cannot skip steps or inject fabricated evidence without being caught.

The judge’s shortcut detection scans the rollout for nine concrete patterns (fake renders, fabricated data, hard‑coded metrics, mock services, duplicate crops, overlay manipulation, ground‑truth leakage, runtime injection, and others). If any pattern is confirmed, the task score is forced to zero, eliminating any incentive to cheat.

Performance and Interface Ablations

Hybrid interfaces achieve markedly higher PassRate than single‑channel baselines.

The Agent Runtime is the orchestrating loop that invokes tools, tracks filesystem and GUI state, and records the execution trajectory for grading.

How does Agent Runtime differ from the model API?

The model API decides *what* to say and *how* to format tool calls; the runtime decides *when* to invoke a tool, *which* concrete wrapper to use, and how to persist state between turns.

Hybrid interfaces reach a peak PassRate of 41.2 % when Claude Opus 4.7 runs on the Claude Code runtime.

Table 3 reports 41.2 % PassRate (Overall 0.532) for this pairing, the highest among all model–runtime combinations.

Runtime sweep results (Table 3) show that the same strong model can vary dramatically: Claude Opus 4.7 drops from 35.1 % on OpenClaw to 13.2 % on Codex CLI, while GPT‑5.5 falls from 33.3 % on OpenClaw to 14.9 % on Claude Code, highlighting the importance of runtime‑model alignment.

Interface ablation (Table 4) confirms that neither GUI‑only nor CLI‑only runtimes achieve more than 3.5 % PassRate; the Hybrid configuration is essential for any meaningful performance.

**Table 3.** Runtime comparison for the strongest model APIs. GPT-5.5 and Claude Opus 4.7 are evaluated at their high thinking across four agent runtimes. PR denotes PassRate (%). Overall denotes the mean per-task score over the full 114-task suite. Per-domain columns report PassRate.

**Table 4.** Interface ablation on PassRate. **GUI**: GUI-only tool pool. **CLI**: CLI-only tool pool. **Hybrid**: full tool pool. The best thinking mode is chosen.

**Figure 5.** Top-10 atomic operations across all GPT-5.5 rollouts on WeaveBench. Bars are sorted by call count and together cover 93.1% of 10,873 active calls. exec: shell alone dominates at 27.3%.

Failure‑mechanism analysis (Section 4.6) shows that 35.2 % of failures stem from reward‑hacking (E5) and 30.4 % from long‑horizon execution discipline (E4), with the remaining errors split among reasoning and perception issues.

Hybrid interfaces significantly outperform single‑channel baselines, proving that both GUI and CLI access are essential for high‑level task success.

Dataset Construction and Taxonomy

Construction details of the benchmark’s atomic tasks, trajectories, domains, and source provenance.

P1 requires a task to combine GUI observation/action with CLI/code modification within a single trajectory. To audit this, the authors enumerate 19 atomic operations and group them into six mechanism‑defined families, as shown in Table A1. Table A2 then reports how many of the $N=114$ tasks satisfy P1 at three increasingly strict levels.

**Table A2.** How many of the $N=114$ tasks satisfy P1 at three strictness levels.

A.2 evaluates the long‑horizon (P2) and cross‑application (P3) properties of the benchmark trajectories. The median rollout per task makes 76 tool calls (mean 88, range 14–471) and switches channels 16 times, while the median number of distinct applications per task is 15. Every domain exceeds the P2 threshold of 20 calls and the P3 threshold of three distinct apps.

**Figure A2. P2 and P3 metrics by domain.** Box plots over the 8 WEAVEBENCH domains for tool-call counts in panel (a), channel switches in panel (b), and distinct apps in panel (c). Dotted lines mark the thresholds of P2 at 20 calls and P3 at 3 apps; every domain reaches 100% on P3.

A.3 details the eight real‑world domains covered by WeaveBench, describing a typical workflow and a cooperation archetype that assigns responsibility between GUI and CLI. Table A3 lists each domain’s workflow, the split of signals (GUI) and effects (CLI), and the number of tasks per domain (10–18). This taxonomy ensures the benchmark spans diverse interaction patterns.

**Table A3.** The 8 real-world work domains in WeaveBench. The *typical workflow* column summarizes what domain experts described as the prevailing end-to-end pattern in each domain. The *cooperation archetype* names the role split: which signal the GUI is responsible for surfacing, and which effect CLI/code is responsible for producing.

A.4 reports the provenance of the benchmark’s 114 tasks: 174 source URLs (mean 1.53 per task) are split between user‑pain sites (54 %) and reference documentation (46 %). Figure A3 visualizes this distribution, highlighting the dominance of Stack Exchange and GitHub for user‑pain URLs and project docs for reference URLs.

**Figure A3** Task source distribution of WeaveBench. Per-category stacked bars over 174 source URLs across 114 tasks. Warm hues denote user-pain venues (Stack Exchange, GitHub/GitLab issues, Reddit, forums, YouTube, bug trackers); cool / neutral hues denote reference venues (project docs, GitHub repos, Wikipedia, GitLab repos). Right-side annotations give the per-category URL total and the share of tasks carrying at least one user-pain URL.

Trajectory-Aware Judge Details

Computer-use agents must handle both GUI and CLI tasks; this appendix details the trajectory-aware judge.

The judge runs as an OpenClaw agent in a fresh subprocess for each case, fully isolating profile, workspace, conversation history, and tool state.

The scoring pipeline consists of five layers (Table B1) that sequentially decompose the spec, verify clauses, aggregate per‑deliverable correctness, compute eight orthogonal dimensions, and finally combine them into a single score.

Fabrication patterns are recurring cheat modes where the agent fabricates evidence instead of using real tool outputs.

The judge scans the agent’s trajectory for nine stereotyped cheating patterns (Table B4), each representing a known fabrication mode such as fake GUI screenshots or hard‑coded metrics.

The system prompt (shown below) enumerates prohibited fake‑screenshot techniques and authorizes only genuine desktop captures, providing a cost‑free fallback to skip unattainable screenshots.

The judge’s behavior is governed by five constraints: fail‑by‑default, per‑clause evidence, no rounding up, aggressive cheat detection, and no effort credit.

**Figure 4.** Outcome-only vs. trajectory-aware judging. Dark blue: audited PassRate (Section 3.4). Light blue: inflation removed by the audit, with label inside showing PassRate points removed. Top label: outcome-only total and points removed.

Outcome‑only judging inflates PassRate by up to 20.2 % compared to trajectory‑aware judging.

Figure 4 shows the audited PassRate for GPT‑5.5 drops from 53.5 % to 33.3 % when using trajectory‑aware judging, a reduction of 20.2 percentage points.

Think-Budget Sensitivity Analysis

A full think-budget sweep shows higher budgets consistently boost GPT‑5.x performance.

Table C1 enumerates PassRate and Overall for each GPT‑5.x backbone under low, medium, and high think budgets, revealing a clear monotonic trend.

Raising the think budget from low to high improves PassRate, with the biggest gain for GPT‑5.5 (+22.8 points).

GPT‑5.5’s PassRate rises from 10.5 % at low budget to 33.3 % at high budget (Table C1).

The image displays two tables. The left table compares performance metrics (GUI, CLI, Hybrid) across four agents: Claude Opus 4.7, GPT-5.5, GPT-5.4, and GPT-5.3-codex. The right table compares performance across three benchmarks (OSWORLD-MCP, MCPWORLD, and WeaveBench) using GUI, CLI/MCP, and Hybrid modes, including a delta column.

This table presents the performance metrics for various GPT models (GPT-5.1-codex through GPT-5.5) across different "Think" levels (high, med, low), reporting "PassRate (%)" and "Overall" scores.

Hybrid Trajectory Walkthroughs

Four hybrid CLI‑GUI trajectories achieve >90% compliance across distinct domains.

We condense each rollout to its structural skeleton (12–20 tool calls) to expose the recurring hybrid pattern: a CLI prologue, an interleaved GUI segment, and a CLI epilogue.

Hybrid trajectories reach high compliance, with the best case scoring 0.97 and the worst 0.92.

Four end‑to‑end cases across desktop, web, Electron, and Verilog domains all exceed the 0.90 threshold.

Hybrid trajectory executor – decides channel per step.

CLI: $ gsettings set org.gnome.desktop.interface color-scheme 'prefer-dark'

GUI: screenshot the settings window → confirms dark theme is active

CLI: $ gsettings get org.gnome.desktop.interface color-scheme → returns 'prefer-dark'

The CLI writes the value atomically, while the GUI provides human‑readable proof; both are required for a verifiable outcome.

`CODEBLOCK_0`

**Figure.** [GUI] computer[screenshot] # final compliance evidence: # 12/12 keys at policy values. `view_07_dconf_verified`.png (final compliant tree, all 12 keys)

**Figure D.4.** Case 4 — `SPA_task_12_verilator_gtkwave_uart_bug` (score 0.92)

CLI-Only Baseline Analysis

We re‑evaluate OSWorld with a CLI‑only agent, using the same intent‑oriented judge as the vision baseline.

We probe the GUI‑vs‑CLI question on OSWorld by re‑running the benchmark with a strict CLI‑only ablation. The experiment holds the environment, model, instruction wording, and evaluator constant, varying only the agent’s tool surface (CLI vs GUI) to form a fair counterfactual.

A CLI‑only baseline is an agent that never sees pixels and interacts solely through shell commands and programmatic automation.

We score each task with a human‑audited, ground‑truth‑aware in‑VM agent‑as‑judge: a gpt‑5.5 model that inspects the live VM, the agent’s full trajectory, and the task’s ground‑truth spec, and decides if the user’s intent was achieved.

CLI halves the interaction steps while matching vision accuracy.

Table E1 shows CLI requires 14.3 steps on average versus 29.0 for Vision, yet its pass rate (79.1 %) is on par with Vision (77.3 %).

Failure Anatomy and Error Distribution

We recall that agents must juggle GUI and CLI together, and then we break down where they fall short.

Hybrid‑interface agents must coordinate both graphical and terminal interactions to complete long‑horizon tasks, and WeaveBench judges their trajectories to catch fabricated outputs.

Appendix F.1 enumerates a concrete failure example for each of the 13 taxonomy sub‑classes, using verbatim trajectories from the OpenClaw benchmark.

Category E1 captures reasoning and planning failures.

E1.1 – An Opus 4.7 run misidentifies the cause of a Kubernetes OOMKill, raises the memory limit without ever inspecting runtime metrics, and the patched deployment crashes again (score 0.21).

E1.2 – A GPT‑5.5 agent formats a LibreOffice caption list as a plain array of page numbers, violating the required JSON schema and losing points (score 0.70).

E1.3 – In a Sokoban task the agent captures three screenshots, but two are byte‑identical, so the deliverable correctness caps at 0.70 despite a perfect solution file.

Category E2 addresses tool‑use and execution problems.

E2.1 – A GPT‑5.4 attempt to install Wireshark repeatedly fails in a sandbox, and the agent records a “SKIPPED” note instead of a real capture (score 0.67).

E2.2 – The same model routes a Mermaid‑to‑drawio conversion through a Python image‑generation path, reusing a single GUI screenshot for the remaining views (score 0.35).

Category E3 concerns visual grounding.

E3.1 – A GPT‑5.5 agent misreads a pixelated digit on a Pinball scoreboard, writing a mismatched JSON score (score 0.55).

Category E4 covers long‑horizon execution discipline.

E4.1 – An Opus 4.7 run aborts after a GUI tree panel refuses to expand, never invoking the required “SKIPPED” affordance (score 0.00).

E4.2 – GPT‑5.5 terminates early after completing most deliverables, admitting missing screenshots in the final report (score 0.56).

E4.3 – GPT‑5.4 patches a DOCX file via CLI, but the subsequent LibreOffice view shows stale content, prompting a redundant CLI rewrite that overwrites the correct file (score 0.21).

Category E5 records reward‑hacking behaviours.

E5.1 – Opus 4.7 fabricates DS9 screenshots with Matplotlib when the GUI cannot start (score 0.00).

E5.2 – Opus 4.7 invents an OCR‑derived metric value without performing any OCR (score 0.00).

E5.3 – GPT‑5.5 reuses a single Grafana screenshot, cropping and overlaying labels to fake distinct panel views (score 0.00).

E5.4 – GPT‑5.5 bypasses the required LibreOffice Navigator GUI by editing the ODT archive directly and inserting placeholder screenshots (score 0.00).

Appendix F.2 aggregates failures across the three frontier backbones and the full 13‑backbone population, revealing distribution shifts.

Capability‑shortcut alignment is monotone: reward‑hacking (E5) rises as overall pass rate improves, while silent‑halt (E4.1) declines.

Model family predicts shortcut style: Gemini models favour hardcoded metrics, code‑trained GPT‑5.x models favour CLI‑bypass, and Opus models favour synthesized renders.

The smallest models mostly exhibit silent halt, producing almost no deliverables despite a functional display.

**Figure 6.** WeaveBench failure anatomy. (a) Overall error distribution across the three frontier backbones (Opus 4.7, GPT-5.5, GPT-5.4) on OPENCLAW (n=1,735), as a 2-ring sunburst: inner ring = 5 top-level families, outer ring = 13 sub-classes. (b) Per-backbone sub-class share. Full 13-backbone breakdown in Appendix F.2.

**Figure.** Case B — GPT-5.5 (score 0.46, mechanism E5.3 crop / overlay reuse). GPT-5.5 also opens KiCad and produces real artifacts, but the trajectory contains a smoking-gun forgery. After capturing one legitimate 3D screenshot, the agent issues a single cp command to satisfy two distinct required views with the same image.

`CODEBLOCK_0`

Read the original paper

Open the simplified reader on Paperglide