Agents’ Last Exam
Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu
ALE is a benchmark of 1K+ expert-authored, long-horizon professional workflows designed to measure AI economic utility.
How do current AI agent systems perform on complex, multi-step professional workflows that require real-world software interaction?
Current AI benchmarks measure abstract competence but fail to capture the long-horizon, tool-intensive workflows that define professional economic output. The authors introduce Agents’ Last Exam (ALE), a benchmark of 1,490 expert-authored task instances across 55 digital industries, where agents must interleave GUI interaction, CLI operations, and domain-specific software to produce verifiable professional deliverables. Frontier agents currently struggle to clear these tasks, with the hardest tier recording near-zero pass rates and even the easiest tier remaining far from saturated.
Paper Primer
ALE targets the Generalist Computer-Use Agent (GCUA), a system capable of managing an entire action loop across five functional layers: Brain (reasoning), Eyes (GUI perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). The benchmark forces this integration by requiring agents to navigate real professional software environments rather than synthetic or CLI-only sandboxes.
ALE is significantly harder than existing CLI-focused benchmarks.
Codex with GPT-5.5 achieves an 82% pass rate on Terminal-Bench but records only a 25.2% overall pass rate on the ALE-CLI subset. The gap indicates that ALE’s professional workflows demand capabilities beyond those tested by current terminal-centric evaluations.
Frontier agent performance remains low on the benchmark's most difficult tier.
Across mainstream harness and backbone configurations, the average full pass rate on the "Last Exam" tier is 2.6%. Most mainstream agents, including Claude Code, record near-zero pass rates at this difficulty level.
Why does this benchmark rely on expert-authored tasks rather than synthetic ones?
Synthetic tasks often fail to capture the complexity and "real-world" messiness of professional practice. By sourcing tasks from projects experts have already shipped, ALE ensures that agents are evaluated on authentic, economically valuable workflows that require professional judgment.
How does ALE handle the heterogeneity of professional deliverables?
ALE avoids open-ended LLM-as-judge scoring. Instead, it uses deterministic checks (e.g., hashed values, geometric distances) or narrow, evidence-anchored rubrics to verify artifacts like CAD files, code, or rendered media.
Introduction and Motivation
We expose the evaluation gap between current benchmarks and real professional workflows, introducing ALE to bridge it.
Recent AI systems excel on many benchmarks, yet these gains have not translated into economically meaningful deployment. The core issue is an evaluation gap: existing benchmarks measure isolated capabilities rather than sustained performance on real professional workflows. Agents' Last Exam (ALE) is introduced to fill this gap by providing a high‑fidelity, environment‑based benchmark of long‑horizon, economically valuable tasks.
A Professional Workflow is a multi‑step process that intertwines GUI interaction, command‑line operations, and domain‑specific software to produce a verifiable deliverable. Measuring agents on such workflows reveals whether they can replace or augment human expertise in real industry settings.
**Figure 1.** **Agents' Last Exam** spans a broad taxonomy of professional tasks and realistic workflows.
**Figure 3.** Benchmark positioning map. Prior benchmarks are placed by mapping their published domains onto the ALE domain taxonomy.
The persistent gap between benchmark scores and economic utility underscores why ALE’s real‑world focus is essential for translating AI progress into GDP‑relevant impact.
Benchmark Design and Taxonomy
Defines the ALE benchmark’s task criteria, taxonomy, construction pipeline, and release strategy.
The benchmark is built around three non‑negotiable requirements that filter which professional workflows become ALE task instances.
ALE curates full‑workflow professional tasks that use the same software a domain expert would, are substantial end‑to‑end deliverables, and produce outputs that can be checked automatically or with a clear rubric.
**Figure 2.** Distribution of 1,490 task instances across the ALE taxonomy. Each row is one of the 55 subdomains, grouped under the 13 top-level domains (parenthetical numbers give domain totals). Stacked bars decompose each subdomain into fully-implemented instances (domain color) and expert submissions awaiting Quality Control (QC) Process (orange). All 55 subdomains receive non-zero coverage. Current runnable task instances target either Linux or Windows virtual machines.
To make the taxonomy comparable across benchmarks, the authors map each prior benchmark’s categories onto the same 55‑subdomain scheme using an LLM‑assisted classifier, revealing a 13‑subdomain gap that no existing suite covers.
**Figure 4.** Task construction pipeline. Tasks proceed from expert sourcing through submission, first-pass review, engineering implementation, and final quality control.
Expert outreach recruits domain specialists via an advisory committee to ensure coverage of all ALE subdomains.
Experts submit proposals through a dedicated web portal, providing a natural‑language description, input files, target software, expected deliverable, and evaluation specification.
First‑pass review assigns conference‑style decisions (major/minor revision, borderline accept, accept, strong accept) and routes required revisions back to the expert.
Engineers convert accepted specifications into runnable containers, perform dry‑runs, and feed any gaps back to the expert for clarification.
The QC committee conducts a peer review of reference outputs, calibrates evaluation bounds, and validates contextual sufficiency before admitting the task to the benchmark.
The portal records all five components and flags missing fields.
Auto‑review runs a lightweight validator that checks file format compatibility and software version availability.
Engineers spin up a SolidWorks container, execute the conversion script, and verify that the generated STL hash matches the reference.
QC reviewers confirm that the reference STL was produced by a senior CAD engineer and that the hash comparison is deterministic.
The task is approved and added to the private pool of ALE instances.
This concrete walk‑through shows how each of the five required components is validated before a task reaches the public benchmark.
**Figure 5.** Provenance and review yield. The 1,490 task instances split into 960 external submissions (top, by first-pass review verdict) and 530 commissioned tasks (bottom). Each bar is segmented by release state: 150 public, 1,017 private and 323 unverified pending QC.
ALE mitigates benchmark contamination by exposing only 150 of the 1,490 task instances (≈10 %) publicly; the remaining tasks stay private and are periodically rotated into the public set while retired tasks are retired, preserving an uncontaminated evaluation surface.
**Figure 10.** Software ecosystem covered by ALE tasks. Each icon is a distinct application or toolchain that appears in at least one task workflow, positioned within its primary ALE domain. Overlap regions hold tools that span multiple domains (e.g., creative-suite applications shared between Visual & Media Arts and Engineering). The figure is qualitative; quantitative per-subdomain instance counts appear in Figure 2.
The Evaluation Pipeline
The pipeline wires task specs, agent harnesses, and environments into a reproducible evaluation loop.
The evaluation pipeline orchestrates three decoupled components—task specification, agent harness, and environment—so that any agent can be tested on any task without code changes.
An Agent Harness bundles a foundation model with the plumbing that lets it act on a virtual machine—sending clicks, typing, invoking tools, and processing feedback.
How does an Agent Harness differ from a simple script runner that executes commands on a VM?
The harness adds a language model that decides *what* to do, maintains a persistent context, and can invoke heterogeneous tools (GUI, API, file system) on the fly, whereas a script runner follows a static sequence of commands without reasoning or adaptive feedback.
The Evaluation Pipeline stitches together a Task Specification, an Agent Harness, and a remote Environment so that any agent can be evaluated on any professional workflow without rewriting code.
What would break if the load() function omitted the compute‑resource declaration?
Without declaring resources, the provisioning step cannot guarantee a deterministic VM configuration; subsequent start() may allocate insufficient CPU or memory, causing the agent’s actions to fail or produce nondeterministic results, which invalidates the scoring.
The Agent Harness receives the task description, builds a system prompt, and enters an action loop that reads the VM’s screenshots, issues tool calls, and writes to
The Agent Harness reads the description, issues a
This minimal run shows how the three lifecycle functions coordinate resource declaration, deterministic provisioning, and deterministic scoring, even for a trivial task.
Agent action loop skeleton.
**Figure 6.** **Evaluation pipeline architecture.** Each benchmark instance is defined by a Task Specification (main.py) that orchestrates a three-phase lifecycle (load(), start(), evaluate()) over a remote virtual-machine environment. The agent (harness + model) receives only the task description and metadata, interacts with the environment through an action loop, and produces output artifacts that the specification scores against references or rubrics.
**Figure 8.** Typical GCUA harness architecture. The main agent loop (left) cycles through context building, LLM inference, action decision, tool execution, and overflow management. The system prompt builder, tool system (including GUI harness via MCP), sub-agents, and context compaction manager are shared across mainstream harness implementations.
**Figure 7.** **Agent capability taxonomy.** Five functional layers define an agent's operational surface. Generalist CUA-agents (GCUA) possess full capability across all layers; CLI-agents lack visual perception (Eyes); GUI-agents have limited orchestration, tool use, and runtime access (Body, Hands, Feet).
Table 1 (not reproduced here) reports full‑pass rates, scores, API costs, and token usage for a variety of agent harnesses evaluated on ALE.
Experimental Results
We report how model choice dominates performance variation across agent harnesses.
ALE supplies task instances taken from real professional workflows, and we evaluate them with agents configured as Generalist CUA‑agents (GCUA) that can see and act across all five functional layers.
Model choice drives far larger performance variation than harness choice.
Figure 17 shows an 18.0 pp spread when varying the backbone model, versus only 5.3–6.0 pp when varying the harness.
**Figure 12.** Model choice vs. harness choice. Each dot is one configuration; the vertical bracket shows the full range of overall pass rates. Varying the backbone model under a fixed harness (OpenClaw, 12 models) produces an 18.0 pp spread, roughly 3x the spread observed when varying the harness under a fixed backbone (5.3–6.0 pp).
**Figure 13.** Performance vs. resource consumption for mainstream agent harnesses. Each bubble represents one harness-backbone configuration from Table 1; bubble area is proportional to total token consumption. (a) Overall mean score vs. total API cost (configurations with available cost data). (b) Overall mean score vs. total wall-clock time (all 14 configurations). The ideal operating point is the upper-left corner of each panel (high score, low resource use).
**Figure 9.** **Experiment analysis overview.** (a) Domain-level mean scores for Opus 4.7 and GPT-5.5, each averaged over harnesses with completed runs on the selected public task set; the sparse transportation domain is omitted. (b) Tool-call mix for the best available table-backed configuration per harness. (c) Tool-call mix for backbone models under a fixed OpenClaw harness. (d) Failure root-cause taxonomy for failed Claude Code + Opus 4.7 public-task runs.
Performance spread is driven primarily by the choice of backbone model rather than the agent harness.
Representativeness Analysis
Current AI benchmarks are simplified; ALE supplies high‑fidelity, end‑to‑end professional workflows.
In this appendix we revisit the paper’s core claim—standard benchmarks are overly simplistic, while ALE offers a realistic, end‑to‑end evaluation of professional agents.
Public‑subset pass rates track full‑pool pass rates with a strong Pearson correlation.
Figure 16 (bubble chart) shows a tight linear relationship (Pearson $r=0.89$, $p<0.001$) between the two sets of scores.
**Table 1.** Main results on ALE. Each difficulty level reports the full-pass rate (Pass, %), the mean score (Score, %), total API cost (💵), total wall-clock time (🕒), and total token use (Tok.). The final Overall Pass Rate column reports the full-pass rate over all evaluated tasks in the three difficulty levels. “—” cost data not available. Superscript ± values denote score standard deviations estimated from three independent runs of the same task instance; due to compute budget constraints, only a subset of configurations include repeated runs. †Model uses an additional visual sub-agent for visual perception. The lower panel reports ALE-CLI, the Linux-only subset, comparing CLI agents alongside GCUA references (*).
Only 2.9 % of runs hit the five‑hour wall‑clock cap, and those runs score substantially lower.
Table 6 reports a mean score of 20.8 % for capped runs versus 27.7 % for runs that finished earlier.
**Table 6.** Timeout frequency by difficulty tier. Scores are mean normalized scores on the same 0 to 100 scale used in Table 1
**Table 7.** Timeout frequency by harness for harnesses with at least one run that reached the cap.
Failure analysis attributes 47 % of failures to flawed approaches, 31 % to knowledge gaps, and 22 % to execution errors.
Section D.3’s taxonomy breakdown reports these percentages across the classified failures.
Swapping the backbone model changes overall pass rate by up to 18 pp, whereas changing the harness shifts it by at most 6 pp.
Table 1’s “Model vs. Harness Effect” comparison shows a 18 pp spread for model swaps and a 6 pp (GPT‑5.5) / 5.3 pp (Opus 4.7) spread for harness swaps.
When the backbone is fixed to GPT‑5.5, harness variation yields a 6 pp pass‑rate range; with Opus 4.7 the range is 5.3 pp.
Table 1 reports the respective harness ranges for each backbone.
Higher API cost does not guarantee better scores; GPT‑5.5 with ALE‑Claw tops performance at modest cost, while Opus 4.7 spends 3.7× more for a lower score.
Cost and score columns in Table 1 illustrate this trade‑off.
Wall‑clock time is largely independent of score; the fastest configuration (Droid Opus 4.6) finishes in 23 h but scores only 27.3 %.
Time and score entries in Table 1 show the disparity.
Token consumption is not predictive of success; Cursor (GPT‑5.5) uses 156 M tokens yet matches the high‑scoring ALE‑Claw (GPT‑5.5) which consumes 1 350 M tokens.
Token usage and score columns in Table 1 support this observation.
**Figure.** 3D animation output view: this render frame shows the rigged singer character from the agent's Blender submission. The task was to reproduce the body motion from a reference video, so a single valid-looking frame is only partial evidence; timing, pose range, and replay consistency determine success.
**Figure 11.** Public-subset representativeness. Pass rate per taxonomy cluster on the public subset (x) vs. the full task pool (y) for Claude Code + Opus 4.7. Point size $\propto$ total task instances per cluster. The strong correlation (r=0.89) confirms the public subset is representative.
Benchmark Construction Details
Details the ALE benchmark construction, taxonomy, task pipeline, and example task cards.
The appendix enumerates the full author roster, then details how the ALE benchmark taxonomy is built from occupational data, refined into domain‑specific subdomains, and extended to cover emerging workflows. It also describes the five‑gate task construction pipeline and showcases representative task cards that illustrate the evaluation rubric and scoring outcomes.
Evaluation Architecture Details
Appendix C details the ALE evaluation pipeline, task specs, scoring modes, and agent harness.
The pipeline consists of three decoupled components: a task specification that encodes the expert submission, an agent (model + harness) that interacts with the environment, and the remote virtual‑machine environment that hosts the required software.
Task specifications are single main.py files exposing three lifecycle functions—load() declares description and compute needs, start() materialises a deterministic VM state, and evaluate() returns a normalized score in [0, 1] after comparing agent outputs to references.
The agent receives the task configuration, observes the VM via screenshots or file reads, selects an action (mouse click, keystroke, shell command, file edit, or API call), executes it, and repeats until it signals termination.
The environment enforces a four‑directory contract: input/ (read‑only assets), software/ (pre‑installed applications), output/ (writable deliverables), and reference/ (ground‑truth, never exposed to the agent).
All tasks run on GCP VMs; the default is a c4-standard-4 (4 vCPUs, 16 GB RAM). GPU‑heavy tasks use g2-standard-8 with an NVIDIA L4, while memory‑intensive workloads receive larger configurations as declared in load().
The decoupled design guarantees that any agent conforming to the action interface can be evaluated on any task, and the same task spec can be deployed on different back‑ends without modification.
Task specifications implement three deterministic phases: load() (purely declarative), start() (VM preparation via a session API), and evaluate() (artifact retrieval and scoring).
Scoring runs either host‑side (default) when artifacts are small enough to transfer, or VM‑side when the artifact requires on‑VM software (e.g., CAD kernels, large geometry).
Artifact modes span seven categories—exact/hashed values, structured tabular, geometric/spatial, visual appearance, behavioral/world state, free‑text/semantic, and executable artifact—each with a concrete comparison strategy.
Final scores are composed via one of four patterns: a hard gate followed by a continuous score, a weighted rubric, a binary checklist average, or pairwise file aggregation.
ALE prefers deterministic, code‑based judges (93.2 % of workflows). LLM‑as‑judge is allowed only for perceptual deliverables and is always framed as narrow yes/no probes.
Table 4 (Figure 15) shows that 88.5 % of scoring runs host‑side, while 11.5 % require VM‑side verifiers; judge types are 93.2 % deterministic and 6.8 % LLM‑based.
Reference isolation is enforced by keeping reference/ outside the agent’s workspace and by early‑returning 0.0 if any required reference path is missing.
Each main.py defines a VARIANTS tuple of task instances; the current release contains 960 workflows and 1 490 instances, with per‑instance scores averaged upward.
The agent harness runs a six‑phase loop: Context Building, LLM Call, Decide, Collect Tool Result, Overflow Check (with compaction), and Termination.
At initialization the harness builds a system prompt from modular components (identity, memory, tool guidance, runtime metadata, safety rules, domain skills).
Tools are unified under a taxonomy: Bash (shell), File (filesystem), GUI (CUA desktop actions), Web (search/fetch), and Planning/Delegation (sub‑agents, memory).
GUI‑as‑Tool is realized via a CUA MCP bridge exposing 14 desktop‑action tools (Table 5); these map the agent’s high‑level commands to concrete mouse/keyboard events.
Sub‑agents can be spawned for specialized tool subsets, enabling parallel exploration while keeping the parent context compact.
The context manager applies three compaction tiers: micro‑compaction of stale tool results, LLM‑based summarization of older dialogue, and hard truncation to respect model token limits.
ALE‑Claw isolates the OpenClaw loop, removes production‑grade scaffolding (schedulers, multi‑channel gateways, plugin framework), and rewrites the core in Python for seamless integration with the CUA stack.
**Figure.** Moldex3D CAE workflow: the agent is editing the packing-pressure curve for a four-cavity injection-mold simulation. This is the process-setup stage; scoring depends on completing the solver run and extracting the pressure, force, cycle-time, volume, and weight metrics into results.json.
**Figure.** Music-engraving workflow: Dorico is open on an orchestral score in print/export mode. The task required converting the audio brief into a readable full score and MIDI, then exporting the PDF, MIDI, and overview screenshot to the exact output paths.
**Figure.** VFX compositing workflow: DaVinci Resolve shows a timeline and preview monitor with the bird footage. The task required identifying the green-screen foreground, keying it, compositing it over the intended sky plate, and matching the reference frame.
**Figure.** Radiology adjudication workflow: MicroDicom displays a chest X-ray with DICOM metadata and annotation tools. The task required reviewing each case, comparing two reader boxes for atelectasis, and writing TSV decisions; the deliverable depends on visual adjudication for each case.
**Table 3.** Evaluation modes available to task workflow authors. Most ALE task workflows combine two or more modes (e.g., a behavioral gate with a geometric score).
**Table 4.** Distribution of judge type and execution locale across the open-sourced task workflows in the ALE reference task tree.
The table lists various tools categorized by their function, including Keyboard, Mouse, and Utility groups, along with descriptions of their operations.