Agents’ Last Exam

Q: What is Agents' Last Exam (ALE) and what does it contribute?

ALE is a benchmark of 1,490 expert-authored task instances spanning 55 digital industries, designed to evaluate Generalist Computer-Use Agents (GCUAs) on real professional workflows that require interleaving GUI interaction, CLI operations, and domain-specific software to produce verifiable deliverables. It fills an evaluation gap left by existing benchmarks that measure isolated capabilities rather than sustained, economically meaningful performance.

Q: What problem does ALE address and why does it matter?

ALE addresses the persistent gap between high AI benchmark scores and economically meaningful deployment, arguing that existing benchmarks measure abstract competence rather than the long-horizon, tool-intensive workflows that define professional economic output. The authors frame this as a GDP-relevant issue: agents cannot replace or augment human expertise in real industry settings if they are only evaluated on simplified, synthetic tasks.

Q: What is a Generalist Computer-Use Agent (GCUA) as defined in the paper?

A GCUA is a system capable of managing an entire action loop across five functional layers: Brain (reasoning), Eyes (GUI perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). ALE forces integration of all five layers by requiring agents to navigate real professional software environments.

Q: Why does ALE use expert-authored tasks rather than synthetic ones?

Synthetic tasks fail to capture the complexity and real-world messiness of professional practice. ALE sources tasks from projects that domain experts have already shipped, ensuring agents are evaluated on authentic, economically valuable workflows that require professional judgment.

Q: How large is the ALE benchmark and how is it structured?

ALE contains 960 workflows and 1,490 task instances across 55 digital industry subdomains. To mitigate benchmark contamination, only 150 of the 1,490 instances (approximately 10%) are publicly released, while the remaining tasks stay private and are periodically rotated into the public set.

Q: How does ALE score agent outputs given the diversity of professional deliverables?

ALE uses deterministic, code-based judges for 93.2% of workflows, employing checks such as hashed values, geometric distances, and structured comparisons for artifacts like CAD files, code, and rendered media. LLM-as-judge scoring is permitted only for perceptual deliverables and is always framed as narrow yes/no probes, accounting for 6.8% of scoring.

Q: What are the seven artifact scoring modes used in ALE?

The seven artifact modes are: exact/hashed values, structured tabular, geometric/spatial, visual appearance, behavioral/world state, free-text/semantic, and executable artifact, each with a concrete comparison strategy. Final scores are composed via one of four patterns: a hard gate followed by a continuous score, a weighted rubric, a binary checklist average, or pairwise file aggregation.

Q: How is the ALE evaluation pipeline architected?

The pipeline consists of three decoupled components: a task specification (a single main.py file with load(), start(), and evaluate() lifecycle functions), an agent (model plus harness), and a remote virtual-machine environment hosting the required software. This decoupled design allows any conforming agent to be evaluated on any task without code changes.

Q: What compute infrastructure does ALE use to run tasks?

All tasks run on Google Cloud Platform (GCP) VMs; the default configuration is a c4-standard-4 with 4 vCPUs and 16 GB RAM. GPU-heavy tasks use a g2-standard-8 with an NVIDIA L4, and memory-intensive workloads receive larger configurations as declared in the load() function.

Q: How does the agent harness work in ALE?

The agent harness runs a six-phase loop—Context Building, LLM Call, Decide, Collect Tool Result, Overflow Check (with compaction), and Termination—using a language model to decide what to do next while maintaining a persistent context and invoking heterogeneous tools (GUI, API, file system) adaptively. This distinguishes it from a simple script runner, which follows a static command sequence without reasoning or adaptive feedback.

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu

ALE is a benchmark of 1K+ expert-authored, long-horizon professional workflows designed to measure AI economic utility.

How do current AI agent systems perform on complex, multi-step professional workflows that require real-world software interaction?

Current AI benchmarks measure abstract competence but fail to capture the long-horizon, tool-intensive workflows that define professional economic output. The authors introduce Agents’ Last Exam (ALE), a benchmark of 1,490 expert-authored task instances across 55 digital industries, where agents must interleave GUI interaction, CLI operations, and domain-specific software to produce verifiable professional deliverables. Frontier agents currently struggle to clear these tasks, with the hardest tier recording near-zero pass rates and even the easiest tier remaining far from saturated.

Paper Primer

ALE targets the Generalist Computer-Use Agent (GCUA), a system capable of managing an entire action loop across five functional layers: Brain (reasoning), Eyes (GUI perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). The benchmark forces this integration by requiring agents to navigate real professional software environments rather than synthetic or CLI-only sandboxes.

ALE is significantly harder than existing CLI-focused benchmarks.

Codex with GPT-5.5 achieves an 82% pass rate on Terminal-Bench but records only a 25.2% overall pass rate on the ALE-CLI subset. The gap indicates that ALE’s professional workflows demand capabilities beyond those tested by current terminal-centric evaluations.

Frontier agent performance remains low on the benchmark's most difficult tier.

Across mainstream harness and backbone configurations, the average full pass rate on the "Last Exam" tier is 2.6%. Most mainstream agents, including Claude Code, record near-zero pass rates at this difficulty level.

Why does this benchmark rely on expert-authored tasks rather than synthetic ones?

Synthetic tasks often fail to capture the complexity and "real-world" messiness of professional practice. By sourcing tasks from projects experts have already shipped, ALE ensures that agents are evaluated on authentic, economically valuable workflows that require professional judgment.

How does ALE handle the heterogeneity of professional deliverables?

ALE avoids open-ended LLM-as-judge scoring. Instead, it uses deterministic checks (e.g., hashed values, geometric distances) or narrow, evidence-anchored rubrics to verify artifacts like CAD files, code, or rendered media.

Introduction and Motivation

We expose the evaluation gap between current benchmarks and real professional workflows, introducing ALE to bridge it.

Recent AI systems excel on many benchmarks, yet these gains have not translated into economically meaningful deployment. The core issue is an evaluation gap: existing benchmarks measure isolated capabilities rather than sustained performance on real professional workflows. Agents' Last Exam (ALE) is introduced to fill this gap by providing a high‑fidelity, environment‑based benchmark of long‑horizon, economically valuable tasks.

A Professional Workflow is a multi‑step process that intertwines GUI interaction, command‑line operations, and domain‑specific software to produce a verifiable deliverable. Measuring agents on such workflows reveals whether they can replace or augment human expertise in real industry settings.

**Figure 1.** **Agents' Last Exam** spans a broad taxonomy of professional tasks and realistic workflows.

**Figure 3.** Benchmark positioning map. Prior benchmarks are placed by mapping their published domains onto the ALE domain taxonomy.

The persistent gap between benchmark scores and economic utility underscores why ALE’s real‑world focus is essential for translating AI progress into GDP‑relevant impact.

Benchmark Design and Taxonomy

Defines the ALE benchmark’s task criteria, taxonomy, construction pipeline, and release strategy.

The benchmark is built around three non‑negotiable requirements that filter which professional workflows become ALE task instances.

ALE curates full‑workflow professional tasks that use the same software a domain expert would, are substantial end‑to‑end deliverables, and produce outputs that can be checked automatically or with a clear rubric.

**Figure 2.** Distribution of 1,490 task instances across the ALE taxonomy. Each row is one of the 55 subdomains, grouped under the 13 top-level domains (parenthetical numbers give domain totals). Stacked bars decompose each subdomain into fully-implemented instances (domain color) and expert submissions awaiting Quality Control (QC) Process (orange). All 55 subdomains receive non-zero coverage. Current runnable task instances target either Linux or Windows virtual machines.

To make the taxonomy comparable across benchmarks, the authors map each prior benchmark’s categories onto the same 55‑subdomain scheme using an LLM‑assisted classifier, revealing a 13‑subdomain gap that no existing suite covers.

**Figure 4.** Task construction pipeline. Tasks proceed from expert sourcing through submission, first-pass review, engineering implementation, and final quality control.

Expert outreach recruits domain specialists via an advisory committee to ensure coverage of all ALE subdomains.

Experts submit proposals through a dedicated web portal, providing a natural‑language description, input files, target software, expected deliverable, and evaluation specification.

First‑pass review assigns conference‑style decisions (major/minor revision, borderline accept, accept, strong accept) and routes required revisions back to the expert.

Engineers convert accepted specifications into runnable containers, perform dry‑runs, and feed any gaps back to the expert for clarification.

The QC committee conducts a peer review of reference outputs, calibrates evaluation bounds, and validates contextual sufficiency before admitting the task to the benchmark.

The portal records all five components and flags missing fields.

Auto‑review runs a lightweight validator that checks file format compatibility and software version availability.

Engineers spin up a SolidWorks container, execute the conversion script, and verify that the generated STL hash matches the reference.

QC reviewers confirm that the reference STL was produced by a senior CAD engineer and that the hash comparison is deterministic.

The task is approved and added to the private pool of ALE instances.

This concrete walk‑through shows how each of the five required components is validated before a task reaches the public benchmark.

**Figure 5.** Provenance and review yield. The 1,490 task instances split into 960 external submissions (top, by first-pass review verdict) and 530 commissioned tasks (bottom). Each bar is segmented by release state: 150 public, 1,017 private and 323 unverified pending QC.

ALE mitigates benchmark contamination by exposing only 150 of the 1,490 task instances (≈10 %) publicly; the remaining tasks stay private and are periodically rotated into the public set while retired tasks are retired, preserving an uncontaminated evaluation surface.

**Figure 10.** Software ecosystem covered by ALE tasks. Each icon is a distinct application or toolchain that appears in at least one task workflow, positioned within its primary ALE domain. Overlap regions hold tools that span multiple domains (e.g., creative-suite applications shared between Visual & Media Arts and Engineering). The figure is qualitative; quantitative per-subdomain instance counts appear in Figure 2.

The Evaluation Pipeline

The pipeline wires task specs, agent harnesses, and environments into a reproducible evaluation loop.

The evaluation pipeline orchestrates three decoupled components—task specification, agent harness, and environment—so that any agent can be tested on any task without code changes.

An Agent Harness bundles a foundation model with the plumbing that lets it act on a virtual machine—sending clicks, typing, invoking tools, and processing feedback.

How does an Agent Harness differ from a simple script runner that executes commands on a VM?

The harness adds a language model that decides *what* to do, maintains a persistent context, and can invoke heterogeneous tools (GUI, API, file system) on the fly, whereas a script runner follows a static sequence of commands without reasoning or adaptive feedback.

The Evaluation Pipeline stitches together a Task Specification, an Agent Harness, and a remote Environment so that any agent can be evaluated on any professional workflow without rewriting code.

What would break if the load() function omitted the compute‑resource declaration?

Without declaring resources, the provisioning step cannot guarantee a deterministic VM configuration; subsequent start() may allocate insufficient CPU or memory, causing the agent’s actions to fail or produce nondeterministic results, which invalidates the scoring.

The Agent Harness receives the task description, builds a system prompt, and enters an action loop that reads the VM’s screenshots, issues tool calls, and writes to

The Agent Harness reads the description, issues a

This minimal run shows how the three lifecycle functions coordinate resource declaration, deterministic provisioning, and deterministic scoring, even for a trivial task.

Agent action loop skeleton.

**Figure 6.** **Evaluation pipeline architecture.** Each benchmark instance is defined by a Task Specification (main.py) that orchestrates a three-phase lifecycle (load(), start(), evaluate()) over a remote virtual-machine environment. The agent (harness + model) receives only the task description and metadata, interacts with the environment through an action loop, and produces output artifacts that the specification scores against references or rubrics.

**Figure 8.** Typical GCUA harness architecture. The main agent loop (left) cycles through context building, LLM inference, action decision, tool execution, and overflow management. The system prompt builder, tool system (including GUI harness via MCP), sub-agents, and context compaction manager are shared across mainstream harness implementations.

**Figure 7.** **Agent capability taxonomy.** Five functional layers define an agent's operational surface. Generalist CUA-agents (GCUA) possess full capability across all layers; CLI-agents lack visual perception (Eyes); GUI-agents have limited orchestration, tool use, and runtime access (Body, Hands, Feet).

Table 1 (not reproduced here) reports full‑pass rates, scores, API costs, and token usage for a variety of agent harnesses evaluated on ALE.

Experimental Results

We report how model choice dominates performance variation across agent harnesses.

ALE supplies task instances taken from real professional workflows, and we evaluate them with agents configured as Generalist CUA‑agents (GCUA) that can see and act across all five functional layers.

Model choice drives far larger performance variation than harness choice.

Figure 17 shows an 18.0 pp spread when varying the backbone model, versus only 5.3–6.0 pp when varying the harness.

**Figure 12.** Model choice vs. harness choice. Each dot is one configuration; the vertical bracket shows the full range of overall pass rates. Varying the backbone model under a fixed harness (OpenClaw, 12 models) produces an 18.0 pp spread, roughly 3x the spread observed when varying the harness under a fixed backbone (5.3–6.0 pp).

**Figure 13.** Performance vs. resource consumption for mainstream agent harnesses. Each bubble represents one harness-backbone configuration from Table 1; bubble area is proportional to total token consumption. (a) Overall mean score vs. total API cost (configurations with available cost data). (b) Overall mean score vs. total wall-clock time (all 14 configurations). The ideal operating point is the upper-left corner of each panel (high score, low resource use).

**Figure 9.** **Experiment analysis overview.** (a) Domain-level mean scores for Opus 4.7 and GPT-5.5, each averaged over harnesses with completed runs on the selected public task set; the sparse transportation domain is omitted. (b) Tool-call mix for the best available table-backed configuration per harness. (c) Tool-call mix for backbone models under a fixed OpenClaw harness. (d) Failure root-cause taxonomy for failed Claude Code + Opus 4.7 public-task runs.

Performance spread is driven primarily by the choice of backbone model rather than the agent harness.

Representativeness Analysis

Current AI benchmarks are simplified; ALE supplies high‑fidelity, end‑to‑end professional workflows.

In this appendix we revisit the paper’s core claim—standard benchmarks are overly simplistic, while ALE offers a realistic, end‑to‑end evaluation of professional agents.

Public‑subset pass rates track full‑pool pass rates with a strong Pearson correlation.

Figure 16 (bubble chart) shows a tight linear relationship (Pearson $r=0.89$, $p<0.001$) between the two sets of scores.

**Table 1.** Main results on ALE. Each difficulty level reports the full-pass rate (Pass, %), the mean score (Score, %), total API cost (💵), total wall-clock time (🕒), and total token use (Tok.). The final Overall Pass Rate column reports the full-pass rate over all evaluated tasks in the three difficulty levels. “—” cost data not available. Superscript ± values denote score standard deviations estimated from three independent runs of the same task instance; due to compute budget constraints, only a subset of configurations include repeated runs. †Model uses an additional visual sub-agent for visual perception. The lower panel reports ALE-CLI, the Linux-only subset, comparing CLI agents alongside GCUA references (*).

Only 2.9 % of runs hit the five‑hour wall‑clock cap, and those runs score substantially lower.

Table 6 reports a mean score of 20.8 % for capped runs versus 27.7 % for runs that finished earlier.

**Table 6.** Timeout frequency by difficulty tier. Scores are mean normalized scores on the same 0 to 100 scale used in Table 1

**Table 7.** Timeout frequency by harness for harnesses with at least one run that reached the cap.

Failure analysis attributes 47 % of failures to flawed approaches, 31 % to knowledge gaps, and 22 % to execution errors.

Section D.3’s taxonomy breakdown reports these percentages across the classified failures.

Swapping the backbone model changes overall pass rate by up to 18 pp, whereas changing the harness shifts it by at most 6 pp.

Table 1’s “Model vs. Harness Effect” comparison shows a 18 pp spread for model swaps and a 6 pp (GPT‑5.5) / 5.3 pp (Opus 4.7) spread for harness swaps.

When the backbone is fixed to GPT‑5.5, harness variation yields a 6 pp pass‑rate range; with Opus 4.7 the range is 5.3 pp.

Table 1 reports the respective harness ranges for each backbone.

Higher API cost does not guarantee better scores; GPT‑5.5 with ALE‑Claw tops performance at modest cost, while Opus 4.7 spends 3.7× more for a lower score.

Cost and score columns in Table 1 illustrate this trade‑off.

Wall‑clock time is largely independent of score; the fastest configuration (Droid Opus 4.6) finishes in 23 h but scores only 27.3 %.

Time and score entries in Table 1 show the disparity.

Token consumption is not predictive of success; Cursor (GPT‑5.5) uses 156 M tokens yet matches the high‑scoring ALE‑Claw (GPT‑5.5) which consumes 1 350 M tokens.

Token usage and score columns in Table 1 support this observation.

**Figure.** 3D animation output view: this render frame shows the rigged singer character from the agent's Blender submission. The task was to reproduce the body motion from a reference video, so a single valid-looking frame is only partial evidence; timing, pose range, and replay consistency determine success.

**Figure 11.** Public-subset representativeness. Pass rate per taxonomy cluster on the public subset (x) vs. the full task pool (y) for Claude Code + Opus 4.7. Point size $\propto$ total task instances per cluster. The strong correlation (r=0.89) confirms the public subset is representative.

Benchmark Construction Details

Details the ALE benchmark construction, taxonomy, task pipeline, and example task cards.

The appendix enumerates the full author roster, then details how the ALE benchmark taxonomy is built from occupational data, refined into domain‑specific subdomains, and extended to cover emerging workflows. It also describes the five‑gate task construction pipeline and showcases representative task cards that illustrate the evaluation rubric and scoring outcomes.

Evaluation Architecture Details

Appendix C details the ALE evaluation pipeline, task specs, scoring modes, and agent harness.

The pipeline consists of three decoupled components: a task specification that encodes the expert submission, an agent (model + harness) that interacts with the environment, and the remote virtual‑machine environment that hosts the required software.

Task specifications are single main.py files exposing three lifecycle functions—load() declares description and compute needs, start() materialises a deterministic VM state, and evaluate() returns a normalized score in [0, 1] after comparing agent outputs to references.

The agent receives the task configuration, observes the VM via screenshots or file reads, selects an action (mouse click, keystroke, shell command, file edit, or API call), executes it, and repeats until it signals termination.

The environment enforces a four‑directory contract: input/ (read‑only assets), software/ (pre‑installed applications), output/ (writable deliverables), and reference/ (ground‑truth, never exposed to the agent).

All tasks run on GCP VMs; the default is a c4-standard-4 (4 vCPUs, 16 GB RAM). GPU‑heavy tasks use g2-standard-8 with an NVIDIA L4, while memory‑intensive workloads receive larger configurations as declared in load().

The decoupled design guarantees that any agent conforming to the action interface can be evaluated on any task, and the same task spec can be deployed on different back‑ends without modification.

Task specifications implement three deterministic phases: load() (purely declarative), start() (VM preparation via a session API), and evaluate() (artifact retrieval and scoring).

Scoring runs either host‑side (default) when artifacts are small enough to transfer, or VM‑side when the artifact requires on‑VM software (e.g., CAD kernels, large geometry).

Artifact modes span seven categories—exact/hashed values, structured tabular, geometric/spatial, visual appearance, behavioral/world state, free‑text/semantic, and executable artifact—each with a concrete comparison strategy.

Final scores are composed via one of four patterns: a hard gate followed by a continuous score, a weighted rubric, a binary checklist average, or pairwise file aggregation.

ALE prefers deterministic, code‑based judges (93.2 % of workflows). LLM‑as‑judge is allowed only for perceptual deliverables and is always framed as narrow yes/no probes.

Table 4 (Figure 15) shows that 88.5 % of scoring runs host‑side, while 11.5 % require VM‑side verifiers; judge types are 93.2 % deterministic and 6.8 % LLM‑based.

Reference isolation is enforced by keeping reference/ outside the agent’s workspace and by early‑returning 0.0 if any required reference path is missing.

Each main.py defines a VARIANTS tuple of task instances; the current release contains 960 workflows and 1 490 instances, with per‑instance scores averaged upward.

The agent harness runs a six‑phase loop: Context Building, LLM Call, Decide, Collect Tool Result, Overflow Check (with compaction), and Termination.

At initialization the harness builds a system prompt from modular components (identity, memory, tool guidance, runtime metadata, safety rules, domain skills).

Tools are unified under a taxonomy: Bash (shell), File (filesystem), GUI (CUA desktop actions), Web (search/fetch), and Planning/Delegation (sub‑agents, memory).

GUI‑as‑Tool is realized via a CUA MCP bridge exposing 14 desktop‑action tools (Table 5); these map the agent’s high‑level commands to concrete mouse/keyboard events.

Sub‑agents can be spawned for specialized tool subsets, enabling parallel exploration while keeping the parent context compact.

The context manager applies three compaction tiers: micro‑compaction of stale tool results, LLM‑based summarization of older dialogue, and hard truncation to respect model token limits.

ALE‑Claw isolates the OpenClaw loop, removes production‑grade scaffolding (schedulers, multi‑channel gateways, plugin framework), and rewrites the core in Python for seamless integration with the CUA stack.

**Figure.** Moldex3D CAE workflow: the agent is editing the packing-pressure curve for a four-cavity injection-mold simulation. This is the process-setup stage; scoring depends on completing the solver run and extracting the pressure, force, cycle-time, volume, and weight metrics into results.json.

**Figure.** Music-engraving workflow: Dorico is open on an orchestral score in print/export mode. The task required converting the audio brief into a readable full score and MIDI, then exporting the PDF, MIDI, and overview screenshot to the exact output paths.

**Figure.** VFX compositing workflow: DaVinci Resolve shows a timeline and preview monitor with the bird footage. The task required identifying the green-screen foreground, keying it, compositing it over the intended sky plate, and matching the reference frame.

**Figure.** Radiology adjudication workflow: MicroDicom displays a chest X-ray with DICOM metadata and annotation tools. The task required reviewing each case, comparing two reader boxes for atelectasis, and writing TSV decisions; the deliverable depends on visual adjudication for each case.

**Table 3.** Evaluation modes available to task workflow authors. Most ALE task workflows combine two or more modes (e.g., a behavioral gate with a geometric score).

**Table 4.** Distribution of judge type and execution locale across the open-sourced task workflows in the ALE reference task tree.

The table lists various tools categorized by their function, including Keyboard, Mouse, and Utility groups, along with descriptions of their operations.

Questions & answers

What is Agents' Last Exam (ALE) and what does it contribute?

ALE is a benchmark of 1,490 expert-authored task instances spanning 55 digital industries, designed to evaluate Generalist Computer-Use Agents (GCUAs) on real professional workflows that require interleaving GUI interaction, CLI operations, and domain-specific software to produce verifiable deliverables. It fills an evaluation gap left by existing benchmarks that measure isolated capabilities rather than sustained, economically meaningful performance.

What problem does ALE address and why does it matter?

ALE addresses the persistent gap between high AI benchmark scores and economically meaningful deployment, arguing that existing benchmarks measure abstract competence rather than the long-horizon, tool-intensive workflows that define professional economic output. The authors frame this as a GDP-relevant issue: agents cannot replace or augment human expertise in real industry settings if they are only evaluated on simplified, synthetic tasks.

What is a Generalist Computer-Use Agent (GCUA) as defined in the paper?

A GCUA is a system capable of managing an entire action loop across five functional layers: Brain (reasoning), Eyes (GUI perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). ALE forces integration of all five layers by requiring agents to navigate real professional software environments.

Why does ALE use expert-authored tasks rather than synthetic ones?

Synthetic tasks fail to capture the complexity and real-world messiness of professional practice. ALE sources tasks from projects that domain experts have already shipped, ensuring agents are evaluated on authentic, economically valuable workflows that require professional judgment.

How large is the ALE benchmark and how is it structured?

ALE contains 960 workflows and 1,490 task instances across 55 digital industry subdomains. To mitigate benchmark contamination, only 150 of the 1,490 instances (approximately 10%) are publicly released, while the remaining tasks stay private and are periodically rotated into the public set.

How does ALE score agent outputs given the diversity of professional deliverables?

ALE uses deterministic, code-based judges for 93.2% of workflows, employing checks such as hashed values, geometric distances, and structured comparisons for artifacts like CAD files, code, and rendered media. LLM-as-judge scoring is permitted only for perceptual deliverables and is always framed as narrow yes/no probes, accounting for 6.8% of scoring.

What are the seven artifact scoring modes used in ALE?

The seven artifact modes are: exact/hashed values, structured tabular, geometric/spatial, visual appearance, behavioral/world state, free-text/semantic, and executable artifact, each with a concrete comparison strategy. Final scores are composed via one of four patterns: a hard gate followed by a continuous score, a weighted rubric, a binary checklist average, or pairwise file aggregation.

How is the ALE evaluation pipeline architected?

The pipeline consists of three decoupled components: a task specification (a single main.py file with load(), start(), and evaluate() lifecycle functions), an agent (model plus harness), and a remote virtual-machine environment hosting the required software. This decoupled design allows any conforming agent to be evaluated on any task without code changes.

What compute infrastructure does ALE use to run tasks?

All tasks run on Google Cloud Platform (GCP) VMs; the default configuration is a c4-standard-4 with 4 vCPUs and 16 GB RAM. GPU-heavy tasks use a g2-standard-8 with an NVIDIA L4, and memory-intensive workloads receive larger configurations as declared in the load() function.

How does the agent harness work in ALE?

The agent harness runs a six-phase loop—Context Building, LLM Call, Decide, Collect Tool Result, Overflow Check (with compaction), and Termination—using a language model to decide what to do next while maintaining a persistent context and invoking heterogeneous tools (GUI, API, file system) adaptively. This distinguishes it from a simple script runner, which follows a static command sequence without reasoning or adaptive feedback.

What tools are available to agents in ALE?

Tools are unified under five categories: Bash (shell), File (filesystem), GUI (desktop actions via a CUA MCP bridge exposing 14 desktop-action tools), Web (search/fetch), and Planning/Delegation (sub-agents, memory). Sub-agents can be spawned for specialized tool subsets, enabling parallel exploration while keeping the parent context compact.

What are the key experimental results reported for frontier agents on ALE?

Frontier agents currently struggle to clear ALE tasks: the hardest tier records near-zero pass rates, and even the easiest tier remains far from saturated. The paper reports that performance spread is driven primarily by the choice of backbone model rather than the agent harness, though specific numerical pass rates per model are referenced in Table 1 but not reproduced in the provided text.

How does ALE compare to prior benchmarks in terms of domain coverage?

The authors map prior benchmarks' categories onto ALE's 55-subdomain taxonomy using an LLM-assisted classifier and find a 13-subdomain gap that no existing benchmark suite covers. This analysis supports the paper's claim that standard benchmarks are overly simplistic relative to ALE's professional scope.

What are the limitations or open challenges acknowledged in the paper?

The paper does not explicitly enumerate a dedicated limitations section in the provided text, but implicitly acknowledges that near-zero pass rates on the hardest tier mean the benchmark is currently far from being solved, and that benchmark contamination is an ongoing concern addressed only partially by keeping 90% of tasks private and rotating them periodically.

How does ALE prevent agents from accessing ground-truth reference data during evaluation?

The environment enforces a four-directory contract—input/ (read-only assets), software/ (pre-installed applications), output/ (writable deliverables), and reference/ (ground-truth, never exposed to the agent)—and the evaluate() function early-returns a score of 0.0 if any required reference path is missing. Reference isolation is enforced by keeping the reference/ directory outside the agent's workspace.

What is the three-gate task construction pipeline used to build ALE?

The paper refers to a five-gate task construction pipeline (not three-gate) that filters which professional workflows become ALE task instances, though the specific gates are described as enumerated in the appendix rather than detailed in the main text provided. The paper does not reproduce the full gate criteria in the excerpted content.

What is ALE-Claw and how does it relate to the benchmark?

ALE-Claw is a stripped-down version of the OpenClaw agent loop that removes production-grade scaffolding such as schedulers, multi-channel gateways, and plugin frameworks, and rewrites the core in Python for seamless integration with the CUA stack. It serves as one of the agent harnesses evaluated on ALE.

Where and when was ALE published, and who are the authors?

ALE is available as an arXiv preprint at arxiv.org/abs/2606.05405. The paper states that the full author roster is enumerated in the appendix, but the provided text does not list individual author names or a publication venue beyond arXiv.

Key terms

ALE (Agents' Last Exam): A benchmark of 1,490 expert-authored task instances across 55 digital industries designed to evaluate AI agents on authentic, long-horizon professional workflows.
GCUA (Generalist Computer-Use Agent): An AI system capable of managing a full action loop across reasoning, GUI perception, orchestration, tool invocation, and runtime substrate layers to complete professional tasks.
Professional Workflow: A multi-step process that intertwines GUI interaction, command-line operations, and domain-specific software to produce a verifiable professional deliverable.
Agent Harness: A software wrapper that pairs a language model with a persistent context and heterogeneous tool access, enabling adaptive decision-making during task execution rather than static script execution.
CUA MCP Bridge: A software interface that exposes 14 desktop-action tools, translating an agent's high-level GUI commands into concrete mouse and keyboard events on a virtual machine.
LLM-as-Judge: A scoring approach where a large language model evaluates agent outputs, used in ALE only for perceptual deliverables and restricted to narrow yes/no probes to minimize subjectivity.
Deterministic Judge: A code-based scoring function that compares agent outputs to reference artifacts using fixed rules such as hash matching or geometric distance, without relying on a language model.
Task Specification (main.py): A single Python file defining three lifecycle functions—load(), start(), and evaluate()—that declare compute needs, prepare the VM state, and score agent outputs respectively.
load(): The declarative lifecycle function in a task specification that announces the task description and required compute resources before the VM is provisioned.
start(): The lifecycle function that materializes a deterministic virtual machine state, installing files and software needed for the agent to begin working on a task.
evaluate(): The lifecycle function that retrieves agent-produced artifacts and returns a normalized score between 0 and 1 by comparing them to reference outputs.
Reference Isolation: A security design principle in ALE that keeps ground-truth reference files in a directory never accessible to the agent, preventing the agent from trivially copying correct answers.
Context Compaction: A three-tier process in the agent harness—micro-compaction of stale tool results, LLM-based summarization of older dialogue, and hard truncation—used to keep the agent's running context within model token limits.
Benchmark Contamination: The risk that an AI model has been trained on benchmark task data, inflating its apparent performance; ALE mitigates this by keeping 90% of tasks private and rotating them periodically.
GUI (Graphical User Interface) Interaction: Agent actions that involve clicking, typing, or otherwise operating visual desktop applications, as opposed to issuing text-based command-line instructions.
CLI (Command-Line Interface) Operations: Agent actions that involve issuing text commands in a shell or terminal to manipulate files, run programs, or query system state.
ALE-Claw: A simplified Python reimplementation of the OpenClaw agent loop, stripped of production scaffolding, used as one of the agent harnesses evaluated on ALE.
Artifact Mode: One of seven categories describing the type of output an agent must produce (e.g., hashed value, geometric file, visual media), each associated with a specific comparison strategy for scoring.
VM-side Scoring: A scoring approach where artifact evaluation runs inside the virtual machine rather than on the host, required when the comparison needs on-VM software such as CAD kernels or large geometry processors.
Sub-agent: A specialized child agent spawned by a parent agent harness to handle a particular tool subset or subtask in parallel, keeping the parent's context compact.

Read the original paper

Open the simplified reader on Paperglide

Browse all simplified papers