AUTOMEDBENCH: Towards Medical Autoresearch with Agentic AI Models
Junqi Liu, Selena Song, Yuhan Wang, Jiawei Mao, Hardy Chen, Xiaoke Huang, Tianhao Qi, Pengfei Guo, Yucheng Tang, Yufan He, Can Zhao, Andriy Myronenko, Dong Yang, Daguang Xu, Yuyin Zhou
AUTOMEDBENCH evaluates autonomous agents on end-to-end medical-AI research workflows, exposing critical failures in validation and engineering.
How do current LLM-based agents perform on end-to-end medical research tasks, and where do their workflows break down?
Medical-AI research requires agents to plan experiments, configure environments, and validate results, but existing benchmarks only evaluate final outputs, leaving the causes of failure invisible. AUTOMEDBENCH introduces a unified five-stage research workflow—Plan, Setup, Validate, Inference, and Submit—that tracks agent progress through 24 diverse medical tasks while recording diagnostic error codes for every breakdown. Across thousands of runs, agents consistently struggle with validation and submission, with engineering-shaped errors accounting for the vast majority of failures despite high-quality domain knowledge.
Paper Primer
The benchmark forces agents to operate within a two-container sandbox: an agent container with network access and a writable workspace, and an isolated offline evaluator that holds private references. This structure ensures that agents cannot "cheat" by accessing ground-truth data, forcing them to demonstrate a complete, reproducible research pipeline from initial brief to final artifact submission.
Validation is the primary bottleneck in medical research agents.
Stage-level scoring across 24 tasks shows that S3 (Validate) consistently receives the lowest mean score, while S2 (Setup) is the strongest. Agents are significantly better at installing dependencies than at executing pilot checks to verify pipeline reliability before full-scale inference.
Engineering failures, not domain knowledge gaps, dominate agent breakdowns.
Post-run error analysis shows that E3 (Verification) and E5 (Submission) errors account for 37.7% and 38.1% of fired codes, respectively, while E1 (Task Understanding) errors occur in only 0.9% of cases. Runs with at least one fired error code score 48% lower on average than successful runs.
Why is a workflow-aware benchmark necessary for medical AI?
Final-output metrics obscure the research process; a low score could stem from a simple formatting error, a failed model load, or a fundamental misunderstanding of the medical task. By scoring the five-stage workflow, the benchmark identifies whether an agent failed due to poor planning, weak validation, or execution errors.
Does providing more detailed instructions (LITE tier) improve agent performance?
Not consistently. While some agents improve with more scaffolding, others perform worse, suggesting that excessive instructions can constrain agents into brittle, inefficient workflows rather than helping them navigate the research process.
Reliable medical-AI agents require more than just domain knowledge; they must possess robust engineering capabilities, specifically the ability to perform intermediate validation and recover from execution errors.
Introduction: The Need for Medical Agent Benchmarking
We identify the gap in workflow‑level evaluation and introduce AutoMedBench to measure stage‑level performance.
Existing medical‑agent benchmarks focus on final task success, offering little insight into how agents behave throughout a research workflow. This obscures whether failures stem from poor planning, broken pipelines, or missed validation steps. By exposing the hidden process, we can pinpoint the exact stage where an agent collapses.
AutoMedBench treats each autonomous medical‑AI run as a five‑stage workflow and scores every stage as well as the final task outcome.
A shift from isolated prediction to end‑to‑end research is essential for reliable medical AI agents.
Benchmark Design and Workflow
We detail the benchmark’s task selection, workflow, and execution environment.
We include a task only if its public inputs can be provided to the agent, the ground‑truth references remain hidden, and the task fits the shared research protocol. Tasks requiring subjective judgment, long‑horizon clinical dialogue, or on‑the‑fly model adaptation are omitted.
The benchmark forces agents to follow a five‑stage research pipeline that mirrors real‑world medical studies, ensuring each intermediate decision is observable.
How does this workflow differ from typical benchmarks that only score the final output?
Standard benchmarks evaluate a single end‑result, ignoring how the model arrived there. Here the agent must explicitly demonstrate planning, environment setup, and validation, so the system is judged on the entire research process, not just the final prediction.
Curate tasks across the five research tracks (segmentation, enhancement, VQA, report generation, detection).
Apply inclusion criteria (public inputs, hidden references, workflow compatibility) and filter out excluded tasks.
Define the shared five‑stage workflow and embed it into the benchmark harness.
Implement post‑run error coding with categories E1–E5 to capture failure modes.
Set up a uniform execution environment: a single base LLM agent interacting with a code‑execution sandbox.
Enforce two‑container isolation and an inference‑only policy, disallowing any training or fine‑tuning during a run.
S1 Plan: the agent writes a plan.md selecting a UNet‑based method and outlines preprocessing steps.
S2 Setup: it installs the required PyTorch version, downloads the pretrained UNet weights, and verifies the input shape.
S3 Validate: the agent runs a pilot on a single slice, inspects the Dice score, and adjusts the preprocessing parameters.
S4 Inference: it processes the full volume, writes the segmentation masks to the workspace, and logs runtime.
S5 Submit: the agent creates a submission.json adhering to the schema and submits it; the evaluator then computes the final Task Score.
This concrete run shows how each workflow stage produces tangible artifacts that the benchmark can audit, turning an opaque black‑box model into a stepwise research process.
**Figure 2.** **AutoMedBench**: a workflow-aware benchmark for autonomous medical AI research. **Left**: Tasks are sourced from 20+ public challenges (e.g., KiTS19 [25]) spanning diverse modalities (CT, MRI, X-ray, ultrasound, video) and task types (segmentation, detection, VQA, report generation, and image enhancement). Each task provides a natural-language description and deliverable target, with two difficulty tiers: *Lite* (method, environment, and skill scaffolding provided) and *Standard* (agent selects method and environment autonomously with plan-only guidance). **Right**: Given data access and a task description, an AI agent conducts auto research via a shared S1–S5 workflow — Plan (understand task, select method), Setup (install dependencies, load models), Validate (run pilot case, inspect outputs, fix errors), Inference (run inference, write predictions), and Submit (verify schema, submit answers) — before scoring and evaluation. Each agent operates in an isolated container with a private workspace; shared data and skill files are readable, but access to other agents' workspaces, evaluation ground truth, and scoring rubrics is prohibited, with violations triggering a warning then termination.
Table 1 enumerates the 24 active tasks, and Table 2 formalizes the five‑stage workflow that all agents must follow.
Scoring Metrics and Task Formulation
Defines the task tuple, difficulty tiers, and scoring protocol for AutoMedBench.
The benchmark treats each research problem as a tuple $T = (D_{\text{pub}}, D_{\text{priv}}, b, A, S, m, \tau)$, where the agent receives public data $D_{\text{pub}}$ and a brief $b$, must emit an artifact $A$ that conforms to schema $S$ within the wall‑time limit $\tau$, and is scored against hidden references $D_{\text{priv}}$ using metric $m$.
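As a minimal sketch of how the task tuple could be represented, the Python below models the fixed parts of a task and scores an artifact against hidden references. The class name `TaskSpec`, the field names, the reduction of the schema $S$ to a set of required keys, and the rule that a late or malformed artifact scores zero are all illustrative assumptions, not the benchmark's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container for the fixed parts of the task tuple."""
    public_data: str                              # D_pub: inputs visible to the agent
    brief: str                                    # b: natural-language task brief
    required_keys: frozenset                      # S: schema, simplified to required keys
    metric: Callable[[Mapping, Mapping], float]   # m(A, D_priv)
    wall_time_s: float                            # tau: wall-time limit in seconds


def task_score(task, artifact, private_refs, elapsed_s):
    """Score artifact A against hidden references D_priv.

    Assumption: an artifact that is late or violates the schema scores 0.0.
    """
    if elapsed_s > task.wall_time_s:
        return 0.0
    if not task.required_keys <= set(artifact):
        return 0.0
    return task.metric(artifact, private_refs)


def exact_match(artifact, refs):
    """Toy metric: fraction of reference answers reproduced exactly."""
    hits = sum(artifact["answers"].get(k) == v for k, v in refs.items())
    return hits / len(refs)


demo = TaskSpec("data/public", "Answer each question.",
                frozenset({"answers"}), exact_match, 3600.0)
refs = {"q1": "yes", "q2": "no"}  # stands in for D_priv, held by the evaluator

score = task_score(demo, {"answers": {"q1": "yes", "q2": "yes"}}, refs, elapsed_s=120.0)
late = task_score(demo, {"answers": {"q1": "yes", "q2": "no"}}, refs, elapsed_s=7200.0)
print(score, late)  # 0.5 0.0
```

Keeping the private references out of `TaskSpec` mirrors the separation described above: the agent sees only the public data and the brief, while the evaluator alone holds the references.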
AutoMedBench separates what the agent ultimately produces (Task Score) from how well it follows the research workflow (Agentic Score), then averages the two to obtain a single benchmark number.
How does the Agentic Score differ from the Task Score?
Task Score evaluates only the final output against hidden references, whereas Agentic Score aggregates five separate judgments about how the agent planned, set up, validated, ran inference, and packaged its results. An agent can score high on Task but low on Agentic if it skips required workflow steps.
AutoMedBench defines two difficulty tiers that differ only in the amount of scaffolding provided in the task brief, keeping all other variables fixed.
In the LITE tier the brief supplies a concrete method, key dependencies, and stage‑specific hints; in the STANDARD tier the brief constrains only the model families, leaving method selection and validation entirely to the agent.
Provide the agent with public data $D_{\text{pub}}$ and the task brief $b$.
Agent generates artifact $A$ that obeys the submission schema $S$ and finishes within the wall‑time limit $\tau$.
Evaluator computes the Task Score by applying the task‑specific metric $m(A, D_{\text{priv}})$.
Evaluator judges each workflow stage $S_1$–$S_5$ (LLM judges for $S_1$–$S_3$, deterministic checks for $S_4$–$S_5$).
Compute $AGENTIC$ as the weighted sum of the five stage scores.
Compute $OVERALL$ as the average of $AGENTIC$ and $TASK$.
The agent submits a mask achieving macro‑Dice $0.80$ → Task Score $0.80$.
Stage scores: $S_1=1.0$ (plan accepted), $S_2=0.9$ (dependencies resolved), $S_3=0.7$ (validation missed a corner case), $S_4=1.0$ (inference completed), $S_5=1.0$ (submission format correct).
Compute $AGENTIC = 0.25 \cdot 1.0 + 0.15 \cdot 0.9 + 0.35 \cdot 0.7 + 0.15 \cdot 1.0 + 0.10 \cdot 1.0 = 0.88$.
Overall score $= 0.5 \cdot 0.88 + 0.5 \cdot 0.80 = 0.84$ (84%).
This concrete run shows how a modest validation slip ($S_3=0.7$) drags the overall benchmark down, even when the final artifact is strong.
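The worked example above can be reproduced with a short Python sketch. The stage weights are the ones stated in the Figure 3 caption; the function and variable names are illustrative assumptions rather than the benchmark's scoring code.

```python
# Stage weights as stated in the Figure 3 caption.
STAGE_WEIGHTS = {"S1": 0.25, "S2": 0.15, "S3": 0.35, "S4": 0.15, "S5": 0.10}


def agentic_score(stage_scores):
    """Weighted sum of the five stage scores, each expected in [0, 1]."""
    missing = set(STAGE_WEIGHTS) - set(stage_scores)
    if missing:
        raise ValueError(f"missing stage scores: {sorted(missing)}")
    return sum(STAGE_WEIGHTS[s] * stage_scores[s] for s in STAGE_WEIGHTS)


def overall_score(agentic, task):
    """Equal-weighted average of the Agentic Score and the Task Score."""
    return 0.5 * agentic + 0.5 * task


stages = {"S1": 1.0, "S2": 0.9, "S3": 0.7, "S4": 1.0, "S5": 1.0}
agentic = agentic_score(stages)
overall = overall_score(agentic, task=0.80)
print(f"agentic={agentic:.2f} overall={overall:.2f}")  # agentic=0.88 overall=0.84
```

Because S3 carries the largest weight (0.35), the same 0.3 shortfall at any other stage would cost less than it does at validation.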
**Figure 3.** **AUTOMEDBENCH scoring rubrics.** The overall score is the equal-weighted average of Task Score and Agentic Score ($\times 0.5$ each). **Task Score** is computed deterministically from agent predictions or answers. **Agentic Score** combines deterministic checks and LLM judge scores across the S1–S5 workflow stages, weighted as: S1 Plan (25%), S2 Setup (15%), S3 Validate (35%), S4 Inference (15%), and S5 Submit (10%). S1, S2, and S3 are evaluated as discrete scores via LLM judge (plan contents, dependency validation, and self-correction); S4 is continuous (completion rate, OOM/timeout); S5 is discrete (format and completeness check). Task-specific metrics (e.g., Macro-Dice) are scored continuously and folded into the Task Score.
Table 3 enumerates the five error categories (E1–E5) used to diagnose failures during evaluation, ensuring transparent partial credit for intermediate progress.
Comparison with Existing Benchmarks
Describes how benchmark runs are organized, logged, and compared to prior work.
This section details how the AutoMedBench evaluation runs are structured, logged, and contrasted with earlier medical agent benchmarks.
Each evaluation cell pairs one agent with one task and one difficulty tier; with six agents, 24 tasks, and two tiers this yields 288 cells.
Segmentation tasks (KiTS19, PanTS Tumor, PanTS OAR, FeTA, AeroPath) use 20 runs because their longer horizons produce higher variance.
Every run logs stage scores, the derived task and overall scores, turn count, wall‑clock time, token usage, estimated inference cost, status, and the full conversation transcript.
Post‑run diagnostics enumerate fired error codes (E1–E5); recovery rate measures the fraction of runs with ≥2 errors that still achieve a task score.
Costs are normalized using the fixed‑rate snapshot from Table 11, ignoring prompt‑caching discounts, to enable fair resource accounting across agents.
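The cell count stated above follows directly from the Cartesian product of agents, tasks, and tiers. The sketch below enumerates it in Python; the agent and task identifiers are placeholders, not the benchmark's real names.

```python
from itertools import product

# Placeholder identifiers (assumptions); only the counts come from the text.
AGENTS = [f"agent_{i}" for i in range(1, 7)]     # six agents
TASKS = [f"task_{i:02d}" for i in range(1, 25)]  # 24 tasks
TIERS = ["LITE", "STANDARD"]                     # two difficulty tiers

# One evaluation cell per (agent, task, tier) combination.
cells = list(product(AGENTS, TASKS, TIERS))
print(len(cells))  # 288
```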
Leaderboard and Performance Summary
Agents excel at workflow steps but lag on final task quality, with varied track performance.
Agents achieve higher Agentic Scores than Task Scores, revealing a systematic performance gap.
Figure 1 shows each model’s Agentic score exceeds its Task score by 5–9 points (e.g., Claude‑Opus‑4.6: Agentic 61.6 vs Task 55.3).
Agentic Score quantifies how far an agent progresses through the multi‑stage workflow, while Task Score evaluates the final medical output (segmentation mask, restored image, VQA answer, report, or detection).
**Figure 1.** **Overall leaderboard.** Overall, agentic, and task scores for the 6 evaluated agents. Agents are ranked by overall scores. The overall score averages the workflow-based agentic score and the held-out task score. Per-track leaderboards are in Figure 9.
**Table 4.** Per-track and overall leaderboard. Scores are averaged over all runs for the tasks and tiers within each track. Agent rows are ordered by the overall leaderboard rank, the overall column is shown on the right, and the highest score in each column is highlighted.
**Table 5.** **More scaffolding does not consistently improve agentic scores.** LITE and STANDARD use the same data, metric, time cap, scoring code, and submission schema, but LITE provides more detailed scaffolding. $\Delta$ reports the relative change from STANDARD to LITE, computed as (LITE - STANDARD)/STANDARD $\times$ 100. Green values indicate improvement under LITE; red values indicate a drop. See tier details in Table 13.
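The $\Delta$ column defined in the Table 5 caption is a plain relative change. A minimal Python sketch follows; the input values are made up for illustration and are not taken from the table.

```python
def relative_change(lite, standard):
    """Relative change from STANDARD to LITE, in percent."""
    if standard == 0:
        raise ValueError("STANDARD score must be non-zero")
    return 100.0 * (lite - standard) / standard


# Hypothetical scores, chosen only to show the sign convention.
drop = relative_change(lite=45.0, standard=50.0)
gain = relative_change(lite=55.0, standard=50.0)
print(drop, gain)  # -10.0 10.0
```

A negative value means the agent scored lower with the extra scaffolding, which corresponds to the red entries in the table.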
Workflow Bottlenecks and Stage Analysis
Stage‑level scores, scaffolding impact, and cost trade‑offs reveal where agents succeed or stall.
Agents consistently score lowest on the validation step, making it the primary choke point in the workflow.
Why does validation become the bottleneck while setup scores highest?
Setup mainly involves installing dependencies and launching a runnable environment—tasks that are largely deterministic. Validation, by contrast, requires the agent to judge data quality and model readiness, which is a much harder, open‑ended decision and therefore fails more often.
Figure 5 breaks down agent performance by workflow stage: the validation stage (S3) has the lowest mean score, whereas the setup stage (S2) tops the chart. This stage‑level view uncovers hidden divergences—agents with similar final scores can differ dramatically in where they fail, such as late‑stage inference versus early planning.
**Figure 5.** Step-level workflow scoring across agents. Scores are shown for the six evaluated agents at each workflow stage: S1 Plan, S2 Setup, S3 Validate, S4 Inference, and S5 Submit. The dashed line marks the mean score across agents for each stage. Setup is the strongest stage on average, while validation is the weakest, showing that agents are better at making pipelines runnable than at checking whether those pipelines are reliable before full inference and submission. A strong agent tends to perform consistently well across steps, as seen for Opus 4.6, whereas other agents show more uneven profiles, such as GPT-5.4.
Adding more task‑brief scaffolding (LITE tier) does not guarantee higher agentic scores.
Table 5 shows that four agents improve under LITE, while two regress; GPT‑5.4 suffers a 16.3% relative drop.
Higher spending does not reliably translate into better performance. Figure 6 shows that cost‑performance correlations are track‑dependent and mostly weak, ranging from a Pearson $r$ of 0.77 in segmentation to near zero in VQA.
**Figure 6.** Higher cost does not reliably translate into better performance. Bars show the mean cost per run for each task track. Insets plot each agent’s track-level cost against score, with Pearson correlation $r$ summarizing the cost–performance relationship within that track. The weak and track-dependent correlations indicate that raw spending is not the main driver of success. The API-price snapshot is listed in Table 11.
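For readers who want to recompute the within-track correlation from their own run logs, here is a minimal sketch of the Pearson coefficient in plain Python. The cost and score values are invented for illustration and do not come from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-agent values within one track (one point per agent).
costs = [1.0, 2.0, 3.0, 4.0]
scores_up = [10.0, 20.0, 30.0, 40.0]    # score rises linearly with cost
scores_down = [40.0, 30.0, 20.0, 10.0]  # score falls linearly with cost

r_up = pearson_r(costs, scores_up)
r_down = pearson_r(costs, scores_down)
```

With only six agents per track, each coefficient rests on six points, so individual values should be read with caution.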
Opus 4.6 tops the leaderboard with the highest overall score, yet it also incurs the greatest average cost per run. By contrast, GLM‑5 achieves the second‑best score while spending considerably less, illustrating that the best‑scoring agent is not always the most cost‑effective choice.
Error-Code Taxonomy and Failure Modes
Agents’ error patterns and recovery capabilities shape overall performance.
We examine the error codes logged after each run to understand which failure modes are most common and how they affect agent performance.
Error codes act like diagnostic flags that pinpoint which stage of the research workflow broke, letting us separate engineering mishaps from knowledge gaps.
Is the Error‑Code Taxonomy just a count of failures?
No — it categorizes failures by their root cause (setup, execution, verification, deliverable, or understanding), so two runs with the same number of errors can differ dramatically in where the breakdown occurs.
Runs with a fired error code suffer a large performance drop.
Runs with one fired error code score 48% lower overall than error‑free runs; runs with two or more errors stay in a low‑score regime.
**Figure 7.** Error codes can sharply derail a run. (a) Distribution of fired error-code types. (b) Mean overall score by the number of fired error codes in a run. Verification and submission errors dominate tagged failures. The first fired error produces a large score drop, and runs with two or more fired errors remain in a low-score regime.
**Figure 8.** Strong agents both avoid errors and recover from them. Left: total fired error-code counts across agents. Right: recovery rate after two or more fired error codes, defined as the percentage of such runs that still reach end-to-end completion and receive an evaluation score.
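The recovery rate defined in the Figure 8 caption can be sketched as follows. The run-record layout (a dict with `error_codes` and `score`, where `score` is `None` when no evaluation score was produced) is an assumption made for illustration.

```python
def recovery_rate(runs, min_errors=2):
    """Percentage of runs with >= min_errors fired codes that still received a score.

    Returns None when no run meets the error threshold.
    """
    eligible = [r for r in runs if len(r["error_codes"]) >= min_errors]
    if not eligible:
        return None
    recovered = sum(1 for r in eligible if r["score"] is not None)
    return 100.0 * recovered / len(eligible)


# Hypothetical run log: two runs have at least two fired codes, one of them recovers.
runs = [
    {"error_codes": ["E3", "E5"], "score": 0.41},
    {"error_codes": ["E2", "E4", "E5"], "score": None},
    {"error_codes": ["E3"], "score": 0.62},
    {"error_codes": [], "score": 0.80},
]
rate = recovery_rate(runs)
print(rate)  # 50.0
```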
Related Work in Agent Evaluation
Defines the per‑task metrics used to compute scores in AUTOMEDBENCH.
Each benchmark track reports a single scalar in the interval [0, 1]; the scalar is the average of per‑case scores, and any missing or unreadable output receives a zero for that case.
Dice measures the overlap between a predicted mask and its ground‑truth mask; macro Dice first averages over all targets in a case, then averages those case‑level scores across the held‑out set.
SSIM evaluates perceptual similarity between a restored image and its private reference; the mean across cases yields the task score.
Accuracy counts a VQA prediction as correct only when it exactly matches the gold answer after normalization.
Each report case is evaluated with a suite of seven text‑generation and entity‑extraction scores; the case score is the unweighted mean of those seven values.
mAP@0.5 measures how well predicted bounding boxes match ground‑truth boxes at an Intersection‑over‑Union threshold of 0.5, averaged over all object classes.
All per‑task scores are finally combined by averaging across the 24 tasks in the benchmark, yielding a single scalar that reflects overall task performance.
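To make the macro Dice aggregation and the zero-for-missing rule concrete, here is a minimal pure-Python sketch. Masks are flattened to 0/1 lists, and the convention that two empty masks score 1.0 is an assumption; the benchmark's evaluator may handle that case differently.

```python
def dice(pred, ref):
    """Dice overlap between two binary masks given as equal-length 0/1 lists."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same length")
    intersection = sum(p * r for p, r in zip(pred, ref))
    total = sum(pred) + sum(ref)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def macro_dice(preds, refs):
    """Average Dice over targets within each case, then over cases.

    A missing case or missing target contributes 0.0.
    """
    case_scores = []
    for case_id, targets in refs.items():
        case_pred = preds.get(case_id, {})
        per_target = [
            dice(case_pred[name], mask) if name in case_pred else 0.0
            for name, mask in targets.items()
        ]
        case_scores.append(sum(per_target) / len(per_target))
    return sum(case_scores) / len(case_scores)


# Toy data: case c2 has no prediction, so it scores zero.
refs = {
    "c1": {"kidney": [1, 1, 0, 0], "tumor": [0, 0, 1, 0]},
    "c2": {"kidney": [1, 0, 0, 0]},
}
preds = {"c1": {"kidney": [1, 0, 0, 0], "tumor": [0, 0, 1, 0]}}

score = macro_dice(preds, refs)
```

In this toy case, c1 averages a kidney Dice of 2/3 and a tumor Dice of 1.0 to 5/6, and the missing c2 pulls the macro score down to 5/12.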
Lite vs Standard Difficulty Tiers
Comparison of Lite and Standard tiers clarifies how task briefs differ while everything else stays constant.
The ablation study isolates the effect of the task brief by offering two difficulty tiers: a “Lite” tier with a fully specified method and a “Standard” tier that leaves method selection to the agent.
Lite provides a ready‑made workflow; Standard forces the agent to choose and justify its own method within bounded families.
How does the Standard tier differ from Lite beyond simply omitting the method name?
Standard replaces the concrete method with high‑level guidance (e.g., “choose a model from family X”), requires the agent to resolve dependencies, design its own validation plan, and justify the chosen workflow in plan.md, whereas Lite pins the exact packages, scripts, and checkpoints.
Table 13 compares the two tiers along eight dimensions with concrete wording, confirming that the data, references, metrics, time limits, scoring code, and submission schema remain identical across tiers.
Scoring Rubric Details
The section details the multi‑stage scoring rubric used to evaluate segmentation workflows.
Table 7 enumerates the five‑stage scoring rubric applied to the segmentation task, indicating which items are judged by an LLM and which by deterministic checks.
**Table 7.** Segmentation workflow-scoring rubric details. S1–S3 use LLM judge scores from saved artifacts and execution traces. S4–S5 use deterministic checks from the evaluator.
The rubric splits a workflow into five stages, assigns a numeric score to each, and aggregates them into the Agentic Score.
The rubric follows the public segmentation evaluator¹ and the full codebase is hosted on GitHub².
Evaluated Model Configurations
Base models and their deployment characteristics for the benchmark.
The benchmark evaluates six base models spanning both proprietary APIs and open‑weight releases. Uniform evaluation settings ensure that differences in performance stem from the models themselves, not from variations in prompts, tools, or scoring.
Resource Usage and Cost Analysis
Resource use and pricing are broken down by agent and task track.
This section quantifies the computational and monetary resources consumed by each evaluated agent and by task track, and records the API pricing assumptions underlying the cost analysis.
**Table 9.** Average resource use per run. Time is wall-clock minutes, turns are conversational turns, tokens are total LLM tokens, and cost is normalized USD under the rate snapshot described in §3.
**Table 10.** Average resource use per run by task track. Values are averaged over agents, tiers, and task settings within each track, excluding Kimi and weighted by run count.
**Table 11.** API price snapshot. Input and output prices are USD per million tokens. We apply these fixed rates to all runs, with no prompt-cache discounts or negotiated discounts. Model IDs follow OpenRouter.
Detailed Workflow Requirements
Defines the five workflow steps and their scoring evidence.
Table 14 enumerates the five workflow stages, the concrete work expected at each stage, and the evidence the benchmark uses to score agents.
S1 Plan: the agent must understand the task brief, target artifact, inputs, output format, and metric, then research feasible methods and select an approach. The plan.md file must list execution steps, expected outputs, and validation checks; scoring looks at the notes, method rationale, and consistency with the brief.
S2 Setup: the agent installs dependencies, prepares the software environment, loads allowed pre‑trained weights or configures model‑inference APIs, and verifies data paths, scripts, and output directories. Scoring checks for successful commands, runnable scripts, and the presence of required packages and setup checks before validation.
S3 Validate: the agent runs a pilot on a small public subset, inspects intermediate outputs for correct shape, format, and clinical plausibility, and fixes any pipeline errors before scaling. Evidence consists of pilot outputs, validation logs, debugging evidence and corrected reruns, and explicit validation notes in the execution trace.
S4 Inference: the agent runs the chosen pipeline on the full evaluation input set and writes prediction files for every case. Scoring uses deterministic evaluator checks, such as completion rate and out‑of‑memory or timeout failures.
S5 Submit: the agent verifies that saved predictions match the required submission schema and submits only the final artifacts to the evaluator. Scoring uses a deterministic format and completeness check.
Error-Code Definitions
Defines post‑run error codes and outlines validation steps for completed runs.
A completed run proceeds in three steps: the agent executes inference commands and generates outputs; the harness checks output completeness and validates the schema; finally, the offline evaluator receives the submitted files.
This appendix defines the post‑run error codes recorded in the detailed report after the agent interaction ends.
E1 – Understanding error: the agent solves the wrong problem or selects an approach incompatible with the task objective, modality, metric, constraints, or required artifact.
E2 – Data/model setup error: the agent understands the task but fails to correctly access, prepare, load, or configure the necessary data, models, APIs, dependencies, or runtime resources.
E3 – Verification/recovery error: the run produces invalid intermediate or final outputs, yet the agent does not detect, validate, debug, or recover from the problem.
E4 – Implementation/execution error: the intended pipeline is plausible, but the agent’s code, commands, or processing logic fail during execution.
E5 – Deliverable/submission error: usable outputs exist or could have been produced, but the final artifacts are missing, incomplete, malformed, wrongly named, misplaced, or incompatible with the evaluator schema.
Typical evidence includes incorrect task interpretation, wrong paths or dependency conflicts, and failed sanity checks such as ignored logs or empty outputs.
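One way to work with these codes in post-run analysis is to model them as an enumeration and tally their shares across runs. The sketch below assumes each run is logged as a list of fired codes; the class and function names and the toy log are hypothetical.

```python
from collections import Counter
from enum import Enum


class ErrorCode(Enum):
    """The five post-run error categories defined above."""
    E1 = "understanding"
    E2 = "data/model setup"
    E3 = "verification/recovery"
    E4 = "implementation/execution"
    E5 = "deliverable/submission"


def code_shares(runs):
    """Percentage of all fired codes that fall into each category."""
    counts = Counter(code for fired in runs for code in fired)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Counter returns 0 for categories that never fired.
    return {code: 100.0 * counts[code] / total for code in ErrorCode}


# Hypothetical log: four runs, four fired codes in total.
runs = [
    [ErrorCode.E3, ErrorCode.E5],
    [ErrorCode.E5],
    [],
    [ErrorCode.E2],
]
shares = code_shares(runs)
print(shares[ErrorCode.E5], shares[ErrorCode.E1])  # 50.0 0.0
```

Shares computed this way are over fired codes rather than over runs, so a single run that fires several codes counts once per code.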
Per-Task Scoring Details
Overview of the error‑code taxonomy and per‑task scoring breakdown.
The section introduces a five‑category error‑code taxonomy (E1–E5) that captures failures from choosing the wrong high‑level approach (E1) to final artifact packaging errors (E5).
Per‑task scoring reports three numbers for each agent—overall, agentic, and task scores—so that a model that excels overall but underperforms on a specific track can be identified.