When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin

TOOLMAZE evaluates LLM agent robustness by injecting structured tool failures into DAG-based task workflows.

How do LLM agents perform when tools fail, and can they dynamically recover by retrying or switching paths?

LLM agents are typically benchmarked on "happy path" scenarios, leaving them vulnerable to real-world tool failures like network timeouts or semantically corrupted data. TOOLMAZE shifts evaluation to a two-dimensional grid: it maps tasks onto Directed Acyclic Graphs (DAGs) of varying complexity and injects four distinct perturbation modes (explicit/implicit crossed with transient/permanent) at pre-specified nodes. Experiments show that agents struggle significantly with implicit semantic errors, where recovery rates fall by over 37 percentage points relative to explicit failures, and that fault tolerance does not scale naturally with model size.

Paper Primer

The core mechanism is a deterministic perturbation engine that intercepts tool calls at specific nodes in a pre-generated DAG. By making agents navigate these controlled obstacles, the framework forces a transition from linear execution to active state-space exploration, where the agent must detect anomalies and re-plan using alternative tool paths.

Implicit semantic failures represent a fundamental bottleneck for agentic robustness.

The Perturbation Recovery Rate (PRR) gap between explicit and implicit failures averages 37.15 percentage points across models. PRR for implicit permanent failures drops to 17.58%, compared to 38.12% for explicit permanent failures.

Fault tolerance is a distinct capability that does not emerge from general scaling.

Log-linear analysis shows that while baseline task success grows by 17.85 percentage points per order of magnitude in parameter count, fault tolerance (PRR) grows by only 4.88 percentage points. Scaling provides roughly 3.66x more gain in basic task completion than in recovery capability.
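The arithmetic behind this comparison can be checked with a short sketch. It assumes the log-linear fit takes the form "gain = slope × log10(parameter ratio)"; the 7B and 70B model sizes are hypothetical and serve only to illustrate a tenfold increase.

```python
import math

# Slopes quoted in the text: percentage points gained per 10x parameters.
TSR_SLOPE_PP = 17.85
PRR_SLOPE_PP = 4.88


def projected_gain(slope_pp: float, params_small: float, params_large: float) -> float:
    """Gain in percentage points between two model sizes under a log-linear fit."""
    return slope_pp * math.log10(params_large / params_small)


# Ratio of the two slopes, i.e. how much more scaling helps TSR than PRR.
ratio = TSR_SLOPE_PP / PRR_SLOPE_PP
print(f"TSR/PRR slope ratio: {ratio:.2f}")

# Hypothetical model sizes (7B -> 70B), chosen only to show a 10x step.
print(f"TSR gain: {projected_gain(TSR_SLOPE_PP, 7e9, 70e9):.2f} pp")
print(f"PRR gain: {projected_gain(PRR_SLOPE_PP, 7e9, 70e9):.2f} pp")
```

The slope ratio reproduces the roughly 3.66x figure quoted above.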

Why is a DAG-based approach necessary for this benchmark?

Standard benchmarks lack the structural ground truth to distinguish between systematic replanning and lucky trial-and-error. By using DAGs, TOOLMAZE defines a complete space of valid recovery paths, allowing for precise measurement of whether an agent is actually solving the problem or just guessing.

What is the difference between explicit and implicit failures in this context?

Explicit failures are machine-readable exceptions (e.g., HTTP 404) that clearly block execution. Implicit failures return structurally valid but semantically incorrect data (e.g., a negative stock count), which agents often blindly propagate, leading to cascading logic errors.
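The distinction can be illustrated with a minimal sketch. The names `ToolError`, `looks_sane`, and `classify` are hypothetical and are not part of TOOLMAZE; the sanity check simply encodes the negative-stock example above.

```python
from typing import Any, Callable


class ToolError(Exception):
    """Explicit failure: the call raises, so the caller cannot miss it."""


def explicit_failure() -> dict[str, Any]:
    raise ToolError("HTTP 404: resource not found")


def implicit_failure() -> dict[str, Any]:
    # Structurally valid payload carrying a semantically impossible value.
    return {"sku": "A-100", "stock_count": -5}


def looks_sane(payload: dict[str, Any]) -> bool:
    """Hypothetical semantic check: a stock count must be a non-negative integer."""
    count = payload.get("stock_count")
    return isinstance(count, int) and count >= 0


def classify(call: Callable[[], dict[str, Any]]) -> str:
    """Label a tool call as 'explicit', 'implicit', or 'ok'."""
    try:
        payload = call()
    except ToolError:
        return "explicit"
    return "ok" if looks_sane(payload) else "implicit"


print(classify(explicit_failure), classify(implicit_failure))
```

An agent that skips the `looks_sane` step treats the implicit case as a success and passes the corrupted value downstream, which is the behaviour the benchmark is designed to expose.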

Abstract

We present TOOLMAZE, a benchmark exposing tool‑failure challenges for LLM agents.

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) on idealized “happy paths,” largely overlooking real‑world tool failures. To expose this gap we introduce TOOLMAZE, a benchmark that tests dynamic path discovery and error recovery in TIR agents. TOOLMAZE varies tasks along two axes: DAG‑based topological complexity and a taxonomy of tool perturbations (explicit vs. implicit, transient vs. permanent).

Across nearly all models, perturbations degrade performance, with the sharpest drops under implicit semantic failures. Perturbation Recovery Rate (PRR) falls by roughly 37 percentage points in these scenarios, revealing systemic over‑trust in corrupted outputs. Complex topologies further trap agents in futile trial‑and‑error loops, slowing fault‑tolerance improvement relative to basic task execution.

These findings highlight dynamic re‑planning as a distinct bottleneck that scaling or prompting alone does not resolve. All data and code are released publicly.

The Tool-Use Reliability Gap

Identifying why current tool‑using LLM benchmarks miss real‑world failures.

Existing benchmarks evaluate LLM agents only on idealized “happy paths,” ignoring the frequent explicit and implicit tool failures that occur in practice. This blind spot raises a crucial question: how resilient are agents when tools misbehave or return corrupted data?

TIR agents treat external tools as active problem‑solving components rather than static knowledge look‑ups, enabling the model to issue calls, consume results, and continue reasoning.

**Figure 1.** An illustrative example of agent behavior under tool failure. The unstable agent aborts the task after an endless retry loop, whereas the robust agent wisely bypasses repeated failures by switching to an alternative tool.

The shift from static knowledge to dynamic tool‑use requires new robustness benchmarks.

Related Work

Survey of recent tool‑use benchmarks and robustness studies.

Early work (Guo et al., 2024; Li et al., 2023; Qin et al., 2023) demonstrated that large language models can invoke external tools, establishing the baseline for tool‑use abilities.

More recent paradigms (Lu et al., 2025; Froger et al., 2025; Wang et al., 2025; Wölflein et al., 2025; Wijk et al., 2024; Cai et al., 2025) target stateful, open‑ended environments, and benchmarks such as Multi‑Mission Tool Bench and STT‑Arena stress robustness under evolving missions and spatio‑temporal disruptions.

Planner‑centric agents construct global DAGs to manage multi‑tool dependencies, yet existing suites still rarely isolate recovery from tool failures or hallucinations.

Robustness of Tool‑Integrated Reasoning agents has become a focal point, with works injecting dynamic command generation, tool‑response manipulation, and noise to probe reliability.

Benchmarks such as $\tau$‑bench, AgentNoiseBench, ReliabilityBench, AgentProp‑Bench, and ToolGym each introduce distinct perturbation modes—user noise, tool noise, fault injection, or intermediate failures—to evaluate how agents maintain performance under realistic disturbances.

The TOOLMAZE Framework

We describe the curated tool corpus and the topological task complexity levels used to build benchmark DAGs.

The benchmark constructs tasks from a curated tool corpus, then varies them along two orthogonal axes—task‑graph topology and perturbation mode—to form a structured evaluation space.

TOOLMAZE is a three‑stage pipeline that (1) builds DAG‑structured tasks from a curated set of external tools, (2) injects controlled perturbations, and (3) evaluates agents on recovery and success metrics.

How does the TOOLMAZE pipeline differ from a generic data‑augmentation routine?

Unlike simple augmentation that only modifies input text, TOOLMAZE explicitly models tool‑level failures (transient vs. permanent) and enforces a DAG structure that respects functional roles, so the agent must both discover a viable execution path and recover from injected errors.

Toy example with three tools $t_1$, $t_2$, $t_3$: task construction links them in order ($t_1 \to t_2 \to t_3$), satisfying the requirement of at least one Source and one Action node.

The perturbation engine marks $t_2$ with a “Transient” flag, so the injected failure clears after the first failed call and a single retry succeeds.

The agent invokes $t_1$ (succeeds), then $t_2$ (fails), retries $t_2$ (succeeds on second attempt), and finally $t_3$ (succeeds), producing a complete answer.

This toy run shows how the framework forces the agent to handle a temporary tool failure while preserving the overall DAG semantics.
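The toy run can be reproduced with a small simulation. This is a sketch under stated assumptions: the three tool bodies, the `make_flaky` wrapper, and the one-retry budget are hypothetical stand-ins, not the benchmark's own agent loop.

```python
class TransientFault(Exception):
    """Raised by a tool whose failure clears on a later call."""


def make_flaky(tool, failures):
    """Wrap a tool so its first `failures` calls raise, then it behaves normally."""
    remaining = failures

    def wrapped(x):
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise TransientFault("temporary failure")
        return tool(x)

    return wrapped


def run_chain(chain, value, max_retries=1):
    """Run tools in order, retrying each one up to `max_retries` times."""
    trace = []
    for name, tool in chain:
        for _ in range(max_retries + 1):
            try:
                value = tool(value)
            except TransientFault:
                trace.append((name, "fail"))
            else:
                trace.append((name, "ok"))
                break
        else:
            raise RuntimeError(f"{name} did not recover within {max_retries} retries")
    return value, trace


# Hypothetical tool bodies standing in for t1, t2, t3.
def t1(x): return x + 1
def t2(x): return x * 2
def t3(x): return f"answer={x}"


chain = [("t1", t1), ("t2", make_flaky(t2, failures=1)), ("t3", t3)]
result, trace = run_chain(chain, 1)
print(result, trace)
```

The trace records one failed and one successful call to `t2`, matching the sequence described above; with `max_retries=0` the same chain would abort at `t2`.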

Topological task complexity is captured by four DAG templates (C1–C4) that increase the number of parallel branches and the depth of sequential dependencies, providing a graded difficulty ladder for agents.

Why isn’t C2 just a special case of C3 with some branches omitted?

Although C2’s graph can be embedded in C3, the benchmark treats them as distinct because C2 guarantees a single action node, whereas C3 allows multiple actions and thus a richer set of recovery choices; agents must therefore learn different coordination strategies.

**Figure 2.** Overview of the TOOLMAZE framework, illustrating its main components: (1) task generation from a curated tool corpus, (2) four levels of topological task complexity (C1–C4), (3) 2 × 2 taxonomy of perturbation modes (P1–P4), and (4) the evaluation framework with metrics including TSR, PRR, and RC.

Perturbation Modes

Construct a two‑dimensional benchmark by varying task topology and perturbation mode.

The evaluation matrix systematically varies two orthogonal axes—task topology and perturbation mode—to probe an agent’s fault tolerance and path‑recovery capability.

Four perturbation modes combine the nature of the failure (explicit vs. implicit) with its longevity (permanent vs. transient), yielding a full factorial coverage of realistic tool‑failure scenarios.

How do these perturbation modes differ from standard error‑handling heuristics?

Standard heuristics usually retry on any exception (treating all failures as transient). Here the modes separate explicit from implicit failures and permanent from transient, so an agent must both detect silent semantic errors and decide whether a simple retry suffices or a full path replanning is required.

With the perturbation taxonomy defined, the benchmark construction proceeds by selecting a topology level and a perturbation mode, then instantiating the corresponding DAG instance.

Choose a Topological Task Complexity level (C1–C4) that fixes the DAG shape and the number of alternative tool‑call paths.

Select a Perturbation Mode (P1‑P4) that determines the failure type and its persistence.

Combine the chosen topology and perturbation to generate a concrete benchmark instance.

Run the LLM agent on the instance, recording task success and path‑recovery metrics.

Repeat across all 4 × 4 combinations to fill the evaluation matrix.
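The steps above can be sketched as an enumeration of the evaluation grid. The mode numbering follows the case-study captions (P1 explicit-transient, P2 explicit-permanent, P3 implicit-transient, P4 implicit-permanent); the `PerturbationMode` class and the `expected_recovery` policy are hypothetical illustrations rather than rules stated by the benchmark.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PerturbationMode:
    name: str
    explicit: bool   # True: visible error; False: silently corrupted output
    transient: bool  # True: clears on retry; False: persists for the episode

    def expected_recovery(self) -> str:
        """Illustrative recovery policy (an assumption, not a benchmark rule)."""
        detect = "catch the error" if self.explicit else "verify the content"
        act = "retry the same tool" if self.transient else "switch to an alternative path"
        return f"{detect}, then {act}"


MODES = (
    PerturbationMode("P1", explicit=True, transient=True),
    PerturbationMode("P2", explicit=True, transient=False),
    PerturbationMode("P3", explicit=False, transient=True),
    PerturbationMode("P4", explicit=False, transient=False),
)
LEVELS = ("C1", "C2", "C3", "C4")

# Every (complexity level, perturbation mode) cell of the 4 x 4 matrix.
grid = list(product(LEVELS, MODES))
for level, mode in grid:
    print(level, mode.name, "->", mode.expected_recovery())
```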

The DAG contains three parallel sub‑graphs, each offering two interchangeable tool calls.

An implicit transient fault corrupts the output of the first tool in sub‑graph $A$, producing a syntactically valid JSON with a wrong field value.

The agent detects the semantic mismatch, retries the same tool (transient), receives a correct output, and proceeds without rerouting.

Metrics recorded: task succeeds (TSR = 1) and the recovery required only a single retry (PRR = 1 for this instance).

This example shows that implicit transient faults demand content verification before retry, whereas explicit transient faults would be caught by an exception.

The resulting 4 × 4 matrix captures how agents trade off planning breadth against fault‑type handling, providing a comprehensive picture of their robustness.

Task Generation Pipeline

The pipeline builds tool‑call DAGs, enumerates all recovery paths, and turns them into validated natural‑language tasks.

This section details how the evaluation matrix is populated. We first synthesize a directed‑acyclic graph of tool calls, then enumerate every admissible recovery path, and finally translate the graph into a validated natural‑language request.

Step 1 – DAG Assembly and Validation: an LLM‑based DAG Architect samples tools from the corpus, connects them into a graph whose nodes are tool calls and edges are data‑flow dependencies, then a secondary LLM checks (a) that the graph is acyclic and (b) that each node’s output can be bound to its successors’ inputs.

Step 2 – Solution‑Space Enumeration: functionally equivalent tools or sub‑graphs are clustered, yielding 1‑to‑N substitutability relations; traversing these alternatives produces the exhaustive set of valid topological orderings, after which redundant cycles and suboptimal chains are pruned and the shortest sequence $s^*$ is recorded as the baseline recovery path.

Step 3 – Task Naturalisation: the DAG is collapsed into a concise task specification (e.g., {"goal":"stock price in EUR","entity":"AAPL"}); a second LLM rewrites this skeleton into a fluent user query, and an independent LLM reconstructs the tool dependencies from the query; the task is kept only if the reconstructed dependencies exactly match the original DAG.
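The two structural checks in Step 1 can be sketched with the standard library. In this sketch a static lookup table replaces the secondary LLM's semantic judgement, which is a deliberate simplification; `COMPATIBLE_FIELDS` and `bindings_valid` are hypothetical names, and the field names are borrowed from the toy example that follows.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(predecessors: dict[str, set[str]]) -> bool:
    """Return True if the graph (node -> set of predecessor nodes) has no cycle."""
    try:
        # static_order() is lazy, so consume it to trigger cycle detection.
        tuple(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True


# Hypothetical compatibility table standing in for the LLM's semantic check.
COMPATIBLE_FIELDS = {("price_usd", "amount_usd")}


def bindings_valid(bindings: list[tuple[str, str]]) -> bool:
    """Each binding pairs a producer's output field with a consumer's input argument."""
    return all(pair in COMPATIBLE_FIELDS for pair in bindings)


dag = {"B": {"A"}}  # B consumes A's output
print(is_acyclic(dag), bindings_valid([("price_usd", "amount_usd")]))
```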

Assemble nodes: A → B, where A’s output field is “`price_usd`” and B expects “`amount_usd`”.

Semantic validation confirms that “`price_usd`” can feed into B’s “`amount_usd`” argument.

Cluster substitution: Tool A and the alternative finance API are marked interchangeable, creating two parallel paths A₁→B and A₂→B.

Enumerate orderings: both paths are topologically identical (A→B), so the solution space contains exactly two valid recovery strategies.

Shortest baseline $s^*$ is the single‑step sequence A→B (no extra hops).

Distill to spec: {"goal":"stock price in EUR","entity":"AAPL"}; LLM rewrites to “What’s Apple’s current price in euros?”

Reverse‑validation reconstructs A→B from the query; match succeeds, so the task is accepted.

This toy example shows how substitutable tool nodes multiply the number of admissible recovery paths without changing the DAG’s topology, directly affecting the Perturbation Recovery Rate metric.
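For a linear chain, the enumeration in this toy example reduces to a Cartesian product over substitution groups. The sketch below assumes that simplification (it does not handle branching DAGs); the helper names `enumerate_paths` and `recovery_paths` are hypothetical.

```python
from itertools import product


def enumerate_paths(slots: list[list[str]]) -> list[list[str]]:
    """slots: ordered substitution groups, each listing interchangeable tools."""
    return [list(choice) for choice in product(*slots)]


def recovery_paths(paths: list[list[str]], failed_tool: str) -> list[list[str]]:
    """Paths that remain valid once `failed_tool` is permanently unavailable."""
    return [path for path in paths if failed_tool not in path]


# Two interchangeable source tools (A1, A2) feeding a single consumer B.
slots = [["A1", "A2"], ["B"]]
paths = enumerate_paths(slots)
baseline = min(paths, key=len)  # shortest sequence, playing the role of s*

print(paths)
print(recovery_paths(paths, failed_tool="A1"))
```

With one substitutable slot the solution space holds two paths, and a permanent failure of `A1` leaves exactly one recovery path through `A2`.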

Runtime Perturbation Engine

The engine injects deterministic faults into tool calls during agent inference.

The Perturbation Engine sits between the agent and the tool simulator. It enforces a predefined fault profile for each task, ensuring that every model sees the same synthetic failures. This deterministic behavior isolates the effect of tool faults from stochastic variation.

The engine follows a per‑task fault profile, replacing the real tool response with a fixed synthetic reply whenever the call matches a rule.

How does this differ from typical random fault injection used in robustness testing?

Random injection samples failures on the fly, so each model may see a different set of faults. Deterministic injection fixes the failure per task, guaranteeing that every model encounters the exact same synthetic response, which isolates the agent’s recovery behavior from stochastic noise.

Load the task’s perturbation profile (the target tool on the baseline path $s^*$ and its synthetic response).

Agent issues a tool call during inference.

Engine compares the call against the profile’s fault rule.

If the call matches, return the predefined synthetic response.

If the call does not match, forward the request to the standard tool simulator.

For permanent modes, mark the targeted tool as unavailable for all subsequent calls in the same episode.

Engine loads the profile: {tool = search, response = 404}.

Agent calls “search”. Engine finds a match in the profile.

Engine returns the synthetic 404 response instead of the real search results.

Subsequent calls to “search” in this episode are blocked, simulating a permanently unavailable tool.

This concrete run shows how a single deterministic fault can cascade, forcing the agent to reroute or abort without any randomness.
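A minimal sketch of such an engine is shown below. The class and field names (`FaultProfile`, `PerturbationEngine`, `transient_failures`) and the simulator signature are assumptions made for illustration; the released code may be organized differently.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FaultProfile:
    tool: str                    # name of the targeted tool
    response: Any                # fixed synthetic reply returned on a match
    permanent: bool = True       # False means the fault clears after some calls
    transient_failures: int = 1  # faulty calls before recovery (transient only)


@dataclass
class PerturbationEngine:
    profile: FaultProfile
    simulator: Callable[[str, dict], Any]
    hits: int = 0  # number of calls made to the targeted tool so far

    def call(self, tool: str, args: dict) -> Any:
        # Calls that do not match the profile go straight to the simulator.
        if tool != self.profile.tool:
            return self.simulator(tool, args)
        self.hits += 1
        # Permanent faults always fire; transient ones fire only for the first calls.
        if self.profile.permanent or self.hits <= self.profile.transient_failures:
            return self.profile.response
        return self.simulator(tool, args)


def simulator(tool: str, args: dict) -> dict:
    """Hypothetical stand-in for the standard tool simulator."""
    return {"tool": tool, "status": 200}


engine = PerturbationEngine(FaultProfile("search", {"status": 404}), simulator)
print(engine.call("search", {}), engine.call("search", {}), engine.call("lookup", {}))
```

Because the profile is fixed per task and no random numbers are drawn, two agents issuing the same sequence of calls receive identical responses.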

**Table 1.** Domain distribution, tool-set similarity, and topological path statistics across C1-C4.

Evaluation Metrics

Defines the three core metrics and how they are measured across perturbations.

We evaluate agent robustness along three complementary dimensions: overall completion (Task Success Rate, TSR), recovery capability (Perturbation Recovery Rate, PRR), and replanning efficiency (Recovery Cost, RC).

TSR measures the fraction of evaluation runs that finish the assigned task, irrespective of any perturbations encountered.

How does TSR differ from a simple accuracy measure on individual tool calls?

TSR looks at the end‑to‑end task outcome: a run counts as successful whenever the overall goal is achieved, even if some intermediate tool calls failed and were recovered along the way, and as failed otherwise.

PRR captures the proportion of perturbed runs where the agent successfully recovers after a fault.

Why not just report the raw number of recovered runs?

Raw counts conflate recovery ability with perturbation frequency; PRR normalizes by the number of perturbed runs, isolating the agent’s true recovery capability.

RC measures how close the agent’s actual execution cost is to the optimal cost for the same task.

Is a lower RC always better, even if the agent chooses a safer but longer path?

Yes. RC compares the observed cost to the optimal cost; taking extra safe steps inflates $c(\tau)$ and raises RC, indicating less efficient replanning. The metric assumes the optimal plan is known for evaluation.

In multi‑path tasks, faults are attached to the first tool the agent invokes within an alternative group, preventing the same group from being perturbed multiple times. For globally single‑alternative groups the rule applies once per trajectory; for tasks with parallel slots it applies independently per slot.
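The attachment rule for a single alternative group can be sketched as follows. The `GroupFaultBinder` class is hypothetical, the tool names are illustrative, and the per-slot variant for parallel slots is not modelled here.

```python
class GroupFaultBinder:
    """Bind the fault to the first tool invoked within each alternative group."""

    def __init__(self, groups: dict[str, set[str]]):
        # Reverse index: tool name -> name of its alternative group.
        self._group_of = {tool: g for g, tools in groups.items() for tool in tools}
        self._victim: dict[str, str] = {}

    def is_victim(self, tool: str) -> bool:
        group = self._group_of.get(tool)
        if group is None:
            return False  # tool belongs to no alternative group
        # The first tool seen in a group becomes its victim; later ones stay clean.
        return self._victim.setdefault(group, tool) == tool


binder = GroupFaultBinder({"weather": {"weatherapi", "visualcrossing"}})
print(binder.is_victim("weatherapi"), binder.is_victim("visualcrossing"))
```

Whichever alternative the agent tries first is perturbed, so an agent cannot avoid the fault by luck, while the remaining alternatives stay available as recovery paths.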

Select a perturbation mode $m$ (NP, P1‑P4) and a prompt condition (with or without hint).

Run a fixed number of evaluation trajectories (e.g., 100) for the chosen mode and prompt.

For each trajectory record the binary flags $I_{\text{succ}}$, $I_{\text{pert}}$, $I_{\text{recov}}$ and the step count $c(\tau)$.

Compute TSRₘ, PRRₘ, and RC for the mode from the metric definitions above.

Average the per‑mode values to obtain the overall scores reported in Table 2.
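The procedure can be sketched in code from the prose definitions alone. TSR and PRR follow directly from the descriptions above; the RC formula used here (mean relative overhead of the observed step count over the optimal one) is an assumption, since this summary does not reproduce the paper's exact expression. The `Trajectory` record and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    succ: bool         # task completed
    pert: bool         # a perturbation was encountered
    recov: bool        # the agent recovered from the perturbation
    cost: int          # observed step count c(tau)
    optimal_cost: int  # step count of the optimal plan


def tsr(runs: list[Trajectory]) -> float:
    """Fraction of all runs that finish the task."""
    return sum(r.succ for r in runs) / len(runs)


def prr(runs: list[Trajectory]) -> float:
    """Fraction of perturbed runs in which the agent recovered."""
    perturbed = [r for r in runs if r.pert]
    return sum(r.recov for r in perturbed) / len(perturbed) if perturbed else float("nan")


def rc(runs: list[Trajectory]) -> float:
    """ASSUMED form: mean relative overhead (cost - optimal) / optimal."""
    return sum((r.cost - r.optimal_cost) / r.optimal_cost for r in runs) / len(runs)


# Hypothetical trajectories for one perturbation mode.
runs = [
    Trajectory(True, True, True, 4, 3),
    Trajectory(False, True, False, 6, 3),
    Trajectory(True, False, False, 3, 3),
    Trajectory(True, True, True, 3, 3),
]
print(f"TSR={tsr(runs):.2f} PRR={prr(runs):.2f} RC={rc(runs):.2f}")
```

Normalizing PRR by the number of perturbed runs, rather than by all runs, keeps it independent of how often faults fire.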

**Table 2.** Performance comparison of various models with and without hints across metrics (TSR, PRR, RC) and perturbation modes (P1-P4).

Experiments Overview

We report recovery performance and cost across perturbation modes.

The experiments introduce two new evaluation dimensions: Perturbation Recovery Rate (PRR), which measures whether an agent can recover from a tool failure regardless of final task outcome, and Recovery Cost (RC), which penalizes unnecessary tool calls during recovery.

RC captures how efficiently an agent replans after a perturbation by comparing the actual number of tool calls it makes to the theoretical minimum.

Main Results

Key performance findings across perturbations and composite model rankings.

The paper evaluates LLM agents on the TOOLMAZE benchmark, which stresses dynamic path discovery and error recovery under varied task complexities and perturbation modes, unlike prior happy‑path tests. Table 2 summarizes the primary results.

Failure‑aware prompts consistently boost model performance under perturbations.

Across all evaluated models, the with‑hint configuration yields improvements ranging from +1.5 % to +20.8 %.

When moving from the ideal Non‑Perturbed condition to any perturbation mode, all models suffer large drops in TSR and PRR and see higher RC, indicating that tool‑use robustness is not a by‑product of general instruction ability.

Complexity Analysis

Increasing topological complexity degrades success metrics even though every level is averaged over the same perturbation modes.

Performance on TSR and PRR declines while RC rises as topological task complexity increases from C1 to C4.

Figure 3 shows mean TSR and PRR peak at C2 and drop at higher levels, whereas RC exhibits the opposite trend.

The perturbation dimension is held constant across all runs, so observed trends isolate the effect of increasing topological complexity.

**Figure 3.** Mean metrics across complexity levels C1–C4, averaged over all evaluated models. Solid (dashed) lines correspond to the w/ (w/o) hint prompt.

Recovery Trends and Scaling

PRR falls while RC rises with harder perturbations, and scaling improves task success more than fault‑tolerance.

PRR falls while RC rises as perturbations become harder, and the failure‑aware hint prompt consistently yields higher PRR and lower RC than the standard prompt.

Across modes P1–P4, PRR drops from >90 % to <20 % and RC climbs from <10 % to >70 %.

**Figure 4.** Mean PRR and RC across perturbation modes P1–P4, averaged over all evaluated models. Solid (dashed) lines correspond to the w/ (w/o) hint prompt.

All comparisons keep model architecture, training data, and evaluation protocol constant; only the perturbation mode and prompt variant change.

The Implicit-Explicit Trust Gap

Benchmarks miss tool failures; here we quantify the trust gap between explicit and implicit perturbations.

Benchmarks typically assume tools work perfectly, but real deployments face silent failures. This section revisits that premise and measures how much agents over‑trust faulty tool outputs.

The gap quantifies how much worse agents recover when tool failures are hidden (implicit) versus when they are signaled (explicit).

What exactly does the implicit‑explicit PRR gap measure?

It measures the drop in Perturbation Recovery Rate when a tool’s failure is hidden (implicit) compared to when the failure is explicitly indicated (explicit), holding everything else constant.

**Figure 6.** Average PRR (w/ hint) under explicit versus implicit perturbations.

The implicit‑explicit PRR gap averages 37.15 percentage points across all models and perturbation settings (Figure 6).

For transient perturbations the gap is 53.75 %.

For permanent perturbations the gap is 20.54 %.

The gap stays strictly positive for every model, indicating a systematic blind spot to implicit errors. The smaller permanent gap stems from a floor effect: PRR is already low (38.12 %) for explicit permanent failures, leaving little headroom for further degradation. Recovering from permanent errors is inherently hard even when they are explicit, requiring complex replanning or graceful termination.
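The reported gaps are mutually consistent, as a quick check shows. The sketch assumes the overall gap is the unweighted mean of the transient and permanent gaps, which the text does not state explicitly.

```python
# PRR values and gaps quoted in the text, in percent.
explicit_permanent = 38.12
implicit_permanent = 17.58
transient_gap = 53.75

# The permanent gap is the absolute difference between the two PRR values.
permanent_gap = explicit_permanent - implicit_permanent

# ASSUMPTION: the overall gap is the unweighted mean of the two gaps.
mean_gap = (transient_gap + permanent_gap) / 2

print(f"permanent gap: {permanent_gap:.2f} pp")
print(f"mean gap: {mean_gap:.3f} pp")
```

The difference reproduces the 20.54 figure, and the mean of about 37.145 agrees with the reported 37.15 up to rounding, which also confirms that the gaps are absolute differences in percentage points rather than relative drops.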

Qualitative Case Studies

We examine how agents handle tool failures across explicit/implicit and transient/permanent perturbations.

Explicit‑Transient (`C1_task_089`): the tool `convert_datetime` raises an explicit InternalServerError on the first call. The successful model retries until the tool returns a valid timestamp; the failing model calls it once and aborts the downstream pipeline, leaving the reservation incomplete.

Explicit‑Permanent (`C1_task_089`): `convert_datetime` returns a permanent AccountQuotaExhausted. The successful model detects the irrecoverable error and stops gracefully; the failing model ignores the stop condition and continues invoking downstream tools, violating the stop‑on‑permanent‑error rule.

**Figure 9.** $C1\mathcal{P}3$ — Implicit-Transient (`C1_task_089`). The victim tool `convert_datetime` returns semantically corrupted data. The successful model detects the anomaly and retries, obtaining a clean response; the failing model accepts the corrupted output without re-querying and propagates the erroneous result downstream.

**Figure 10.** Case `C1_task_089`, P4 (Implicit-Permanent). The tool `convert_datetime` returns invalid data; gemini-3.1-pro-preview succeeds by aborting the task, whereas claude-sonnet-4-6 fails.

**Figure 11.** $C2\mathcal{P}1$ — **Explicit-Transient** (`C2_task_084`). In a task with an alternative IoT-control path, the victim tool `adjust_temperature` returns an explicit error. The successful model retries and completes all shared downstream steps; the failing model skips the victim entirely, omitting the required shared tool invocations after recovery.

**Figure 12.** Case `C2_task_084`, P2 (Explicit-Permanent). Two models, glm-5.1 and MiniMax-M2.7, attempt to control IoT devices from a user prompt. The figure shows the workflow diagram, a success case for glm-5.1, a failure case for MiniMax-M2.7, and the error output of the perturbed tool call.

**Figure 13.** $C2\mathcal{P}3$ — **Implicit-Transient** (`C2_task_084`). The victim tool returns corrupted sensor data. The successful model detects the semantic inconsistency, retries and recovers; the failing model accepts the corrupted reading and proceeds without re-querying, never invoking the required shared tool `adjust_temperature`.

**Figure 14.** $C2\mathcal{P}4$ — **Implicit-Permanent** (`C2_task_084`). Persistent semantic corruption in the victim tool. The successful model recognises the unrecoverable state and halts after completing reachable shared steps; the failing model propagates the corrupted value silently into `adjust_temperature`.

**Figure 15.** $C3P1$ — Explicit-Transient (`C3_task_040`). In a multi-branch weather-alert task, the victim tool `get_weatherapi_alert_card_native` returns an explicit error. The successful model retries until success and completes a valid path; the failing model neither retries sufficiently nor switches branches, leaving no valid execution path completed.

Explicit‑Permanent (`C3_task_040`): `get_weatherapi_alert_card_native` fails with LicenseExpiredOrInvalid. The successful model abandons the failed branch and completes the task via the VisualCrossing fallback; the failing model remains stuck on the failed tool and never produces a valid path.

**Figure 17.** $C3\mathcal{P}3$ — **Implicit-Transient** (`C3_task_040`). The victim tool returns semantically corrupted weather data. The successful model detects the inconsistency and retries to obtain clean data; the failing model exhibits *sanity ignorance*—it consumes the corrupted output without verification and propagates it downstream.

**Figure 18.** $C3 \mathcal{P}4$ — Implicit-Permanent (`C3_task_040`). The victim tool persistently returns corrupted data. The successful model identifies the irrecoverable corruption and reroutes to a clean alternative branch to complete the task; the failing model never completes any valid path after the victim permanently fails.

**Figure.** Case C4: Explicit-Transient task execution flow. The left panel shows the original task sequence. The top-right panel shows a successful execution by MiniMax-M2.7, and the bottom-right panel shows a failure by qwen3.6-27b. The bottom-left box displays the perturbed output for the '`get_recommendations`' tool.
