AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji

AdaPlanBench evaluates how LLM agents adaptively re-plan when world and user constraints are revealed only upon violation.

How do current LLM agents perform when tasked with adaptive planning under world and user constraints that are only revealed incrementally during interaction?

LLM agents often fail to adapt when real-world tasks impose constraints that are not fully specified upfront, such as hidden tool limitations or evolving user preferences. AdaPlanBench forces agents to discover these constraints through a multi-turn interaction protocol where violations trigger feedback, requiring the agent to iteratively revise its plan. Even the strongest models struggle, with top-performing agents achieving only 67.75% accuracy and performance degrading significantly as the constraint burden accumulates.

Paper Primer

The benchmark uses a scalable pipeline to augment 307 household tasks with dual-constraint profiles. The core mechanism is a dynamic feedback loop: the agent proposes a plan, and if it violates a hidden world or user constraint, the environment discloses the violation, forcing the agent to re-plan under the updated, more restrictive set of requirements.

Adaptive planning under progressively disclosed constraints remains a major failure point for current LLMs.

Across ten leading models, accuracy drops as constraint complexity increases, with most models failing to maintain consistent plan quality over multiple turns. The best-performing model (GPT-5) achieves 67.75% accuracy, while most open-weight models score below 30%.

Explicit constraint tracking and rubric feedback provide only marginal improvements to final task success.

Providing agents with a history of disclosed constraints improves valid plan rates but fails to recover end-task accuracy, as models struggle to integrate these requirements into globally effective plans. Explicit constraint tracking yields less than 3% improvement in accuracy for most models.

Why is this benchmark necessary if existing agent benchmarks already test planning?

Existing benchmarks typically focus on either user preferences or world limitations in isolation, and often provide constraints upfront. AdaPlanBench is unique in requiring agents to handle both simultaneously while discovering them incrementally through interaction.

What does it mean for a model to "fail" in this benchmark?

A failure occurs if the agent terminates early, exceeds the turn budget, or produces a final plan that violates any world or user constraint or fails to meet the rubric-based quality threshold.

Introduction: The Challenge of Adaptive Planning

Identifies the gap in evaluating adaptive planning under dual constraints.

Current interactive benchmarks rarely test agents’ ability to handle constraints that appear only after a plan is proposed. Real‑world tasks involve two intertwined sources of restriction—world constraints (e.g., tool availability) and user constraints (e.g., preferences)—which are often disclosed incrementally during execution.

Agents must generate and revise plans while simultaneously respecting constraints that originate from the environment and from the user, each of which may be revealed only after the agent’s current proposal violates them.

To expose this deficiency, we introduce AdaPlanBench, a dynamic benchmark built on 307 household tasks that injects dual constraints via a scalable construction pipeline. During evaluation, agents interact in a multi‑turn protocol where hidden constraints are revealed only when a proposed plan violates them, forcing continual inference and re‑planning.

Existing Benchmarks and Limitations

We compare AdaPlanBench to prior benchmarks across seven planning properties.

Table 1 contrasts eleven prior benchmarks with AdaPlanBench across seven planning properties.

AdaPlanBench is the only benchmark with full coverage (✓) across all seven properties.

Prior benchmarks typically address either user constraints or world constraints, but not both simultaneously.

The benchmark comparison maps each benchmark to the same set of planning properties, showing which capabilities are fully, partially, or not supported.

Two major challenges motivate this benchmark: (1) constraints are revealed progressively, and (2) the action space is large and open‑ended.

AdaPlanBench evaluates whether agents can adaptively re‑plan as dual constraints emerge during interaction.

AdaPlanBench: Design and Methodology

Constructs a dual‑constraint benchmark and defines the interactive evaluation protocol.

AdaPlanBench packages household‑task queries together with hidden world and user constraints, then forces an agent to discover those constraints through interactive feedback.

How does AdaPlanBench differ from a static benchmark that lists all constraints up front?

In AdaPlanBench constraints are hidden until the agent’s plan triggers a violation, so the agent cannot simply plan against a known checklist; it must infer missing constraints from partial feedback.

Constraints are revealed incrementally: each planning turn uncovers only those constraints that the current plan violates.

Why not simply give the agent the full constraint set at the start?

Providing the full set would eliminate the adaptive challenge; the benchmark’s purpose is to test an agent’s ability to discover and incorporate hidden constraints on the fly.

World constraints describe what the environment permits; user constraints describe what the user prefers.

Can a world constraint ever be overridden by a user constraint?

No; constraints are conjunctive. A plan must respect both sets simultaneously; violating either leads to feedback.

**Figure 1.** Overview of AdaPlanBench. Top: data construction, where dual constraints are constructed for each query. Middle: runtime interaction, where the agent proposes plans, receives feedback on violated constraints, and re-plans iteratively. Bottom: an example trajectory showing how hidden constraints are progressively disclosed during interaction.

Data construction proceeds through a deterministic pipeline that first rewrites raw queries, then iteratively samples plans, extracts constraints, and merges them into a final profile.

Rewrite each raw MacGyver query `q_raw` into a concise household query q using the rewriter model Mrw.

Filter q with binary model Mflt to keep only multi‑step, concrete tasks.

Initialize empty world and user constraint pools $B_w^{(j),0} = \emptyset$, $B_u^{(j),0} = \emptyset$ for each planner sampler $M^{(j)}$.

For r = 1 … R rounds, each planner samples plans $\pi_r^{(j)} = M^{(j)}(q, B_w^{(j),r-1}, B_u^{(j),r-1})$.

Extract tools $T_r^{(j)} = M_{\text{ext}}(q, \pi_r^{(j)})$ and convert them into candidate world constraints $C_w^{(j),r}$ and user constraints $C_u^{(j),r}$.

Merge new candidates with existing pools using Mmerge, producing updated pools $B_w^{(j),r}$ and $B_u^{(j),r}$.

After R rounds, aggregate pools across planners, validate with Mchk, and output the final dual‑constraint profile ($B_w$, $B_u$).

Tool extraction yields $T_1^{(1)}$ = {hair‑dryer}.

World constraint candidate $C_w^{(1),1}$ = “hair‑dryer not plugged in” (environmental limitation).

User constraint candidate $C_u^{(1),1}$ = “avoid high‑heat devices” (preference).

Mmerge adds these to the pools: $B_w^{(1),1}$ = {“hair‑dryer not plugged in”}, $B_u^{(1),1}$ = {“avoid high‑heat devices”}.

This toy run shows how a single plan seeds both a physical limitation and a user preference; because later rounds condition on these pools, subsequent sampling is steered toward alternatives that avoid the unavailable hair‑dryer and other high‑heat devices.
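The construction rounds and the toy run above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the planner, extractor, and merger are hypothetical rule-based stand-ins for the LLM roles, the query string is invented, and the final validation by Mchk is omitted.

```python
# Toy sketch of the sample -> extract -> merge rounds.
# All callables are hypothetical stand-ins for the LLM roles; Mchk validation is omitted.

def induce_constraints(query, planners, extract, merge, rounds=3):
    """Return aggregated (world, user) constraint pools after `rounds` rounds."""
    pools = [(set(), set()) for _ in planners]  # one (B_w, B_u) pair per planner
    for _ in range(rounds):
        for j, planner in enumerate(planners):
            b_w, b_u = pools[j]
            plan = planner(query, b_w, b_u)   # sample a plan given the current pools
            c_w, c_u = extract(query, plan)   # candidate world / user constraints
            pools[j] = (merge(b_w, c_w), merge(b_u, c_u))
    # Aggregate the planner-specific pools into one profile.
    world = set().union(*(b_w for b_w, _ in pools))
    user = set().union(*(b_u for _, b_u in pools))
    return world, user

def toy_planner(query, b_w, b_u):
    # Falls back to a towel once the hair-dryer has been ruled out (toy behavior).
    if "hair-dryer not plugged in" in b_w:
        return ["blot with a towel"]
    return ["dry with the hair-dryer"]

def toy_extract(query, plan):
    # Only the hair-dryer yields constraints in this toy setting.
    if any("hair-dryer" in step for step in plan):
        return {"hair-dryer not plugged in"}, {"avoid high-heat devices"}
    return set(), set()

def toy_merge(pool, candidates):
    # Plain set union; the real merger also canonicalizes wording.
    return pool | candidates

world, user = induce_constraints(
    "dry a wet shirt", [toy_planner], toy_extract, toy_merge, rounds=1
)
```

With one planner and one round, the pools match the toy run: one world constraint and one user constraint seeded by a single hair-dryer plan.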

After constructing the profile, instances are labeled Elow, Emid, or Ehigh according to the total number of constraints; Table 2 shows the average counts.

Runtime interaction loop for a single AdaPlanBench instance.

The loop terminates when a plan satisfies both constraint sets, the turn budget T is exhausted, or two consecutive turns produce no new violations, indicating stagnation.

Each turn also receives a rubric score (1–5) on tool feasibility, physical plausibility, effectiveness, and safety; a plan passes only if every dimension meets the threshold $\gamma$ = 4.
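The termination conditions and the rubric threshold can be illustrated with a small sketch. It makes several assumptions: the agent and judge are hypothetical callables, "no new violations" is read as "every violated constraint was already disclosed", and the judge's full violation set is disclosed each turn (the world-first disclosure rule is ignored here for brevity).

```python
# Sketch of the interaction loop's stopping rules; names are illustrative only.

def passes_rubric(scores, gamma=4):
    """A plan passes only if every rubric dimension meets the threshold."""
    return all(score >= gamma for score in scores.values())

def run_episode(agent, judge, budget):
    """Return (outcome, turns_used) for one instance."""
    disclosed = set()   # constraints revealed so far
    stale_turns = 0     # consecutive turns without a new violation
    for turn in range(1, budget + 1):
        plan = agent(sorted(disclosed))
        violated = judge(plan)
        if not violated:
            return "solved", turn
        stale_turns = 0 if violated - disclosed else stale_turns + 1
        if stale_turns >= 2:
            return "stagnated", turn
        disclosed |= violated
    return "budget_exhausted", budget

# Toy setup: hidden constraints are banned tools; the agent tries candidates in order.
HIDDEN = {"hair-dryer", "oven"}
CANDIDATES = ["use hair-dryer", "use oven", "use towel"]

def toy_judge(plan):
    return {c for c in HIDDEN if c in plan}

def adaptive_agent(feedback):
    return next(p for p in CANDIDATES if not any(c in p for c in feedback))

def stubborn_agent(feedback):
    return "use hair-dryer"
```

In this toy setting the adaptive agent needs three turns to find a valid plan, while the stubborn agent triggers the stagnation rule.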

Experimental Setup and Results

Proprietary LLMs markedly outperform open‑source models, yet all struggle with progressive dual constraints.

Proprietary models achieve a large accuracy lead over open‑source models.

GPT‑5 reaches 67.75 % accuracy while the best open‑source model (Qwen3‑32B) attains only 17.92 % (Table 3).

**Table 3.** AdaPlanBench evaluation results under $\mathcal{E}_{mid}$. Scores in bold and underline indicate the best and second-best performance, respectively. Avg Turns is averaged over all instances, including early-stopped trajectories. We highlight the top two ATWC and ATUC values for comparison, although higher is not always better. Confidence intervals are in Appendix C.8.

Performance Analysis and Sensitivity

Increasing constraints erode model planning quality, with modest mitigation from tracking.

The benchmark evaluates whether planning adapts to both world and user constraints as they emerge progressively during interaction.

Model accuracy declines sharply as the combined constraint burden rises from low to high profiles.

Figure 2 shows a steady drop in both accuracy and valid plan rate across the three environment profiles.

**Figure 2.** Model performance under increasing constraint burden. Performance drops steadily as the environment profile becomes more constrained, suggesting that current models are highly sensitive to growing dual-constraint complexity.

Planning quality deteriorates over interaction turns as constraints are progressively disclosed.

Figure 3 plots rubric scores per turn, revealing consistent declines across most dimensions.

**Figure 3.** Selected model rubric scores across interaction turns under $\mathcal{E}_{mid}$. Performance deteriorates as progressively disclosed constraints accumulate within a trajectory, indicating that models struggle to maintain stable planning quality over interactions.

Appending previously disclosed constraints improves valid plan rate but yields only a small accuracy increase.

Figure 4 compares memory‑off and memory‑on configurations, showing a modest accuracy rise and a larger VPR boost.

**Figure 4.** Model performance under $\mathcal{E}_{mid}$ with additional constraint tracking module. Explicitly providing prior disclosed constraints brings only limited improvement on accuracy.

Table 4 reports average rubric scores under the medium‑constraint profile, highlighting that Effectiveness and Physical dimensions are consistently weaker.

Rubric‑based refinement modestly lifts accuracy but sharply reduces valid plan rate.

Figure 5 shows accuracy rising up to ~10% while VPR falls by up to 40% for open‑source models.

**Figure 5.** Model performance under $\mathcal{E}_{mid}$ with rubric-based refinement. Additional feedback yields only modest recovery and often destabilizes planning.

User‑only constraints cause larger performance degradation than world‑only constraints, and the combined setting is hardest.

Figure 6 shows lower accuracy and VPR for the user‑only and both‑constraints settings.

**Figure 6.** Model performance under $\mathcal{E}_{mid}$ across constraint sources. User constraints cause larger degradation than world constraints, and dual-constraint setting is the hardest.

Effectiveness and physical grounding are the weakest dimensions under accumulated dual constraints.

Table 4 shows lower scores for Effectiveness and Physical across models, and Figures 5 and 6 reflect similar trends.

Limitations and Future Directions

We discuss the current benchmark’s domain, evaluation, modality, and constraint modeling limits.

Limited Domain Coverage. AdaPlanBench is instantiated only in the household domain, which provides a natural testbed but omits phenomena from travel, office workflows, or robotics.

Potential Bias in LLM‑based Evaluation. The benchmark uses LLM judges for constraint checking and rubric scoring, which can introduce model‑specific preferences.

This reliance may systematically favor agents that align with the judge model’s style, obscuring true planning ability and limiting cross‑model comparability.

Text‑only Evaluation Setting. The benchmark evaluates planning solely through textual interaction, excluding visual perception, embodied execution, and real‑world actuation.

Simplified Constraint Modeling. Constraints are represented as object‑based world constraints and attribute‑based user constraints, which lack the compositional richness of real‑world preferences.

Preprint status. The work is currently under review and may evolve before final publication.

Related Work and Formalization

Survey of constraint‑aware planning benchmarks and the environment‑construction pipeline.

Planning under constraints is central to agentic decision making, and existing evaluations have studied constraints from either the world side or the user side. World‑side benchmarks include PDDL constraints (Valmeekam et al., 2023), time/availability constraints (Zheng et al., 2024), workflow rules (Xiao et al., 2024) and API rules (Trivedi et al., 2024); user‑side benchmarks emphasize preferences (Zhao et al., 2025; Guo et al., 2026b), personalization (Jiang et al., 2025a;b) and user intent (Qian et al., 2025a; Wang et al., 2024c).

Recent benchmarks have begun to incorporate both world‑side and user‑side constraints in interactive settings. CostBench (Liu et al., 2025a) considers dual constraint types but only with upfront constraints, while FlowBench (Xiao et al., 2024) remains workflow‑centric and covers a limited scope. Other benchmarks model progressively elicited user preferences but often assume limited action spaces (Xie et al., 2024; Luo et al., 2025) or lack scalable constraint construction (Yao et al., 2024).

In most existing settings constraints are provided proactively by the environment rather than uncovered through the agent’s own exploration. These settings also omit iterative replanning, which repeatedly collapses the current plan and forces generation of a new one. Consequently they fail to fully evaluate partially observed, open‑ended adaptive planning under scalable dual constraints.

A parallel body of work develops methods for improving constraint‑aware planning in LLM agents. World‑side methods study state grounding (Kim et al., 2025), localized violation correction (Kumar & Cohen, 2026), experience‑based world‑model refinement (Lee et al., 2026), plan‑quality improvement through training (Erdogan et al., 2025) and symbolic planning enforcement (Malfa et al., 2025). User‑side work focuses on preference elicitation (Qian et al., 2025b; Dou & Liu, 2025) and proactive clarification or personalization during execution (Zhang et al., 2024; Sun et al., 2025; Huang et al., 2025a).

More recent approaches jointly handle world and user constraints via reflective prompting (Guo et al., 2025b), multi‑agent coordination (Choi et al., 2025), executable constraint checking (Deik et al., 2026) and hierarchical control (Bui et al., 2026). However, these methods largely assume constraints are available upfront, not emerging progressively, and they do not account for iterative environmental interventions that disrupt the plan and require repeated replanning.

We now formalize the environment‑construction pipeline used to generate benchmark instances.

Each instance is built via a multi‑agent framework with specialized roles: query rewriting (Mrw), binary filtering (Mflt), planner samplers ($M_j^{\text{plan}}$), constraint extraction (Mext), merging (Mmerge) and checking (Mchk).

For every retained instance we create three hierarchical environment profiles (Elow, Emid, and Ehigh), each containing a world‑constraint set ($B_{w,*}$) and a user‑constraint set ($B_{u,*}$).

These profiles are generated through R = 3 iterative rounds of constraint induction, where round 1 yields Elow, round 2 yields Emid, and round 3 yields Ehigh.

Data Construction Details

Describes how hierarchical constraint pools are built via iterative multi‑planner sampling.

Initialize each planner’s world‑side and user‑side constraint pools $\tilde{B}_j^{w,0}$ and $\tilde{B}_j^{u,0}$ as empty sets.

For round $r = 1,2,3$, each planner $j$ samples $K$ candidate plans $\{\pi_j^{x,r,k}\}_{k=1}^K$ from its model $M_j$ conditioned on the query $q$ and the current pool $\tilde{B}_j^{x,r-1}$.

World‑side plans yield constraint candidates $C_j^{w,r,k}=M_{\text{ext}}(\pi_j^{w,r,k}, q, g)$, where the extractor discards objects already mentioned in the query or reference solution.

User‑side plans first produce tool sets $T_j^{u,r,k}=M_{\text{ext}}(\pi_j^{u,r,k})$; then user‑constraint candidates are inferred as $C_j^{u,r,k}=M_{\text{ext}}(\pi_j^{u,r,k}, T_j^{u,r,k}, q, g)$, excluding preferences that would invalidate the reference solution.

All per‑plan candidates for a given constraint type are combined by set union: $C_j^{x,r}= \bigcup_{k=1}^{K} C_j^{x,r,k}$ for $x\in\{w,u\}$.

The new candidates are merged into the planner‑specific pool: $\tilde{B}_j^{x,r}\leftarrow M_{\text{merge}}(\tilde{B}_j^{x,r-1}\cup C_j^{x,r})$, which canonicalizes and deduplicates constraints.

After three rounds, the accumulated pools $\tilde{B}_j^{w,3}$ and $\tilde{B}_j^{u,3}$ constitute the hierarchical environment profiles used for downstream validation.
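The aggregation and merge steps can be sketched as set operations, treating each per-plan candidate collection as a set. This is only an illustration: in the benchmark the merger is an LLM, whereas the `canonicalize` helper below is a hypothetical string normalizer used as a stand-in.

```python
# Sketch of per-round aggregation and merging; `canonicalize` is a hypothetical
# stand-in for the LLM-based merger's canonicalization.

def canonicalize(constraint):
    """Lowercase and collapse whitespace so trivial variants coincide."""
    return " ".join(constraint.lower().split())

def union_candidates(per_plan_candidates):
    """Combine the K per-plan candidate sets into one round-level set."""
    combined = set()
    for candidates in per_plan_candidates:
        combined |= candidates
    return combined

def merge_pool(pool, candidates):
    """Merge new candidates into the pool, deduplicating canonical forms."""
    return {canonicalize(c) for c in pool | candidates}

round_candidates = union_candidates([{"No hair-dryer"}, {"no  hair-dryer ", "No oven"}])
pool = merge_pool({"no hair-dryer"}, round_candidates)
```

Here three surface variants of the same hair-dryer constraint collapse into one pool entry.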

Construction and Validation

Describes how constraint pools are merged, validated, and expanded across rounds.

After all planners finish a round, we aggregate the planner‑specific pools and validate the merged result. For each constraint type we compute a retention ratio and then map the validated outputs to three hierarchical environment profiles.

We first pool all constraints discovered by the planners in a round, then prune the pool to keep only constraints that are specific enough and compatible with the reference solution.

Why not keep all constraints from the raw pool?

Retaining every constraint would introduce vague or contradictory items that make the planning problem unsolvable; the checker ensures every remaining constraint can be satisfied together with the reference solution.
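A minimal sketch of this pruning step follows, assuming the retention ratio is simply the share of pooled constraints that survive the checker. The `is_valid` predicate and the "VAGUE:" marker are hypothetical placeholders for the LLM checker's judgment.

```python
# Sketch of checker-based pruning; `is_valid` is a hypothetical stand-in for Mchk.

def validate_pool(pool, is_valid):
    """Keep constraints accepted by the checker and report the retention ratio."""
    kept = [c for c in pool if is_valid(c)]
    ratio = len(kept) / len(pool) if pool else 0.0
    return kept, ratio

def toy_check(constraint):
    # Toy rule: reject constraints flagged as vague.
    return not constraint.startswith("VAGUE:")

kept, ratio = validate_pool(
    ["no hair-dryer", "VAGUE: be careful", "no oven", "VAGUE: do it well"],
    toy_check,
)
```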

Multiple planners explore the solution space simultaneously, each generating several candidate plans in one pass.

How does parallel sampling differ from simply running one planner longer?

Running one planner longer only deepens its own bias; parallel sampling brings in qualitatively different biases from other planners, exposing constraint patterns that a single planner would never generate.

After each round we feed the validated constraints back into the planners, forcing them to avoid previously discovered strategies.

What would happen if we omitted the iterative feedback?

Without conditioning on previously found constraints, planners would keep rediscovering the same easy constraints, and the later rounds would add little new information, limiting the richness of the final environment profiles.

Combining parallel breadth with iterative depth yields three hierarchical profiles $E_{\text{low}}$, $E_{\text{mid}}$, $E_{\text{high}}$ that are increasingly rich yet remain self‑consistent and solvable.

Temperature Ablation Results

Ablation results show temperature has minor effect while model strength dominates performance.

Unless a model requires otherwise, experiments fix the decoding temperature at 0.0 and set the maximum completion length to 16 000 tokens. For GPT‑5 series models the temperature is locked at 1.0, and results are averaged over three runs with variance below 3 %. Open‑source models are evaluated on four NVIDIA H100 GPUs.

Data construction uses planner samplers based on GPT‑4.1, DeepSeek‑V3.2, and Qwen3.6‑Flash, while a strong GPT‑5.4 model ($M_{\text{chk}}$) filters invalid constraints. Evaluation judges (world‑ and user‑constraint) also rely on GPT‑5.4, and rubric‑based scoring employs the same three models as independent judges.

The temperature ablation shows only modest swings (≤ 3 %) in both accuracy and valid plan rate, suggesting that the performance differences between models stem from model capability rather than decoding settings.

**Table 5.** Ablation results under different decoding temperatures. $\Delta_{max}$ denotes the maximum performance difference across temperatures for each model, computed as the difference between the largest and smallest values among the tested temperatures.
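As a small worked example of the $\Delta_{max}$ definition in the caption, the sketch below takes the spread between the largest and smallest score across temperatures. The scores are made-up placeholders, not values from Table 5.

```python
# Delta_max as defined in the Table 5 caption; the scores below are invented.

def delta_max(scores_by_temperature):
    """Largest minus smallest score across the tested temperatures."""
    values = list(scores_by_temperature.values())
    return max(values) - min(values)

example_scores = {0.0: 60.0, 0.5: 61.5, 1.0: 59.0}  # hypothetical accuracies (%)
spread = delta_max(example_scores)
```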

Filter‑model validation reveals that $M_{\text{chk}}$ discards 42.18 % of constraints, with a false‑negative rate of 2.31 % and a false‑positive rate of 3.72 %.

Runtime judge validation shows the LLM judges match human majority labels on 89.76 % of turns, deviating by at most one constraint in 161 of 166 turns, confirming reliable evaluation.

Runtime Interaction Details

Describes how constraint violations are prioritized and turned into user feedback during planning.

At turn $t$, the agent proposes a plan $p_t$ and the judges identify the violated world constraints $V_w^t\subseteq B_w$ and user constraints $V_u^t\subseteq B_u$. The revealed constraint set $\hat{V}_t$ is chosen by a single‑type rule: if $V_w^t$ is non‑empty we reveal $V_w^t$, otherwise we reveal $V_u^t$ (or nothing if both are empty).

World constraints are prioritized because they encode hard feasibility conditions such as missing tools or unavailable materials, making a plan instantly infeasible in reality. User constraints are softer preferences that can be negotiated, so they are only disclosed when no world violations exist.

Once $\hat{V}_t$ is selected, it is handed to the user simulator $M_{\text{user}}$, which rewrites the constraint items into explicit feedback for the next turn. If the agent repeats a previously disclosed violation, $M_{\text{user}}$ explicitly reminds the agent; violations of the non‑selected type are withheld until they become the highest‑priority set.

The feedback rule does not alter the underlying constraint checking: each turn the plan is still evaluated against the full profile $E=(B_w,B_u)$. Repeated‑violation metrics are computed over the disclosed history, counting a violation only when it reappears after being revealed.
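The world-first disclosure rule and the repeated-violation count can be sketched as follows. This is an illustrative reading of the description above: constraint names are invented, and a violation is counted as repeated only if it was revealed in an earlier turn.

```python
# Sketch of single-type disclosure and repeated-violation counting (toy data).

def select_revealed(v_world, v_user):
    """Reveal world violations if any exist, otherwise user violations."""
    return set(v_world) if v_world else set(v_user)

def count_repeated(turns):
    """Count violations that reappear after having been disclosed.

    `turns` is a list of (v_world, v_user) pairs judged against the full profile.
    """
    disclosed = set()
    repeated = 0
    for v_world, v_user in turns:
        violated = set(v_world) | set(v_user)
        repeated += len(violated & disclosed)   # only previously revealed items count
        disclosed |= select_revealed(v_world, v_user)
    return repeated

trajectory = [
    ({"w1"}, {"u1"}),  # turn 1: w1 revealed, u1 withheld
    ({"w1"}, {"u1"}),  # turn 2: w1 repeats; u1 still undisclosed
    (set(), {"u1"}),   # turn 3: no world violation, so u1 is revealed
]
repeats = count_repeated(trajectory)
```

In this trajectory only the second occurrence of `w1` counts as repeated, because `u1` had not been disclosed when it recurred.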

Evaluation Rubrics

We evaluate plans with rubric scores and track constraint disclosures across interaction turns.

The evaluation pipeline scores each generated plan on the rubric dimensions defined in Table 6 and records which constraints become disclosed during interaction.

**Table 6.** Definitions of evaluation rubrics.

Instead of a binary pass/fail, each plan is graded on multiple quality axes, giving a nuanced picture of its overall suitability.

Prompting Strategy

This section defines the formal criteria for rubric-based evaluation and constraint satisfaction in AdaPlanBench.

To evaluate agent performance, we define two primary success conditions: rubric-based quality assessment and strict constraint satisfaction. These indicators determine whether a generated plan is both high-quality and valid within the task environment.

A plan is considered successful only if it meets a minimum quality threshold across all defined dimensions, as judged by an ensemble of evaluators.

A plan is valid only if it leaves no world or user constraints violated at the terminal turn.

Finally, we define Accuracy (Acc.) as the joint success of these two conditions. A plan is accurate if and only if it satisfies all constraints ($\text{ConPass} = 1$) and meets the quality threshold ($\text{RubPass} = 1$) at the terminal turn.
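These success conditions can be written down as a short sketch. It assumes the ensemble of rubric judges has already been aggregated into one score per dimension, uses the threshold of 4 stated earlier, and treats the valid plan rate as the share of episodes with ConPass = 1; the `Episode` record is a hypothetical container.

```python
# Sketch of ConPass, RubPass, and Accuracy; `Episode` is a hypothetical record.
from dataclasses import dataclass

@dataclass
class Episode:
    terminal_violations: int   # world + user constraints violated at the last turn
    rubric_scores: dict        # dimension name -> aggregated judge score

def con_pass(ep):
    return ep.terminal_violations == 0

def rub_pass(ep, gamma=4):
    return all(score >= gamma for score in ep.rubric_scores.values())

def accuracy(episodes):
    """Fraction of episodes with ConPass = 1 and RubPass = 1."""
    return sum(con_pass(e) and rub_pass(e) for e in episodes) / len(episodes)

def valid_plan_rate(episodes):
    """Fraction of episodes whose terminal plan violates no constraint."""
    return sum(con_pass(e) for e in episodes) / len(episodes)

episodes = [
    Episode(0, {"effectiveness": 4, "safety": 5}),  # accurate
    Episode(0, {"effectiveness": 3, "safety": 5}),  # valid, but fails the rubric
    Episode(1, {"effectiveness": 5, "safety": 5}),  # violates a constraint
]
```

The gap between the two rates in this toy data mirrors the pattern reported in the results: a plan can satisfy every constraint and still fall short on quality.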

Constraint Tracking Metrics

Defines the core metrics used to evaluate constraint tracking performance.

This section formalizes the quantitative measures that capture how well an agent tracks and respects both world and user constraints during planning.

Acc reports the fraction of episodes where the agent's terminal plan simultaneously satisfies all world and user constraints and meets the rubric criteria.

VPR measures how often an agent’s generated trajectory ends in a plan that obeys all constraints.

Avg Turns captures the typical length of the interaction needed to reach a plan.

Average World Repeated Violations (AWRV) quantifies how often an agent violates world constraints again after they have already been disclosed.

Rubric Refinement Metrics

Defines the AURV metric and enumerates the prompt templates used for feedback.

Average User Repeated Violations (AURV) quantifies how often an agent disregards user constraints it has already been told to respect. It is computed by first averaging repeated violations within each episode and then averaging those episode‑level scores across the whole dataset.
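The two-level averaging can be sketched as below. The source does not spell out the unit of the within-episode average, so this sketch assumes per-turn counts of repeated user violations; the numbers are invented.

```python
# Sketch of the two-level average behind AURV; per-turn counts are an assumption.
from statistics import mean

def aurv(episodes):
    """`episodes` is a list of per-turn repeated-violation counts, one list per episode."""
    per_episode = [mean(turn_counts) if turn_counts else 0.0 for turn_counts in episodes]
    return mean(per_episode) if per_episode else 0.0

toy_counts = [
    [0, 1, 1],  # episode 1: two repeats over three turns
    [0, 0],     # episode 2: no repeats
    [2],        # episode 3: two repeats in a single turn
]
score = aurv(toy_counts)
```

Averaging within each episode first keeps long trajectories from dominating the dataset-level score.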

The evaluation harness uses five prompt templates: a User Simulator prompt, a World‑constraint judge prompt, a User‑constraint judge prompt, an Agent runtime prompt, and a Rubrics Judge prompt. Each template supplies the next turn’s feedback to the planning agent, grounding the metric calculations in concrete interaction data.

Statistical Confidence

We quantify the uncertainty of each model’s accuracy with 95 % confidence intervals.

To assess how reliably each model solves the planning tasks, we compute a two‑sided 95 % Wald confidence interval for its accuracy over the 307 sampled trajectories.
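For reference, a Wald interval for a proportion $\hat{p}$ over $n$ trials is $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ with $z \approx 1.96$ at the 95 % level. The sketch below applies this standard formula; plugging in the 67.75 % accuracy reported for the best model is illustrative arithmetic, not a value taken from the paper's appendix.

```python
# Standard two-sided Wald interval for a proportion (illustrative use only).
import math

def wald_ci(p_hat, n, z=1.959964):
    """Return (lower, upper) bounds, clipped to [0, 1]."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

lower, upper = wald_ci(0.6775, 307)
```

With these inputs the half-width is a little over five percentage points, so the interval runs from roughly 62.5 % to 73.0 %.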
