Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Jiachun Li, Zhuoran Jin, Tianyi Men, Yupu Hao, Kejian Zhu, Lingshuai Wang, Dongqi Huang, Longxiang Wang, Shengjia Hua, Lu Wang, Jinshan Gao, Hongbang Yuan, Ruilin Xu, Kang Liu, Jun Zhao

A systematic survey of agentic environment engineering, mapping the lifecycle of modeling, synthesis, and co-evolution.

How do we systematically engineer, synthesize, and evolve the environments that LLM-based agents interact with to improve their capabilities?

Large Language Model (LLM) agents require interactive environments to acquire skills, yet these systems lack a unified framework for design and evaluation. This survey categorizes agentic environments across their lifecycle, defining a taxonomy of eight attributes and eight domains while detailing paradigms for automated synthesis and agent-environment co-evolution. The authors identify a critical gap in current research: the need to bridge the reliability of symbolic systems with the generative scalability of neural world models.

Paper Primer

The paper frames environment engineering as the "inseparable twin" of agent development, moving beyond static datasets to closed-loop systems. It introduces a lifecycle-based taxonomy that covers how environments are modeled, synthesized, and used to drive agent evolution.

The core contribution is a systematic classification of environment engineering: it maps existing work into eight attributes (e.g., Symbolic vs. Neural, Online vs. Offline) and eight domains (e.g., GUI, Embodied, Code). The authors propose that future progress hinges on "Neural-Symbolic Environments," which combine the verifiable logic of code-based systems with the flexibility of neural world models.

Existing environment research is heavily skewed toward single-agent settings, leaving multi-agent cooperative and competitive environments under-explored.

Comparative analysis of current benchmarks across eight domains (GUI, Embodied, Game, etc.) reveals a lack of frameworks capable of handling collective intelligence and parallel agent interaction. The significance of this gap is high: it identifies a primary bottleneck for scaling agents to complex, real-world social or collaborative tasks.

Why is "environment engineering" necessary if we already have massive, high-quality datasets?

Traditional data engineering is an open-loop system that treats models as passive recipients of static trajectories. Environments enable closed-loop interaction, allowing agents to receive real-time feedback, correct errors, and adapt task difficulty to their current proficiency level.

What is the fundamental trade-off in current environment synthesis?

The field is currently split between symbolic synthesis (using code/rules for reliability) and neural synthesis (using world models for generative scalability). The former is stable but inflexible, while the latter is open-ended but often inconsistent or uncontrollable.

Researchers should shift focus from agent-centric reasoning to environment-centric engineering, specifically targeting the integration of symbolic reliability and neural scalability to enable autonomous, long-horizon agent evolution.

The Landscape of Agentic Environments

Introducing agentic environment engineering: dynamic, interactive systems that replace static data curation.

Large language models now exhibit agentic capabilities such as tool use, long‑horizon planning, and self‑improvement. Deploying these agents in the real world is prohibitive because of safety, cost, and reproducibility concerns. Simulated, interactive environments therefore become the essential testbed for continual capability growth.

Agentic environment engineering is the systematic design of dynamic, interactive simulation systems that replace static data pipelines as the primary driver of agent capability growth.

**Fig. 1.** An overview of agentic environment engineering.

Guided by three research questions, we first catalog environment attributes and domains, then examine symbolic and neural synthesis pipelines, and finally explore how environments catalyze agent and environment evolution. The survey reveals gaps—particularly in diversity, complexity, and fidelity evaluation—that motivate future work.

The core shift is moving from static data curation to interactive, automatically generated environments that continuously drive agent capability.

Formalizing the Agent-Environment Interaction

Defines POMDP environments and agents, and maps the three evolution camps into a taxonomy.

We first formalize the interaction between an agent and its world as a Partially Observable Markov Decision Process (POMDP), then organize the three evolution camps—Environment Engineering, Agent Evolution, and Environment Evolution—into a concise taxonomy.

A POMDP captures a stochastic system that an agent perceives only through noisy observations, providing the formal backbone for any interactive environment.

An agent is a decision‑making policy that maps its interaction history to a distribution over actions.
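As a rough illustration of these two definitions, the sketch below pairs a tiny POMDP with a history-conditioned policy. The `NoisyCorridor` environment, its noise level, and the hand-written `policy` rule are hypothetical and not taken from the survey; they only show how a hidden state, noisy observations, and a distribution over actions fit together.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class NoisyCorridor:
    """Hypothetical POMDP: a hidden position on a line, observed through noise."""
    length: int = 5
    noise: float = 0.2  # chance that an observation is off by one cell
    state: int = 0      # hidden state s_t, never handed to the agent
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def _clamp(self, x: int) -> int:
        return max(0, min(self.length, x))

    def observe(self) -> int:
        # o_t is a noisy projection of the hidden state.
        if self.rng.random() < self.noise:
            return self._clamp(self.state + self.rng.choice([-1, 1]))
        return self.state

    def step(self, action: str) -> tuple[int, float, bool]:
        # Transition: move one cell left or right, clipped to the corridor.
        self.state = self._clamp(self.state + (1 if action == "right" else -1))
        done = self.state == self.length
        return self.observe(), (1.0 if done else 0.0), done


def policy(history: list[tuple[str | None, int]], goal: int) -> dict[str, float]:
    """Map the interaction history to a distribution over actions."""
    last_obs = history[-1][1]  # a fuller agent would aggregate the whole history
    p_right = 0.9 if last_obs < goal else 0.1
    return {"right": p_right, "left": 1.0 - p_right}


def rollout(env: NoisyCorridor, max_steps: int = 50, seed: int = 1):
    rng = random.Random(seed)
    history = [(None, env.observe())]  # (action, observation) pairs
    total = 0.0
    for _ in range(max_steps):
        dist = policy(history, env.length)
        action = rng.choices(list(dist), weights=list(dist.values()))[0]
        obs, reward, done = env.step(action)
        history.append((action, obs))
        total += reward
        if done:
            break
    return total, history
```

Calling `rollout(NoisyCorridor())` returns the total reward and the full interaction history. The policy never reads `env.state`; it only sees the observations recorded in that history.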

Alignment means shaping the environment so that the agent’s optimal policy $\pi^{*}$ yields high expected return.

PPO stabilizes policy updates by clipping the probability ratio between new and old policies.

GRPO replaces the critic with a sample‑based baseline, normalizing advantages across a group of outputs.

DAPO refines GRPO by using separate upper and lower clipping bounds and applying the loss at the token level.
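For readers who want the three update rules side by side, here is a minimal sketch of the objectives as described above. The function names, the per-sample scalar form, and the default clipping values are illustrative assumptions rather than reference implementations; real trainers operate on batched log-probabilities.

```python
from statistics import mean, pstdev


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def ppo_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate (to be maximized): min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage)


def grpo_advantages(rewards: list[float], tiny: float = 1e-8) -> list[float]:
    """Critic-free baseline: normalize each reward by the group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + tiny) for r in rewards]


def dapo_term(ratio: float, advantage: float,
              eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Same surrogate, but with separate lower and upper clipping bounds."""
    return min(ratio * advantage,
               clip(ratio, 1 - eps_low, 1 + eps_high) * advantage)


def token_level_objective(ratios: list[list[float]], advantages: list[float]) -> float:
    """Average the clipped term over every token of every sampled output.

    ratios[i][t] is the probability ratio for token t of output i;
    advantages[i] is shared by all tokens of output i.
    """
    terms = [dapo_term(r, a) for rs, a in zip(ratios, advantages) for r in rs]
    return sum(terms) / len(terms)
```

With a ratio of 1.25 and a positive advantage, `ppo_term` clips at 1.2 while `dapo_term` lets the full 1.25 through, which is the practical effect of the wider upper bound.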

Environment engineering replaces static datasets with interactive worlds that adapt to the agent’s capabilities.

Collaborative evolution lets the environment co‑evolve with the agent, adjusting task difficulty in real time.

Multi‑turn environments let agents decompose problems, invoke tools, and receive immediate feedback.

Closed‑loop environments couple the agent’s actions to subsequent state changes, enabling self‑correction.

We map the full lifecycle of environments—modeling, construction, evaluation, application—complementing agent‑centric surveys.

Environment Engineering: the design of interactive worlds that generate adaptive experiences for agents.

Agent Evolution: the progressive improvement of the agent’s internal policy, architecture, or reasoning capabilities.

Environment Evolution: the systematic transformation of the environment itself, e.g., scaling difficulty or introducing new dynamics.

**Figure.** A hierarchical taxonomy of evolution methods in AI, categorized into Agent Evolution (§6) and Environment Evolution (§7). Agent Evolution is further subdivided into Memory-Centric, Orchestration-Centric, Trajectory-Centric, and Exploration-Centric branches, each containing specific sub-methods (e.g., Instance Trajectory Experience, Fixed Workflow, Task Synthesis, Reasoning Structure) with associated research citations. Environment Evolution is divided into Neural-Driven, Difficulty-Driven, and Scaling-Driven branches, detailing sub-methods like Self-Play, Explicit Curriculum Signals, and Scenario-Level Scaling with their respective citations.

**Fig. 4.** A comparison between data engineering and environment engineering.

Taxonomy of Environment Attributes

Defines the core environment attributes that shape agent perception and decision-making.

Attributes are binary properties of an environment that together determine what the agent can observe, how it can act, and how the world evolves in response.

How do these attribute pairs combine to affect an agent’s learning dynamics?

Each attribute imposes a distinct constraint: e.g., a Neural, Closed‑Loop, Online, POMDP, Nondeterministic, Continuous, Multimodal, Multi‑Agent environment forces the agent to learn from noisy, high‑dimensional feedback while coordinating with peers, dramatically increasing sample complexity compared to a Symbolic, Open‑Loop, Offline, MDP, Deterministic, Discrete, Unimodal, Single‑Agent setting.

Symbolic: transitions are defined by explicit programmed logic (e.g., PDDL, physics engines) that deterministically update the state.

Neural: transitions are approximated by a learned model $P_{\theta}$, typically a neural network that predicts the next state.

Open‑Loop: the agent receives only an initial observation and follows a pre‑computed action sequence without further feedback.

Closed‑Loop: the agent continuously incorporates new observations $o_t$ to adapt its actions, forming a reactive feedback loop.

Online: the agent interacts with the live system $E$, receiving real‑time observations after each action.

Offline: interaction is prohibited; the agent is evaluated on a static dataset of pre‑collected trajectories.

MDP: the observation space equals the state space ($\Omega = S$), giving the agent full access to the underlying state.

POMDP: observations are partial or noisy projections of the true state, requiring the agent to aggregate history.
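The difference between the open-loop and closed-loop settings described above shows up as soon as the world deviates from the plan. The sketch below uses a hypothetical `SlipperyLine` world, in which one action silently fails at a fixed step, to contrast a pre-computed action sequence with a policy that re-reads the observation $o_t$ at every step. All names and parameters are assumptions made for illustration.

```python
class SlipperyLine:
    """Hypothetical 1-D world: 'right' moves +1, except on slip steps, where it does nothing."""

    def __init__(self, slip_steps=(2,)):
        self.pos = 0
        self.t = 0
        self.slip_steps = set(slip_steps)

    def step(self, action: str) -> int:
        if action == "right" and self.t not in self.slip_steps:
            self.pos += 1
        self.t += 1
        return self.pos  # observation returned after the action


def run_open_loop(goal: int = 5) -> int:
    env = SlipperyLine()
    plan = ["right"] * goal  # fixed once, from the initial observation (pos = 0)
    for action in plan:
        env.step(action)     # returned observations are ignored
    return env.pos


def run_closed_loop(goal: int = 5, budget: int = 10) -> int:
    env = SlipperyLine()
    obs = env.pos
    for _ in range(budget):
        if obs >= goal:
            break
        obs = env.step("right")  # act on the latest observation
    return obs
```

With the default slip at step 2, `run_open_loop()` ends one cell short at position 4, while `run_closed_loop()` spends one extra step and reaches 5.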

**Fig. 5.** An overview of environment attributes.

Attributes dictate what information the agent receives and how it can influence the world, directly shaping perception and decision‑making.

Diverse Domains for Agent Evaluation

Map of environment domains and their distinct agent demands.

GUI: agents interact with graphical user interfaces, requiring perception of visual elements and sequential action execution.

Deep Research: agents continuously retrieve, synthesize, and report evidence from heterogeneous sources to answer research questions.

Embodied: agents act as robots or avatars in 3D worlds, perceiving, moving, and manipulating objects.

Game: agents operate within rule‑based virtual worlds, balancing strategy, reasoning, and interaction.

Tool: agents invoke external functions, APIs, or services to acquire information or perform actions.

Code: agents work with source code, repositories, tests, and execution feedback to generate, understand, verify, or debug software.

Domain‑Specific: agents operate in specialized professional settings, needing domain knowledge and adherence to sector‑specific constraints.

The eighth domain requires agents to generalize across heterogeneous tasks and environments, testing transferability and robustness.

Different domains stress distinct capabilities: GUI needs robust visual parsing, Deep Research demands persistent retrieval and synthesis, Embodied requires tight perception‑action loops, Game environments test strategic planning, Tool and Code domains hinge on reliable API or compiler interaction, and Domain‑Specific benchmarks impose expert knowledge constraints. A universal agent must therefore expose modular subsystems that can be activated or swapped according to these demands.

**Fig. 6.** An overview of environment domains, including GUI, Deep Research, Embodied, Game, Tool, Code, and Domain-Specific.

Automated Environment Synthesis

We shift from static data to dynamic, scalable environment synthesis that drives agent capability.

Agentic environment engineering replaces hand‑crafted data with automatically generated training worlds, enabling agents to learn from ever‑larger, more diverse interactive settings.

Environment synthesis automatically creates training environments—either by executing symbolic rules or by training neural models—so that agents can practice in scalable, verifiable, and diverse worlds.

How does verification differ between symbolic and neural synthesis?

Symbolic pipelines run executable unit tests on generated code, confirming that every tested transition obeys the prescribed rule set. Neural pipelines instead evaluate consistency by checking that predicted observations match a held‑out validation set or that a learned verifier assigns high confidence to generated trajectories.

Symbolic synthesis generates environments by composing explicit code or rule sets; verification is performed through execution tests.

Neural synthesis learns a world model that predicts environment dynamics; verification relies on learned consistency metrics.

Define the environment tuple $E = \langle S, A, P, R\rangle$ where $S$ and $A$ are state and action spaces.

Implement transition function $P\!:\!S\times A\!\rightarrow\!S$ and reward function $R$ as executable code $C$.

Choose a synthesis type: (1) Task‑Driven, (2) Real‑World‑Driven, or (3) De Novo.

Generate concrete environments by applying the selected type’s construction rules.

Run automated unit‑test suites to verify correctness and filter out failing instances.
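The steps above can be sketched as a generate-then-verify loop. In this hypothetical example, the candidate transition functions are hard-coded strings standing in for generated code $C$, and the state space is a counter bounded to 0..3. None of these names come from the survey, and a real pipeline would run untrusted candidates in a sandbox rather than calling `exec` directly.

```python
# Candidate environments as source code; in practice these would be generated.
CANDIDATES = {
    "clamped": """
def transition(s, a):
    return min(3, s + 1) if a == 'inc' else max(0, s - 1)
""",
    "unclamped": """
def transition(s, a):
    return s + 1 if a == 'inc' else s - 1
""",
}

STATES = range(4)          # S = {0, 1, 2, 3}
ACTIONS = ("inc", "dec")   # A


def load(source: str):
    """Execute candidate code and return its transition function P."""
    namespace = {}
    exec(source, namespace)  # assumption: trusted or sandboxed input
    return namespace["transition"]


def passes_unit_tests(transition) -> bool:
    """Check closure (P stays inside S) and determinism on every (s, a) pair."""
    try:
        closed = all(transition(s, a) in STATES for s in STATES for a in ACTIONS)
        deterministic = all(
            transition(s, a) == transition(s, a) for s in STATES for a in ACTIONS
        )
        return closed and deterministic
    except Exception:
        return False  # a crashing candidate is filtered out


def synthesize(candidates: dict[str, str]) -> list[str]:
    """Keep only the candidates whose code passes the test suite."""
    return [name for name, src in candidates.items() if passes_unit_tests(load(src))]
```

Here `synthesize(CANDIDATES)` keeps only `"clamped"`, because the unclamped variant leaves the state space when incrementing from 3 or decrementing from 0.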

Initial state $s_0=(0,0)$.

Agent takes action “right” → $P(s_0,\text{right})=(1,0)$.

Agent takes action “up” → $P((1,0),\text{up})=(1,1)$.

Agent takes action “right” → $P((1,1),\text{right})=(2,1)$.

Agent takes action “up” → $P((2,1),\text{up})=(2,2)$, reward $+1$.

This toy example shows how explicit code guarantees that every transition is deterministic and testable, a property that neural models must approximate.
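The toy trajectory above can be replayed in a few lines. This sketch assumes the goal cell is $(2,2)$ and that reward is given only on reaching it, which is how the example reads; the names `P`, `R`, and `replay` are chosen here for illustration.

```python
MOVES = {"right": (1, 0), "up": (0, 1)}
GOAL = (2, 2)  # assumption inferred from where the example grants its reward


def P(state: tuple[int, int], action: str) -> tuple[int, int]:
    """Deterministic transition: add the action's offset to the state."""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


def R(state: tuple[int, int]) -> int:
    """Reward of +1 at the goal cell, 0 elsewhere."""
    return 1 if state == GOAL else 0


def replay(s0: tuple[int, int], actions: list[str]):
    """Apply the actions in order and return the visited states and final reward."""
    trace, s = [s0], s0
    for a in actions:
        s = P(s, a)
        trace.append(s)
    return trace, R(s)
```

Because `P` is plain code, each step of the example becomes a one-line assertion, for instance `P((0, 0), "right") == (1, 0)`.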

**Fig. 7.** Three symbolic environment synthesis methods are presented: Task-Driven Synthesis, Real-World-Driven Synthesis, and De Novo Synthesis. From left to right, the methods offer increasing degrees of freedom and require more verification logic.

Table 5 lists representative symbolic synthesis works, indicating their modality (text, image, video), underlying architecture (LLM or VLM), and evaluation criteria such as correctness and diversity.

Collect a dataset of environment observations (pixels, text, or multimodal trajectories).

Choose a representation level: pixel, word, or latent.

Train a world‑model network to predict the next observation conditioned on the current state and agent action.

Optionally fine‑tune a pretrained foundation model to accelerate learning.

Validate the model using consistency metrics (e.g., FID for pixels, semantic similarity for words) and task‑completion tests.

Encode the current frame $I_t$ to latent $z_t$ (dim = 128).

Embed action $a_t$ to vector $e_t$ (dim = 16) and concatenate: $[z_t; e_t]$.

Decode concatenated vector to produce predicted frame $\hat{I}_{t+1}$.

Compute pixel‑wise L2 loss against ground‑truth $I_{t+1}$ and back‑propagate.

Because the model operates directly on pixels, it preserves fine‑grained visual detail but incurs high computational cost.
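A deliberately small, dependency-free sketch of the forward pass and L2 loss in this example is given below. The 8x8 frame size, the four-action vocabulary, the linear layers, and the `tanh` nonlinearity are assumptions; only the 128- and 16-dimensional sizes come from the text. For brevity the update step adjusts the decoder weights only, a simplification of full back-propagation through the encoder and action embedding.

```python
import math
import random

rng = random.Random(0)
N_PIXELS, Z_DIM, E_DIM, N_ACTIONS = 64, 128, 16, 4  # 8x8 frame and 4 actions are assumptions


def rand_matrix(rows: int, cols: int, scale: float = 0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


W_enc = rand_matrix(Z_DIM, N_PIXELS)          # encoder: frame I_t -> z_t
E_act = rand_matrix(N_ACTIONS, E_DIM)         # embedding table: a_t -> e_t
W_dec = rand_matrix(N_PIXELS, Z_DIM + E_DIM)  # decoder: [z_t; e_t] -> predicted frame


def forward(frame, action: int):
    z = [math.tanh(x) for x in matvec(W_enc, frame)]  # z_t, dim 128
    h = z + E_act[action]                             # [z_t; e_t], dim 144
    return h, matvec(W_dec, h)                        # predicted next frame


def l2_loss(pred, target) -> float:
    """Mean squared pixel error between predicted and ground-truth frames."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def decoder_step(frame, action: int, target, lr: float = 0.01) -> float:
    """One gradient step on W_dec only; returns the loss measured before the update."""
    h, pred = forward(frame, action)
    for i, row in enumerate(W_dec):
        grad_out = 2.0 * (pred[i] - target[i]) / len(target)  # dL/dpred_i
        for j in range(len(row)):
            row[j] -= lr * grad_out * h[j]                    # dL/dW_ij = dL/dpred_i * h_j
    return l2_loss(pred, target)
```

Repeated calls to `decoder_step` on the same `(frame, action, target)` triple shrink the pixel-wise error. Even at this toy scale the decoder alone holds 64 x 144 weights, which hints at why pixel-level modeling becomes expensive for realistic frame sizes.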

**Fig. 8.** Three neural environment synthesis paradigms are presented: Pixel-Level Modeling, Word-Level Modeling, and Latent-Level Modeling, which trade off fidelity and abstraction.

Table 6 aggregates neural synthesis methods, showing modality icons, architecture choices, base models, and evaluation metrics such as correctness, fidelity, and diversity.

Mechanisms for Agent Evolution

Maps the four agent‑evolution paradigms and their trade‑offs.

This section classifies how agents improve their capabilities across four distinct evolution paradigms.

Agent Evolution is the systematic progression of an agent’s abilities, driven by continual interaction with its environment and by internal updates to its parameters.

How does Agent Evolution differ from ordinary fine‑tuning?

Fine‑tuning only updates model weights on a fixed dataset, whereas Agent Evolution also reshapes the surrounding environment (memory, workflow, or task distribution) and may generate new data on the fly, creating a feedback loop between agent and environment.

Memory‑Centric Experience Evolution: agents accumulate and reuse experience stored in external memory bases, ranging from raw trajectories to abstract skill libraries.

Orchestration‑Centric Workflow Evolution: agents adapt the graph‑based workflow that orchestrates multiple sub‑agents, tools, or functions, moving from fixed pipelines to dynamically evolving topologies.

Trajectory‑Centric Offline Evolution: agents synthesize and refine interaction trajectories offline, using task synthesis, trajectory generation, and quality‑controlled refinement before training.

Exploration‑Centric Online Evolution: agents improve via reinforcement learning, shaping reasoning structures, reward signals, and training algorithms while interacting with the environment in real time.
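To make the memory-centric paradigm concrete, the following sketch stores raw trajectories in an external memory base and retrieves the most relevant successful one for a new task. The `Experience` and `MemoryBase` classes and the word-overlap similarity are hypothetical stand-ins; practical systems typically use learned embeddings and more abstract skill representations.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Experience:
    task: str
    trajectory: list[str]
    success: bool


@dataclass
class MemoryBase:
    """External store of past interactions (a hypothetical minimal design)."""
    items: list[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.items.append(exp)

    def retrieve(self, task: str, k: int = 1) -> list[Experience]:
        """Return up to k successful experiences ranked by word overlap with the task."""
        query = set(task.lower().split())

        def overlap(exp: Experience) -> int:
            return len(query & set(exp.task.lower().split()))

        successes = [e for e in self.items if e.success]
        return sorted(successes, key=overlap, reverse=True)[:k]
```

An agent would call `retrieve` before acting and place the returned trajectory in its prompt, so its behavior changes without any weight update.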

**Fig. 9.** Overview of agent evolution paradigms. Existing methods are organized into four categories: Memory-Centric Experience Evolution, which enhances agent capabilities through accumulated experiences; Orchestration-Centric Workflow Evolution, which adapts agent workflows to optimize task performance; Trajectory-Centric Offline Evolution, which refines agent behavior through synthetic task interaction data; and Exploration-Centric Online Evolution, which strengthens agent capabilities through real-time learning and adaptation via reinforcement.

Collect raw interaction data and store it in an external memory base (Memory‑Centric).

Design or adapt a workflow graph that routes tasks to appropriate sub‑agents or tools (Orchestration‑Centric).

Synthesize offline trajectories from generated tasks and refine them through filtering or correction (Trajectory‑Centric).

Deploy the agent in the target environment and apply reinforcement learning to close the loop (Exploration‑Centric).

Together, these components form a taxonomy that lets researchers choose the right evolution strategy for their application constraints.

Co-Evolving Environments for Agent Growth

Categorizes environment evolution into three paradigms and outlines their mechanisms.

Environment evolution determines how training environments change over time to keep pace with improving agents.

Environment evolution is the systematic process of adapting or expanding the training world so that it continually challenges the agent and drives capability growth.

How does Environment Evolution differ from the earlier notion of Environment Synthesis?

Environment Synthesis creates a static training world before learning begins, whereas Environment Evolution continuously modifies or expands that world during training, allowing the environment to react to the agent’s progress.
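One simple way to picture an environment that reacts to the agent's progress is a success-rate curriculum, in the spirit of difficulty-driven evolution. The thresholds, level bounds, and function names below are hypothetical choices for illustration, not a method from the survey.

```python
def adapt_difficulty(level: int, success_rate: float,
                     low: float = 0.3, high: float = 0.7,
                     min_level: int = 1, max_level: int = 10) -> int:
    """Raise difficulty when the agent succeeds often, lower it when it struggles."""
    if success_rate > high:
        level += 1
    elif success_rate < low:
        level -= 1
    return max(min_level, min(max_level, level))  # keep the level in bounds


def run_curriculum(success_rates: list[float], start: int = 1) -> list[int]:
    """Track the difficulty level across training rounds."""
    levels = [start]
    for rate in success_rates:
        levels.append(adapt_difficulty(levels[-1], rate))
    return levels
```

For example, `run_curriculum([0.9, 0.8, 0.5, 0.2, 0.1, 0.1])` yields `[1, 2, 3, 3, 2, 1, 1]`: the level climbs while the agent succeeds, holds when performance is in the target band, and falls back when the agent struggles.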

**Fig. 10.** Overview of environment evolution paradigms. Existing methods are organized into three categories: Neural-Driven Evolution, which evolves environments through self-play or world models; Difficulty-Driven Evolution, which adapts task difficulty via explicit curriculum signals or implicit curriculum mechanisms; and Scaling-Driven Evolution, which expands environment diversity at the scenario or environment level.

**Table 11.** The statistics of Environment Evolution Methods. **Str** stands for structured text, **Err** for error messages, and **NA** means not applicable or unspecified.

Together these paradigms illustrate complementary routes for evolving environments: model‑based adaptation, curriculum‑based difficulty shaping, and breadth‑based scaling.
