Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Stable self-evolution for LLM agents requires principle-level experience, step-wise injection, and off-policy distillation.

How can we prevent performance degradation in LLM agents during iterative self-evolution when they internalize past experiences?

LLM agents often fail to sustain performance during iterative self-evolution, as existing experience internalization methods collapse rather than compound over multiple cycles. The authors identify three critical design dimensions—experience granularity, injection pattern, and internalization regime—that determine whether an agent can reliably transform past interactions into reusable parametric capability. By shifting to principle-level experience, step-wise injection, and off-policy distillation, the system maintains stable, compounding performance gains across successive self-evolution iterations.

Paper Primer

The core mechanism hinges on aligning experience with the agent's decision-making process rather than treating it as static background context. The authors replace global, trajectory-level experience injection with a step-wise selector that provides decision-relevant guidance at each interaction state, and they switch from on-policy to off-policy distillation to ensure the student learns from high-quality, coherent teacher trajectories.

Step-wise injection prevents the capability collapse seen in global injection methods.

Global injection models frequently default to premature terminal answers (63.82% of cases) because fixed context misaligns with intermediate decision states, whereas step-wise injection maintains tool-use coherence. Step-wise injection improved WebWalkerQA performance from 23.2% to 31.2% in self-generated experience settings.

Off-policy distillation provides a more stable and efficient training signal than on-policy distillation.

On-policy distillation forces the teacher to correct student-induced flawed states, leading to trajectory inflation (21.9 turns vs. 4.5 for the teacher), while off-policy distillation uses rejection-sampled, concise teacher trajectories. Off-policy distillation sustains performance across multiple iterations, whereas on-policy regimes degrade as self-evolution proceeds.

Why does "principle-level" experience outperform "instance-level" experience?

Instance-level experience is cluttered with trajectory-specific artifacts like URLs and concrete numbers that transfer poorly to new queries; principle-level experience abstracts these into reusable strategies and failure patterns that remain valid across different interaction distributions.

What is the "experience-use ability" and why is it a prerequisite for self-evolution?

It is the model's capacity to benefit from explicit experience context at inference time. If a model loses this ability during internalization, it cannot serve as an effective "experience-aware teacher" for the next iteration, breaking the self-evolution loop.

Researchers building self-evolving agents should prioritize off-policy distillation on principle-level experience to avoid the instability inherent in reactive, on-policy correction loops.

Introduction and Motivation

We expose why iterative self‑evolution degrades and outline three design shifts to stabilize it.

Iterative self‑evolution via on‑policy context‑distillation quickly degrades LLM agents because the model cannot keep internalized knowledge aligned with evolving tool‑use demands.

**Figure 1.** Performance degradation under iterative on-policy context-distillation.

Experience internalization turns episodic interaction traces into permanent model parameters so the LLM can recall and reuse past knowledge without re‑feeding the original context.

How does experience internalization differ from simply fine‑tuning on the same trajectories?

Fine‑tuning treats the whole trajectory as a generic training example, while internalization explicitly extracts a reusable abstraction (the “experience”) and injects it back into the model at decision points, preserving the distilled strategy across future contexts.

In on‑policy distillation, the student is trained on trajectories it sampled itself, with the experience‑aware teacher supervising those student‑induced states, so the supervision reflects the student’s current mistakes.

Why can on‑policy distillation be unstable across multiple iterations?

Because each iteration’s teacher is built from the student’s own trajectories, any mistake the student makes becomes part of the teacher’s supervision. Over time these mistakes reinforce each other, leading to a progressive capability collapse.

Experience Granularity determines whether we store whole trajectories (instance‑level) or abstracted strategies (principle‑level); the latter survives longer across iterations.

Experience Injection Pattern controls when the distilled knowledge is applied; step‑wise injection aligns experience with intermediate decisions, unlike a single global injection.

Internalization Regime selects the trajectory distribution for distillation; off‑policy distillation uses high‑quality teacher trajectories, providing a stable signal unlike on‑policy’s noisy student‑generated data.

The core problem is that iterative on‑policy distillation causes performance to collapse as internalized knowledge drifts from tool‑use requirements.

Agent Trajectories and Formulation

Formulating continual experience internalization and its key components.

ReAct agents interleave reasoning and tool use, but as they evolve they accumulate experience that can clash with the raw policy, causing performance to drift over successive iterations.

The agent repeatedly thinks, acts, and observes, building a chronological record that can be turned into reusable experience.

Step 1: History $H_0 = (x)$. The agent produces thought $\tau_1$ = “Need factual info”, selects tool $a_1$ = Search, and receives observation $o_1$ = “Paris is the capital of France”.

Step 2: History $H_1 = (x, (\tau_1, a_1, o_1))$. The agent thinks $\tau_2$ = “Answer the user”, chooses the final‑answer action $a_2$ = Answer, and outputs “Paris”.

The trajectory $H_2 = (x, (\tau_1, a_1, o_1), (\tau_2, a_2, \emptyset))$ is summarized by the LLM into the experience entry $e_1$ = “Answered capital‑of‑France query via a search tool”.

The experience pool becomes $E = \{e_1\}$.

This concrete walk‑through shows how the reasoning step separates tool selection from the final answer, and how the resulting experience entry can later guide other agents without re‑issuing the tool.
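The walk‑through above can be sketched as a small data structure. This is only an illustration: the class names, the `summarize` stand‑in (which replaces the LLM summarizer with a fixed template), and the literal query string are assumptions, not part of the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    thought: str                 # tau_t
    action: str                  # a_t: a tool name or "Answer"
    observation: Optional[str]   # o_t: None for the terminal answer step


@dataclass
class Trajectory:
    query: str                   # x
    steps: List[Step] = field(default_factory=list)

    def history(self, t: int) -> tuple:
        """Return H_t: the query followed by the first t steps."""
        return (self.query, *self.steps[:t])


def summarize(traj: Trajectory) -> str:
    """Hypothetical stand-in for the LLM summarizer: lists the tools used."""
    tools = [s.action for s in traj.steps if s.action != "Answer"]
    return f"Answered query using tools: {', '.join(tools) or 'none'}"


# The query text is an assumed example; the source only implies it.
traj = Trajectory(query="capital of France?")
traj.steps.append(Step("Need factual info", "Search", "Paris is the capital of France"))
traj.steps.append(Step("Answer the user", "Answer", None))

experience_pool = [summarize(traj)]   # E = {e_1}
```

Keeping thoughts, actions, and observations as separate fields makes it straightforward to build any intermediate history $H_t$ when experience has to be matched to a specific decision state.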

How does ReAct differ from a standard reinforcement‑learning agent that directly maps observations to actions?

ReAct inserts an explicit reasoning stage (the thought $\tau_t$) before any action, and it records the full reasoning‑action‑observation chain as natural‑language experience. Standard RL agents typically learn a policy $\pi(a \mid s)$ without such an intermediate, interpretable thought, making their behavior harder to audit and their experience harder to reuse.

Continual experience internalization treats the policy as a sequence of updates: at iteration $k$ the current policy $\pi_{\theta^{(k)}}$ generates trajectories, those trajectories are compressed into a pool $E^{(k)}$, and the same policy, now conditioned on $E^{(k)}$, teaches the next policy $\pi_{\theta^{(k+1)}}$ via the Internalize operation.
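The iteration described above can be sketched as a loop. This is a minimal sketch under stated assumptions: `self_evolve` and the three callables are hypothetical names, and the toy stand‑ins replace real rollouts, LLM summarization, and distillation with trivial functions so the control flow can be run end to end.

```python
from typing import Any, Callable, List, Tuple


def self_evolve(
    policy: Any,
    queries: List[str],
    rollout: Callable[[Any, str], Any],
    summarize: Callable[[Any], str],
    internalize: Callable[[Any, List[str]], Any],
    iterations: int = 3,
) -> Tuple[Any, List[List[str]]]:
    """Run repeated rounds of generate -> summarize -> internalize."""
    pools = []
    for _ in range(iterations):
        trajectories = [rollout(policy, q) for q in queries]   # pi_theta^(k) acts
        pool = [summarize(t) for t in trajectories]            # build E^(k)
        policy = internalize(policy, pool)                     # pi_theta^(k+1)
        pools.append(pool)
    return policy, pools


# Toy stand-ins (assumptions): the "policy" is just a version counter.
def toy_rollout(version: int, query: str) -> str:
    return f"v{version}:{query}"


def toy_summarize(trajectory: str) -> str:
    return f"experience from {trajectory}"


def toy_internalize(version: int, pool: List[str]) -> int:
    return version + 1


final_policy, pools = self_evolve(
    0, ["q1", "q2"], toy_rollout, toy_summarize, toy_internalize
)
```

Note that each round rebuilds the pool from trajectories of the current policy, which is why a policy that loses its ability to use experience weakens every later round.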

Experience Granularity

We expose how granularity of experience determines the stability of internalization across iterations.

When agents internalize experience, the granularity of that experience determines whether the gain survives repeated self‑evolution. Fine‑grained, trajectory‑specific data often yields only fleeting improvements, while coarser, principle‑level abstractions provide a lasting learning signal.

Think of instance‑level experience as a single recipe that lists every ingredient and exact cooking time, whereas principle‑level experience is the underlying cooking technique that works across many recipes.

Count URL‑bearing items in the five‑item example set: 3 / 5 = 60 % (instance‑level).

Count numeric items: 2 / 5 = 40 % (instance‑level).

Count query‑specific strings: 3 / 5 = 60 % (instance‑level).

Apply the principle‑level filter: keep only items that express a reusable rule (none in this tiny set), yielding 0 / 5 = 0 % principle‑level.

Result: the filtered principle‑level set discards all local artifacts, leaving a clean, strategy‑only signal.

Principle‑level filtering dramatically shrinks the data while preserving the high‑level decision logic that generalizes across future queries.

How does principle‑level experience differ from instance‑level experience beyond just “being less detailed”?

Instance‑level items keep every concrete token (e.g., a specific URL), so they act like memorized examples. Principle‑level items replace those tokens with abstract rules (e.g., “prefer reputable sources”), which means the model learns a reusable heuristic instead of a literal string.
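To make the distinction concrete, the sketch below flags entries that carry trajectory‑specific artifacts. It is a rough heuristic, not the paper's method: the source describes abstraction by an LLM, whereas this uses two assumed regular expressions (URLs and digits), and the three entries are invented examples.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # concrete links
NUM_RE = re.compile(r"\d")             # concrete numbers


def is_instance_level(entry: str) -> bool:
    """Heuristic: an entry with a URL or a digit is tied to one trajectory."""
    return bool(URL_RE.search(entry) or NUM_RE.search(entry))


# Hypothetical experience entries for illustration only.
entries = [
    "Visited https://example.com/page and found the answer",
    "The population figure was 67 million",
    "Prefer reputable sources when results conflict",
]

principle_level = [e for e in entries if not is_instance_level(e)]
```

A filter like this only selects entries that are already abstract; turning an instance‑level entry into a principle still requires rewriting it, which is the role the summarizing model plays.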

**Figure 2.** Effect of Experience Granularity on Qwen3-4B-Instruct-2507 under iterative on-policy context-distillation. Dashed lines denote base and in-context performance.

Experience Injection Patterns

Step‑wise injection feeds state‑relevant experience to the teacher, outperforming global injection.

We now examine how experience should be injected into the teacher prompt. Fixing the granularity to principle‑level experience, we compare two injection patterns under on‑policy context‑distillation.

Step‑wise injection feeds the teacher with experience that is selected for the current interaction state, rather than a fixed trajectory‑level context.

How does step‑wise injection differ from simply appending the entire experience pool to the prompt at every step?

Appending the whole pool treats experience as a static background, so the teacher cannot weight the parts that matter for the current decision. Step‑wise injection actively selects the most relevant experience for the current history, producing a teacher distribution that changes with the state.
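The contrast between the two patterns can be sketched as two selection functions. The word‑overlap ranking here is an assumption chosen to keep the example self‑contained; the source does not specify how the step‑wise selector scores relevance, and the pool entries are invented.

```python
from typing import List


def tokenize(text: str) -> set:
    return set(text.lower().split())


def select_stepwise(history: str, pool: List[str], k: int = 1) -> List[str]:
    """Rank experience by word overlap with the current history; keep the top k."""
    h = tokenize(history)
    ranked = sorted(pool, key=lambda e: len(tokenize(e) & h), reverse=True)
    return ranked[:k]


def select_global(history: str, pool: List[str]) -> List[str]:
    """Global injection ignores the state and returns the whole pool."""
    return list(pool)


# Hypothetical principle-level entries.
pool = [
    "when planning a search start with broad keywords",
    "when verifying evidence compare two independent sources",
]

step1 = select_stepwise("planning the first search", pool)
step2 = select_stepwise("verifying the evidence found", pool)
```

Here `step1` and `step2` differ because the history differs, while `select_global` would hand the teacher the same two entries at both steps.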

Step 1 (search planning): step‑wise injection selects $E_1$, giving a teacher distribution $p_{\text{step},1} = (0.8, 0.2)$. Measured from a uniform reference $u = (0.5, 0.5)$ with natural logarithms, $\mathrm{KL}(u \,\|\, p_{\text{step},1}) = 0.5\log\frac{0.5}{0.8} + 0.5\log\frac{0.5}{0.2} \approx 0.22$.

Step 2 (evidence verification): step‑wise injection selects $E_2$, yielding $p_{\text{step},2} = (0.3, 0.7)$, so $\mathrm{KL}(u \,\|\, p_{\text{step},2}) = 0.5\log\frac{0.5}{0.3} + 0.5\log\frac{0.5}{0.7} \approx 0.09$.

Step 3 (answer generation): no specific experience is needed, so $p_{\text{step},3} = (0.5, 0.5)$ and $\mathrm{KL}(u \,\|\, p_{\text{step},3}) = 0$.

Global injection keeps the same teacher distribution at every step, $p_{\text{glob},t} = (0.5, 0.5)$, so $\mathrm{KL}(u \,\|\, p_{\text{glob},t}) = 0$ for all steps.

Step‑wise injection creates a non‑uniform teacher that provides informative supervision exactly where the decision matters, whereas global injection offers no such signal.
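The KL values in the worked example can be checked numerically. The sketch assumes the divergence is taken from the uniform distribution (0.5, 0.5) to each teacher distribution, with natural logarithms, which is the reading that reproduces the figures quoted above.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


uniform = (0.5, 0.5)
step_teachers = [(0.8, 0.2), (0.3, 0.7), (0.5, 0.5)]   # steps 1, 2, 3
global_teachers = [(0.5, 0.5)] * 3                     # same at every step

kl_step = [round(kl(uniform, q), 2) for q in step_teachers]
kl_global = [round(kl(uniform, q), 2) for q in global_teachers]

print(kl_step)    # [0.22, 0.09, 0.0]
print(kl_global)  # [0.0, 0.0, 0.0]
```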

**Figure 3.** Effect of Experience Injection Pattern on Qwen3-4B-Instruct-2507 under iterative on-policy context-distillation. Dashed lines denote base performance.

**Figure 4.** Case study of **premature answering** under global injection. After iterative training, the model trained with global injection terminates without invoking search tools, whereas step-wise injection preserves evidence-seeking tool use before answering.

Across multiple iterations, step‑wise injection sustains gains while global injection quickly degrades, confirming the importance of state‑specific experience selection.

Why Step-wise Injection Works

Step‑wise injection keeps the injected experience matched to the agent’s current decision state, which preserves experience‑use ability across iterations.

We compare step‑wise injection against global injection to assess their impact on the model’s ability to retain and use experience over successive self‑evolution iterations.

Instead of feeding the entire experience pool at once, the model injects only the experience that is relevant to the current interaction step.

How does step‑wise injection differ from global injection?

Global injection supplies the full experience context regardless of the current state, often exposing irrelevant terminal cues. Step‑wise injection, by contrast, picks experience that matches the agent’s present interaction, ensuring the injected signal remains decision‑relevant and preventing premature answers.

Empirically, step‑wise models continue to benefit from experience across self‑evolution iterations, whereas global‑injection models degrade both with and without experience context.

The failure of global injection stems from a mismatch between the fixed experience context and the evolving decision state: the teacher receives the same experience throughout a trajectory, so useful later‑stage information may be exposed too early, while current‑relevant cues are ignored.

The iterative self‑evolution mechanism relies on step‑wise injection to preserve experience‑use ability and avoid premature answers.

Internalization Regime

We reveal how off‑policy distillation stabilizes self‑evolution despite prior degradation.

Iterative self‑evolution tends to degrade because on‑policy context‑distillation forces the student to learn from its own imperfect rollouts. Shifting to principle‑level experience, step‑wise injection, and off‑policy distillation can restore stability.

We therefore examine how the choice of trajectory distribution—on‑policy versus off‑policy—impacts the coherence of supervision and the cost of rollouts.

Instead of letting the student generate a trajectory and then correcting it, the teacher first generates full, experience‑guided rollouts, and only the successful ones are kept for the student to imitate.

How does off‑policy context‑distillation differ from the on‑policy variant?

On‑policy first runs the current student (which lacks experience) to produce a trajectory, then the teacher reacts to those states; off‑policy lets the teacher generate the entire trajectory up front, so the student learns from a fully guided, high‑quality rollout rather than from reactive corrections.

Trajectory A: steps (a₁, a₂, a₃) → success 0 (rejected).

Trajectory B: steps (b₁, b₂, b₃) → success 1 (accepted).

Trajectory C: steps (c₁, c₂, c₃) → success 0 (rejected).

The student updates its policy using only the accepted trajectory B, copying the three steps directly.

Training on the accepted, teacher‑generated rollout gives the student a concise, experience‑rich example, avoiding the long, noisy trajectories that on‑policy updates would produce.
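The rejection step in this example can be sketched in a few lines. The `Rollout` record and `rejection_filter` name are assumptions for illustration; in practice the success flag would come from checking the teacher's final answer, which is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    name: str
    steps: List[str]
    success: bool   # whether the teacher-generated rollout solved the task


def rejection_filter(rollouts: List[Rollout]) -> List[Rollout]:
    """Keep only successful teacher rollouts as student training data."""
    return [r for r in rollouts if r.success]


teacher_rollouts = [
    Rollout("A", ["a1", "a2", "a3"], False),
    Rollout("B", ["b1", "b2", "b3"], True),
    Rollout("C", ["c1", "c2", "c3"], False),
]

training_set = rejection_filter(teacher_rollouts)
```

Only trajectory B survives, so the student imitates three teacher steps and never sees states induced by its own mistakes.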

**Figure 5.** Effect of Internalization Regime across self-evolution iterations. We compare off-policy context-distillation with on-policy context-distillation under principle-level experience and step-wise injection on Qwen3-4B-Instruct-2507 and Qwen3-8B. Dashed lines denote the base model without experience internalization.

The two regimes also diverge in rollout cost because trajectory length determines interaction overhead. On‑policy updates inflate trajectories dramatically, whereas off‑policy keeps them short.

Stable Multi-Iteration Self-Evolution

Stable self‑evolution yields consistent gains across iterations.

Iterative self‑evolution can degrade because on‑policy distillation conflicts with tool use. Shifting to principle‑level experience, step‑wise injection, and off‑policy distillation stabilizes the process.

Stable self‑evolution with principle‑level experience, step‑wise injection, and off‑policy distillation improves the internalized model by an average of 4.2 % over the vanilla base after three iterations.

Figure 6 shows consistent gains across WebWalkerQA, GAIA, and BrowseComp‑ZH, with the internalized variant reaching the highest scores at iteration 3.

**Figure 6.** Self-evolution performance of Qwen3-4B-Instruct-2507 under our final setting. Cyan bars denote internalized inference without inference-time experience, while red bars denote in-context experience use with the corresponding experience pool. The results show that our setting sustains performance gains across self-evolution iterations and preserves the model's ability to benefit from explicit experience.

**Figure 7.** Experience internalization and in-context experience use under DeepSeek-generated principle-level experience and off-policy context-distillation. Top panels use global injection, and bottom panels use step-wise injection. Cyan bars denote internalized inference without inference-time experience, while red bars denote performance with the corresponding experience pool provided in context.

**Figure 8.** Experience internalization and in-context experience use under global injection with principle-level self-generated experience and off-policy context-distillation. Cyan bars denote internalized inference without inference-time experience, while red bars denote performance with the corresponding experience pool provided in context.

**Table 4.** Self-evolution results under different experience sources, injection patterns, and distillation regimes.

Related Work

We situate our approach among prior experience‑learning and self‑evolving LLM work.

Context‑Based Experience Learning treats the trajectories of LLM agents as a reusable resource. Prior work organizes these methods into three families: storage (preserving raw trajectories for later retrieval), reflection (refining stored experience via self‑feedback), and abstraction (distilling experience into reusable skills or summarized knowledge). All of these retain experience only as inference‑time context, so their benefit is limited by the model’s in‑context learning capacity and can suffer from context collapse as the experience pool grows.

Experience Internalization refers to embedding interaction experience into model parameters; it is typically achieved through context distillation. Early approaches used off‑policy distillation, training a student on teacher‑generated trajectories, but they often suffer a training–inference mismatch. More recent work adopts on‑policy distillation, supervising trajectories sampled from the student itself to improve distributional consistency, yet these studies focus on a single transfer round and leave the stability of multi‑iteration internalization unexplored.

Self‑evolving LLM agents iteratively improve by leveraging interaction data, feedback, and self‑generated experience. Existing methods fall into policy‑level approaches that update the whole agent model from trajectories and feedback, and component‑level approaches that evolve external structures such as memory, tools, or experience libraries. Recent closed‑loop systems jointly train the model while refreshing the experience pool, highlighting the need for robust experience representation and internalization to sustain the improvement loop across multiple iterations.

Implementation Details

Full implementation settings and complete self‑evolution results are provided.

Our agent follows the ReAct‑style interaction described in Section 3. At each step it emits either a tool call or a terminal answer, using five tools: Search, Visit, Python, Scholar, and File Parser. Interactions are limited to $T_{\text{max}} = 100$ steps with a context window of $32{,}768$ tokens.

Training trajectories are drawn from a $15\text{K}$‑example web‑reasoning corpus (Section 4). For on‑policy context‑distillation, the current student model generates trajectories that the experience‑aware teacher supervises; for off‑policy distillation, the teacher generates trajectories which are then filtered by rejection sampling before training.

DeepSeek‑V4 extracts and selects experience, summarizing trajectories into natural‑language entries; in the Qwen self‑generated setting, the student‑side Qwen model performs this extraction instead. Instance‑level experience retains full trajectory observations, while principle‑level experience abstracts reusable strategies, search principles, and failure patterns.

All distillation training is implemented with VerL (Sheng et al., 2025). Students are optimized with AdamW at a learning rate of $1\times10^{-5}$, batch size $128$, for $5$ epochs on $8\times$ NVIDIA A800 GPUs. On‑policy distillation uses student‑induced trajectories with teacher supervision at each step; off‑policy distillation trains on the rejection‑filtered teacher‑generated trajectories.

Table 4 reports the full self‑evolution results across experience sources, injection patterns, distillation regimes, and model backbones. The data confirm that step‑wise injection is more stable than global injection across iterations, and that off‑policy context‑distillation yields stronger multi‑iteration performance under the same principle‑level, step‑wise setting.

Self‑evolution runs for three internalization iterations. In each iteration the current model generates trajectories, the trajectories are summarized into an updated experience pool, and the resulting experience‑conditioned behavior is distilled into the next model. Unless noted otherwise, each iteration refreshes the experience pool using trajectories generated by the current model.

At inference time models are evaluated without in‑context experience unless explicitly marked; generation uses temperature $0.7$. WebWalkerQA and BrowseComp‑ZH are evaluated with one rollout per query (reported as Pass@1), while GAIA‑Text‑103 is evaluated over three rollouts per query (average accuracy reported).
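The two reporting conventions above can be sketched as follows. The function names and the boolean outcomes are hypothetical; the sketch assumes each rollout is graded as simply correct or incorrect.

```python
from typing import List


def pass_at_1(correct: List[bool]) -> float:
    """One rollout per query: the fraction of queries answered correctly."""
    return sum(correct) / len(correct)


def mean_accuracy(rollouts: List[List[bool]]) -> float:
    """Several rollouts per query: average per-query accuracy over queries."""
    per_query = [sum(r) / len(r) for r in rollouts]
    return sum(per_query) / len(per_query)


# Hypothetical graded outcomes.
single = [True, False, True, True]                       # four queries, one rollout each
triple = [[True, True, False], [False, False, False]]    # two queries, three rollouts each

print(pass_at_1(single))   # 0.75
avg = mean_accuracy(triple)
```

Averaging over three rollouts reduces the variance introduced by sampling at temperature 0.7, which matters more on a small benchmark.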

Figures 7 and 8 analyze experience‑use ability across self‑evolution iterations. With DeepSeek‑generated principle‑level experience and off‑policy distillation, global injection shows unstable in‑context experience use, whereas step‑wise injection maintains stronger internalized performance and better preserves the model’s ability to benefit from explicit experience.

The configuration tables list internalized versus in‑context inference results for Qwen3‑4B‑Instruct‑2507 and Qwen3‑8B under various experience sources (Qwen‑generated, DeepSeek‑generated) and injection patterns (global, step‑wise) with both on‑policy and off‑policy distillation across three iterations.
