EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

EvoMem preserves memory evolution as a patch history, enabling agents to reason across changing environment versions.

How can LLM agents maintain performance in environments where tasks and user preferences evolve over time, rather than remaining static?

LLM agents are typically evaluated on static snapshots, causing them to fail when real-world environments—like software codebases or terminal workflows—evolve over time. EvoMem addresses this by augmenting standard memory with an append-only patch history that records what changed, why it changed, and the evidence supporting the update. This approach consistently improves performance across evolving benchmarks, yielding an average 3.7% gain in chain-level accuracy by allowing agents to selectively retrieve relevant prior states.

Paper Primer

Most agents consolidate memory into a single "latest" state, which effectively discards the context of previous versions. When an environment update overwrites a rule that might still be valid for a rollback or a different organization, the agent loses the ability to reason about why that rule was changed or when it was applicable.

EvoMem is a version-aware memory paradigm: it treats memory updates as a git-like evolution trace. By storing non-additive changes as explicit patches—containing the pre-update state, post-update state, rationale, and triggering evidence—the system allows agents to retrieve historical context alongside their current memory when a query depends on temporal changes or overwritten information.

EvoMem significantly improves agent robustness in persistent, evolving environments.

Across the EvoArena benchmark suite, EvoMem achieved an average 1.5% gain in step-level accuracy and a 3.7% gain in chain-level accuracy. The chain-level improvement is particularly pronounced in Terminal-Bench-Evo, where accuracy gains increased from 2.4% at the step level to 6.1% at the chain level.

EvoMem generalizes to standard, non-evolving agent benchmarks.

The paradigm improved performance on GAIA by 6.1% and LoCoMo by 4.8%. These results suggest that preserving update history provides a general benefit for long-horizon agent reasoning, even outside of explicitly versioned environments.

Why does the paper argue that "latest-state" memory is insufficient for agents?

In dynamic environments, knowledge is often version-dependent. Overwriting old memory with the latest state erases the context of why a change occurred, preventing the agent from recovering prior behaviors that may still be valid for different versions or rollbacks.

What is the core difference between EvoArena and existing agent benchmarks?

EvoArena evaluates persistent environment evolution, where the same setting changes across a chain of versions. Unlike static benchmarks or those that only refresh tasks, EvoArena requires agents to track what changed, what remains valid, and how to adapt to new releases.

EvoMem does not replace the base agent's memory updater; it acts as a non-invasive wrapper that monitors memory transitions and creates an append-only patch history, making it compatible with diverse architectures like Terminus2, OpenHands, and A-Mem.

Reliable agent deployment requires moving beyond static snapshots; memory should be treated as an evolving history of grounded updates rather than a single mutable store.

Introduction: The Challenge of Dynamic Environments

LLM agents falter in evolving environments, prompting EvoArena and EvoMem to enable robust, version‑aware reasoning.

Current evaluations treat environments as immutable snapshots, but real deployments require agents to adapt to continuously changing interfaces, rules, codebases, and user preferences. Agents that keep only a static memory falter under these changes, which motivates a paradigm that treats memory as an evolving history.

Environment Evolution denotes a sequence of successive releases of the same underlying task where interfaces, rules, or preferences change while the core goal remains constant.

EvoArena operationalizes Environment Evolution by chaining releases of terminal, software, and social‑preference environments, while EvoMem augments standard memory with an append‑only patch history that records pre‑ and post‑update states, rationales, and supporting evidence.

The shift from static to evolving benchmarks forces agents to treat memory as a versioned history rather than a single snapshot.

The EvoArena Benchmark

We situate EvoArena among prior benchmarks and outline its three evolving domains.

Recent benchmarks have begun to model realistic interaction settings, but most treat the environment as static. Consequently, they cannot assess an agent’s ability to preserve valid prior behavior while adapting to new changes.

EvoArena is a benchmark suite that evaluates agents under persistent environment evolution by chaining multiple versions of the same task.

Terminal‑Bench‑Evo is a chain of executable terminal workflows where each version keeps the same end goal but modifies files, paths, permissions, or deployment policies.

SWE‑Chain‑Evo is a series of software‑evolution milestones where each step adds a concrete patch to a growing repository snapshot.

PersonaMem‑Evo models long‑horizon preference evolution by presenting a multi‑turn conversation where user preferences shift over time.

**Figure 2.** **EvoArena construction.** We convert static agent benchmarks into versioned evolution chains across executable workflows, software engineering, and social intelligence, testing whether agents can adapt to new changes while preserving still-valid prior behavior.

**Figure 3.** Distribution of EvoArena. The central circle shows domain proportions, and the surrounding panels show the question type distribution within each domain.

Table 2 reports key statistics of the three EvoArena subsets, including chain lengths, difficulty breakdowns, and test counts.

The EvoMem Memory Paradigm

Static memory loses historical behavior; EvoMem records updates as patches and retrieves them when needed.

When an agent overwrites its memory with the newest observation, earlier rules that remain valid for older releases or other organizations disappear. In dynamic settings, behavior often depends on which version of a rule was in effect. EvoMem solves this by keeping a trace of meaningful updates instead of discarding them.

EvoMem augments any existing memory system with an append‑only log of “patches” that capture what changed, why it changed, and the evidence that triggered the change.

How is EvoMem different from simply keeping the latest memory snapshot?

EvoMem does not discard overwritten information. Instead of replacing the old state, it records the overwritten fields, the rationale, and supporting evidence as a separate patch, which can later be consulted when a query depends on that historic version.

Worked example: compute the diff $\Delta_t = \text{Diff}(M_{t-1},M_t)$, which reports “rule B changed from v1 to v2”.

Form a patch $p_t$ that stores the timestamp $\tau_t$, the before‑content $C^{-}_t$ (rule B = v1), the after‑content $C^{+}_t$ (rule B = v2), the rationale $r_t$ (e.g., “new API requires v2”), a short summary $z_t$ (“upgrade rule B”), and evidence $e_t$ (the command that caused the change).

Append $p_t$ to the immutable patch log $P_{1:t}$.

Even though the latest memory now shows rule B = v2, the patch preserves the fact that v1 existed and why it was superseded, allowing the agent to answer queries about the older version.
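The rule B example above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dictionary-based memory, the `Patch` class, and the `diff` and `append_patch` helpers are all hypothetical names chosen here, and the evidence string is a placeholder.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Memory = Dict[str, Any]  # assumption: memory is a flat key/value store


@dataclass(frozen=True)
class Patch:
    timestamp: int   # tau_t
    before: Memory   # C^-_t, the overwritten content
    after: Memory    # C^+_t, the content that replaced it
    rationale: str   # r_t
    summary: str     # z_t
    evidence: str    # e_t


def diff(prev: Memory, curr: Memory) -> Tuple[Memory, Memory]:
    """Return (before, after) for keys whose existing value changed or vanished."""
    before: Memory = {}
    after: Memory = {}
    for key, old in prev.items():
        if key not in curr or curr[key] != old:
            before[key] = old
            after[key] = curr.get(key)  # None marks a removed entry
    return before, after


def append_patch(log: List[Patch], timestamp: int, prev: Memory, curr: Memory,
                 rationale: str, summary: str, evidence: str) -> List[Patch]:
    """Return a new log with one more patch; purely additive updates add nothing."""
    before, after = diff(prev, curr)
    if not before:
        return log
    return log + [Patch(timestamp, before, after, rationale, summary, evidence)]


m_prev = {"rule_A": "v1", "rule_B": "v1"}
m_curr = {"rule_A": "v1", "rule_B": "v2", "rule_C": "v1"}  # rule_C is additive
patch_log = append_patch([], 1, m_prev, m_curr,
                         rationale="new API requires v2",
                         summary="upgrade rule B",
                         evidence="<command that caused the change>")
print(patch_log[0].before, "->", patch_log[0].after)
# prints {'rule_B': 'v1'} -> {'rule_B': 'v2'}
```

Returning a new list rather than mutating the old one is one simple way to keep earlier log states untouched, in the spirit of an append-only history.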

A patch is a structured record that captures a non‑additive memory update: when something is overwritten, the patch stores the old value, the new value, the reason, and any supporting evidence.

Diff reports $C^{-}_t$ = “alice@example.com”, $C^{+}_t$ = “alice@newdomain.com”.

Rationale $r_t$ is “user updated email after domain migration”.

Summary $z_t$ is “email update”.

Evidence $e_t$ stores the UI interaction log that captured the edit.

The patch $p_t$ is appended to $P_{1:t}$, preserving the old address for any legacy processes that still expect the former domain.

Later queries that need to send a message to the old address can retrieve this patch and understand that the change was intentional, not a data loss.

Given a query $q$, retrieve the standard context $c_{\text{mem}}$ from the latest memory $M_T$ using the base retriever $R_{\text{mem}}$.

Query the patch log $P_{1:T}$ with $R_{\text{patch}}$ to obtain the top‑$k$ relevant patches $P_q$.

Concatenate the base context and the retrieved patches: $c(q)=\text{Concat}(c_{\text{mem}},P_q)$.

Pass $c(q)$ to the downstream agent for response generation or action selection.
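The four retrieval steps above can be sketched as follows. The summary does not say how $R_{\text{patch}}$ scores patches, so this sketch substitutes a simple token-overlap score as a stand-in; `retrieve_patches`, `build_context`, and the string representation of patches are assumptions made for illustration.

```python
from typing import Callable, List


def _tokens(text: str) -> set:
    return set(text.lower().split())


def retrieve_patches(query: str, patch_log: List[str], k: int) -> List[str]:
    """Stand-in for R_patch: rank patches by token overlap, keep the top k."""
    q = _tokens(query)
    scored = [(len(q & _tokens(p)), i, p) for i, p in enumerate(patch_log)]
    relevant = sorted((s for s in scored if s[0] > 0),
                      key=lambda s: (-s[0], s[1]))  # best score first, then oldest
    return [p for _, _, p in relevant[:k]]


def build_context(query: str, base_retriever: Callable[[str], str],
                  patch_log: List[str], k: int = 2) -> str:
    """c(q) = Concat(c_mem, P_q): base context first, then the retrieved patches."""
    c_mem = base_retriever(query)
    p_q = retrieve_patches(query, patch_log, k)
    return "\n".join([c_mem, *p_q])


latest_memory = "rule B = v2"
patch_log = [
    "rule B changed from v1 to v2 because new API requires v2",
    "email changed after domain migration",
]
context = build_context("which version of rule B applied before the upgrade",
                        lambda q: latest_memory, patch_log)
print(context)
```

Here the base retriever only sees the latest state, while the patch retriever surfaces the one patch that mentions rule B, so the downstream agent receives both the current value and the superseded one.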

**Figure 7.** Overview of EvoMem. EvoMem augments a base memory system with an append-only patch history that records behaviorally meaningful memory updates and retrieves relevant patches as versioned evidence at inference time.

Experimental Results

EvoMem boosts step-level accuracy by 1.5% and chain-level accuracy by 3.7% on average across EvoArena.

EvoMem raises average step-level accuracy by 1.5% across the three EvoArena benchmarks.

Table 3 shows step-level gains of +2.4% (Terminal-Bench-Evo), +0.5% (SWE-Chain-Evo), and +1.8% (PersonaMem-Evo); Table 4 shows gains of +6.1% (GAIA) and +4.8% (LoCoMo). Table 3 also reports chain-level gains of +6.1% (Terminal-Bench-Evo), +2.9% (SWE-Chain-Evo), and +3.0% (PersonaMem-Evo).

**Figure.** A scatter plot comparing Chain-level Accuracy (%) on the y-axis and Step-level Accuracy (%) on the x-axis for various AI models. A dashed diagonal line labeled "step = chain" indicates parity between the two metrics. Models are represented by logos and labels, with specific performance values listed below each. To the right, two text boxes define "Step-level Acc." as solving individual evolved tasks for local adaptation, and "Chain-level Acc." as solving full evolution chains for sustained reliability.
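The difference between the two metrics in the figure can be made concrete with a small sketch. It assumes that a chain counts as solved only when every version in it is solved, which is one natural reading of "solving full evolution chains"; the function names and the toy results are hypothetical.

```python
from typing import Dict, List


def step_level_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual evolved tasks solved, pooled over all chains."""
    outcomes = [ok for chain in results.values() for ok in chain]
    return sum(outcomes) / len(outcomes)


def chain_level_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of chains in which every version was solved (assumed definition)."""
    return sum(all(chain) for chain in results.values()) / len(results)


# Toy per-version outcomes for three evolution chains.
results = {
    "chain_a": [True, True, True],
    "chain_b": [True, False, True],
    "chain_c": [True, True, False, True],
}
step_acc = step_level_accuracy(results)    # 8 of 10 versions solved
chain_acc = chain_level_accuracy(results)  # 1 of 3 chains fully solved
```

Under this reading a single failed version sinks its whole chain, which is why chain-level accuracy sits at or below step-level accuracy.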

EvoMem consistently improves performance across evolving chains.

Analysis: When and Why EvoMem Helps

Ablation analysis shows when EvoMem improves agents and why.

EvoMem’s benefits are examined through targeted ablations that isolate the effect of each memory‑related component.

EvoMem gains a +6.5% accuracy boost when a patch example is retrieved versus when no example is available.

Table 5 shows EvoMem accuracy 65.3% with a retrieved patch example compared to 41.2% without, a +6.5% improvement.

High evolved‑requirement coverage yields a +5.3% gain over low coverage.

Table 5 reports 87.5% accuracy for high coverage versus 80.6% for low, a +5.3% difference.

When patch uptake is non‑zero, EvoMem’s gain rises to +6.2% compared with +3.1% when uptake is zero.

Table 5 lists a +6.2% gain for instances with patch uptake > 0 versus +3.1% when no uptake occurs.

Command‑level patch uptake creates a +5.7% gain gap between conditions.

Table 5’s “Gain gap” column shows a +5.7% difference between the no‑uptake and uptake‑>0 rows for command‑level patches.

EvoMem reduces `PASS_TO_PASS` regression failures by an average of 2.77 percentage points.

Table 6 reports an average regression rate of 9.09% for the Base configuration and 6.32% with +EvoMem, a reduction of 2.77 percentage points.
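As a rough illustration of how such a regression rate could be computed, the sketch below assumes that the rate is the share of previously passing tests that fail after the agent's change, following the usual meaning of `PASS_TO_PASS`. The summary does not give the exact formula, and the function name and test IDs are hypothetical.

```python
from typing import Set


def regression_rate(pass_to_pass: Set[str], passing_after: Set[str]) -> float:
    """Percentage of previously passing tests that no longer pass (assumed definition)."""
    if not pass_to_pass:
        return 0.0
    broken = pass_to_pass - passing_after
    return 100.0 * len(broken) / len(pass_to_pass)


# Toy example: four tests passed before the change, one of them breaks afterwards.
before = {"t1", "t2", "t3", "t4"}
after = {"t1", "t2", "t4", "t5"}
rate = regression_rate(before, after)  # t3 regressed: 1 of 4

# The reported improvement is a difference in percentage points.
reported_drop = round(9.09 - 6.32, 2)
```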

On temporal‑trajectory questions, EvoMem improves accuracy by +5.8%.

Table 7 shows accuracy rising from 28.6% (Base) to 44.4% (+EvoMem), a +5.8% gain.

Overall, EvoMem raises PersonaMem-Evo accuracy by +2.0%.

Table 7’s overall row moves from 40.5% to 42.5%, a +2.0% increase.

Row‑level evidence capture improves by +2.4% with EvoMem.

Table 8 reports row‑level capture of 72.5% (Base) versus 74.9% (+EvoMem), a +2.4% gain.

**Figure 8.** Accuracy versus total token usage across evaluated backbone models. Total token usage is measured in millions of tokens, and lower usage indicates higher inference efficiency. Dashed lines mark the cross-model averages.

These ablations collectively reveal that EvoMem is most beneficial when retrieved patches are actually incorporated into reasoning or concrete commands, and that preserving coherent versioned evidence translates into measurable gains across diverse tasks.

Limitations and Future Directions

Limits, impact, and dataset construction details for EvoArena and EvoMem.

Our benchmark isolates three evolution modes—executable workflow changes, software‑repo updates, and long‑horizon preference shifts—to stress version‑aware capabilities such as interface adaptation, code‑base reasoning, and temporal grounding.

These forms of environment evolution appear across robotics, scientific pipelines, and multi‑agent systems, so extending EvoArena to those domains would let the community probe reliability under physical‑state, protocol, and role changes.

The broader impact of this work is two‑fold: it supplies a realistic stress test for LLM agents and it offers EvoMem, a lightweight patch‑history mechanism that makes memory updates inspectable.

Potential risks include stronger long‑term strategies that could be misused, and privacy concerns because patch histories may retain sensitive user data if not properly pruned.

We therefore recommend strict access controls, data‑minimization policies, and audit trails when deploying EvoMem in production.

**Table 9.** PersonaMem-Evo performance breakdown by reasoning type and difficulty level using GPT-5.5.

**Algorithm 1** Construction of Terminal-Bench-Evo

**Require:** Original Terminal-Bench tasks $\mathcal{X} = \{x_1, \dots, x_{89}\}$; evolution taxonomy $\mathcal{G}$; validation executor $\mathcal{V}$

**Ensure:** Terminal-Bench-Evo benchmark $\mathcal{B}$

1: $\mathcal{B} \leftarrow \emptyset$
2: **for** each original task $x_i \in \mathcal{X}$ **do**
3:   $a_i \leftarrow \text{ANALYZETASK}(x_i)$ $\triangleright$ Extract objective, environment, interfaces, I/O contracts, dependencies, and tests
4:   $\Pi_i \leftarrow \text{CONSTRUCTEVOLUTIONPLAN}(x_i, a_i, \mathcal{G})$
5:   $E_i^{(0)} \leftarrow \text{INITIALIZEENVIRONMENT}(x_i)$
6:   $C_i \leftarrow \emptyset$
7:   **for** $t = 1$ to $|\Pi_i|$ **do**
8:     $\pi_i^{(t)} \leftarrow \Pi_i[t]$
9:     $(I_i^{(t)}, E_i^{(t)}, T_i^{(t)}, M_i^{(t)}) \leftarrow \text{REALIZEVERSION}(E_i^{(t-1)}, \pi_i^{(t)})$ $\triangleright$ Instantiate instruction, environment, tests, and metadata
10:     **if** $\text{VALIDATEVERSION}(I_i^{(t)}, E_i^{(t)}, T_i^{(t)}, \mathcal{V})$ **then**
11:       $v_i^{(t)} \leftarrow (I_i^{(t)}, E_i^{(t)}, T_i^{(t)}, M_i^{(t)})$
12:       $C_i \leftarrow C_i \oplus v_i^{(t)}$
13:     **end if**
14:   **end for**
15:   $C_i \leftarrow \text{VERIFYCHAINCONSISTENCY}(C_i)$
16:   $\mathcal{B} \leftarrow \mathcal{B} \cup \{C_i\}$
17: **end for**
18: **return** $\mathcal{B}$
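The control flow of the Terminal-Bench-Evo construction algorithm can be sketched in Python with the analysis, planning, realization, and validation steps passed in as callables. Everything below is illustrative: the callables are toy stand-ins (the real steps are not specified in this summary), and the function name `build_evolution_chains` is hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Version = Tuple[str, Dict[str, Any], List[str], Dict[str, Any]]  # (I, E, T, M)


def build_evolution_chains(tasks, taxonomy, analyze_task, construct_plan, init_env,
                           realize_version, validate_version, verify_chain):
    benchmark: List[List[Version]] = []
    for task in tasks:
        analysis = analyze_task(task)
        plan = construct_plan(task, analysis, taxonomy)
        env = init_env(task)
        chain: List[Version] = []
        for step in plan:
            # As in the pseudocode, the environment advances even when the
            # realized version is later rejected by validation.
            instruction, env, tests, meta = realize_version(env, step)
            if validate_version(instruction, env, tests):
                chain.append((instruction, env, tests, meta))
        benchmark.append(verify_chain(chain))
    return benchmark


def toy_realize(env, step):
    """Toy stand-in: bump a version counter and attach one test per version."""
    new_env = {"version": env["version"] + 1, "change": step}
    tests = [] if step == "untestable change" else [f"test_{new_env['version']}"]
    return f"Adapt to: {step}", new_env, tests, {"category": step}


benchmark = build_evolution_chains(
    tasks=["task-1"],
    taxonomy=["rename path", "untestable change", "tighten permissions"],
    analyze_task=lambda task: {"objective": task},
    construct_plan=lambda task, analysis, taxonomy: list(taxonomy),
    init_env=lambda task: {"version": 0},
    realize_version=toy_realize,
    validate_version=lambda instruction, env, tests: len(tests) > 0,
    verify_chain=lambda chain: chain,
)
```

In this toy run the second planned change produces no tests and is dropped, so the resulting chain keeps only the first and third versions.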

**Table 10.** Distribution of evolution categories in Terminal-Bench-Evo. Counts are computed over non-initial version updates, excluding the first version of each workflow chain.

**Table 11.** Grouped evolution statistics for Terminal-Bench-Evo. Groups aggregate the fine-grained categories in Table 10.

**Table 12.** Summary statistics of Terminal-Bench-Evo.

Conclusion

Conclusion summarizing contributions and providing the full bibliography.

We introduced EvoArena, a benchmark for evaluating LLM agents in environments where tasks, codebases, workflows, and user preferences evolve over time, and EvoMem, a patch‑based memory paradigm that records updates with context. Experiments show that EvoMem consistently improves performance across EvoArena and standard agent benchmarks, and that it better preserves evidence as the environment evolves.

[1] Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier de Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. Workarena++: Towards compositional planning and reasoning‑based common knowledge work tasks. In The Thirty‑eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

[2] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production‑ready AI agents with scalable long‑term memory. In ECAI 2025 – 28th European Conference on Artificial Intelligence, 25‑30 October 2025, Bologna, Italy – Including 14th Conference on Prestigious Applications of Intelligent Systems (PAIS 2025), pages 2993–3000, 2025.

[3] Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean‑Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Menard, Gerard Moreno‑Torres Bertran, Ulyana Piterbarg, Mikhail Plekhanov, Mathieu Rita, Andrey Rusakov, Vladislav Vorotilov, Mengjue Wang, Ian Yu, Amine Benhalloum, Grégoire Mialon, and Thomas Scialom. Gaia2: Benchmarking LLM agents on dynamic and asynchronous environments. In The Fourteenth International Conference on Learning Representations, 2026.

[4] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024.

[5] Google. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026.

[6] Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. Personamem‑v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv preprint arXiv:2512.06688, 2025.

[7] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE‑bench: Can language models resolve real‑world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024.

[8] LangChain. Langgraph documentation. https://docs.langchain.com/oss/python/langgraph, 2024. Accessed: 2026‑05‑05.

[9] Kenan Li, Rongzhi Li, Linghao Zhang, Qirui Jin, Liao Zhu, Xiaosong Huang, Geng Zhang, Yikai Zhang, Shilin He, Chengxing Xie, et al. Repolaunch: Automating build & test pipeline of code repositories on any language and any platform. arXiv preprint arXiv:2603.05026, 2026.

[10] Shuyue Stella Li, Bhargavi Paranjape, Kerem Oktar, Zhongyao Ma, Gelin Zhou, Lin Guan, Na Zhang, Sem Park, Lin Chen, Diyi Yang, et al. Horizonbench: Long‑horizon personalization with evolving preferences. arXiv preprint arXiv:2604.17283, 2026.

[11] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024.

[12] Adyasha Maharana, Dong‑Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long‑term conversational memory of LLM agents. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 13851–13870, 2024.

[13] Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal‑bench: Benchmarking agents on hard, realistic tasks in command line interfaces. In The Fourteenth International Conference on Learning Representations, 2026.

[14] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2024.

[15] Moonshot AI. Kimi k2.6. https://www.kimi.com/ai-models/kimi-k2-6, 2026.

[16] OpenAI. Introducing gpt‑5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, 2026.

[17] OpenAI. Introducing gpt‑5.5. https://openai.com/index/introducing-gpt-5-5/, 2026.

[18] Qwen Team. Qwen3.6‑27b: Flagship‑level coding in a 27b dense model. https://qwen.ai/blog?id=qwen3.6-27b, 2026.

[19] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary‑beth Fair, Alice Li, William E Bishop, Wei Li, Folawiyo Campbell‑Ajala, Daniel Kenji Toyama, Robert James Berry, Divya Tyamagundlu, Timothy P Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents. In The Thirteenth International Conference on Learning Representations, 2025.

[20] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Thirty‑seventh Conference on Neural Information Processing Systems, 2023.

[21] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open‑ended embodied agent with large language models. Transactions on Machine Learning Research, 2024.

[22] Siyuan Wang, Zhuohan Long, Zhihao Fan, Xuanjing Huang, and Zhongyu Wei. Benchmark self‑evolving: A multi‑agent framework for dynamic LLM evaluation. In Proceedings of the International Conference on Computational Linguistics, pages 3310–3328, 2025.

[23] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Learning Representations, 2025.

[24] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A‑mem: Agentic memory for LLM agents. In The Thirty‑ninth Annual Conference on Neural Information Processing Systems, 2025.

[25] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.

[26] Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento‑skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026.

[27] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

Benchmark Statistics

Detailed dataset statistics and construction algorithms for SWE‑Chain‑Evo and PersonaMem‑Evo.

This appendix records the quantitative properties of the SWE‑Chain‑Evo and PersonaMem‑Evo benchmarks and the procedural algorithms used to build them.

**Table 14.** Summary statistics of SWE-Chain-Evo. Patch statistics are computed from the solution code patch. Changed lines count added plus deleted lines, excluding diff headers.

**Table 15.** Chain-length distribution in SWE-Chain-Evo.

**Table 16.** Five change families used to construct multi-step preference trajectories.

**Table 17.** Scale, question-type, and difficulty statistics for the 10-persona PersonaMem-Evo subset.

**Table 18.** Number of source preferences required per question.

**Table 19.** Source-preference composition and change-family statistics. Counts are mention-level. (a) Source-preference composition. (b) Static and changed preferences.

**Require:** Repository $r$; commit history $H_r$; validation pipeline $\mathcal{V}$

**Ensure:** Milestone set $M_r$

1: $M_r \leftarrow \emptyset$
2: $W_r \leftarrow \text{EXTRACTCONTINUOUSWINDOWS}(H_r)$
3: **for** each update window $w \in W_r$ **do**
4:   $G_w \leftarrow \text{GROUPBYCOMMITSEMANTICS}(w)$
5:   **for** each candidate commit group $g \in G_w$ **do**
6:     $\Delta_g \leftarrow \text{INSPECTCODECHANGES}(g)$
7:     **if** $\neg \text{COHERENTOBJECTIVE}(g, \Delta_g)$ **then**
8:       **continue**
9:     **end if**
10:     $g' \leftarrow \text{FILTERIRRELEVANTCHANGES}(g, \Delta_g)$
11:     **if** $\text{AMBIGUOUSORUNRELIABLE}(g')$ **then**
12:       **continue**
13:     **end if**
14:     $q \leftarrow \text{SYNTHESIZETASKDESCRIPTION}(g')$
15:     **if** $\text{MANUALVERIFY}(q, g')$ and $\text{EXECUTABLEANDSTABLE}(q, g'; \mathcal{V})$ **then**
16:       $M_r \leftarrow M_r \cup \{(q, g')\}$
17:     **end if**
18:   **end for**
19: **end for**
20: **return** $M_r$
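A compact Python sketch of this milestone-extraction loop follows. The filtering predicates are supplied as callables because their real implementations (including the manual verification step) are not described here; the toy commit tuples, the `.py`-based coherence rule, and the name `extract_milestones` are assumptions for illustration. A list is used in place of the set $M_r$ so that unhashable commit groups can be stored.

```python
from typing import Callable, List, Tuple

Commit = Tuple[str, List[str]]  # (message, changed files); a toy representation


def extract_milestones(history, *, extract_windows, group_by_semantics,
                       inspect_changes, coherent, filter_irrelevant, ambiguous,
                       synthesize, manual_verify, executable_and_stable):
    milestones = []
    for window in extract_windows(history):
        for group in group_by_semantics(window):
            delta = inspect_changes(group)
            if not coherent(group, delta):
                continue  # no single coherent objective
            cleaned = filter_irrelevant(group, delta)
            if ambiguous(cleaned):
                continue  # ambiguous or unreliable after filtering
            task = synthesize(cleaned)
            if manual_verify(task, cleaned) and executable_and_stable(task, cleaned):
                milestones.append((task, cleaned))
    return milestones


def by_prefix(window: List[Commit]) -> List[List[Commit]]:
    """Toy grouping: commits sharing a 'type:' prefix form one candidate group."""
    groups = {}
    for commit in window:
        groups.setdefault(commit[0].split(":")[0], []).append(commit)
    return list(groups.values())


history = [
    ("feat: add parser", ["parser.py"]),
    ("feat: parser tests", ["test_parser.py"]),
    ("chore: bump deps", ["requirements.txt"]),
    ("fix: typo", ["README.md"]),
]
milestones = extract_milestones(
    history,
    extract_windows=lambda h: [h],
    group_by_semantics=by_prefix,
    inspect_changes=lambda g: [f for _, files in g for f in files],
    coherent=lambda g, delta: any(f.endswith(".py") for f in delta),
    filter_irrelevant=lambda g, delta: [c for c in g if any(f.endswith(".py") for f in c[1])],
    ambiguous=lambda g: len(g) == 0,
    synthesize=lambda g: "Implement: " + "; ".join(m for m, _ in g),
    manual_verify=lambda q, g: True,
    executable_and_stable=lambda q, g: True,
)
```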

**Algorithm 1: OOD Question Generation with Dual Blind Filtering**

**Require:** Persona $u_i$; long-context history $H_i$; generator $\mathcal{G}$; validator $\mathcal{V}$

**Ensure:** Validated OOD question set $Q_i^{ood}$

1: $Q_i^{ood} \leftarrow \emptyset$
2: Define OOD types $\mathcal{T}$ and complexity levels $\mathcal{D}$
3: **for** each question type $t \in \mathcal{T}$ **do**
4:   **for** each complexity level $d \in \mathcal{D}$ **do**
5:     **for** $r = 1$ to $\text{TARGETCOUNT}(t, d)$ **do**
6:       $q \leftarrow \text{DRAFTOODQUESTION}(u_i, H_i, t, d; \mathcal{G})$
7:       **if** $q = \emptyset$ **then**
8:         **continue**
9:       **end if**
10:       $q \leftarrow \text{BUILDANSWEROPTIONS}(q; \mathcal{G})$
11:       **if** $\neg \text{BALANCEDOPTIONS}(q)$ **then**
12:         **continue**
13:       **end if**
14:       **if** $\text{PERSONABLINDCORRECT}(q, u_i; \mathcal{V})$ **then**
15:         **continue**
16:       **end if**
17:       **if** $\text{NOCONTEXTCORRECT}(q; \mathcal{V})$ **then**
18:         $q \leftarrow \text{ADVERSARIALREWRITE}(q; \mathcal{G})$
19:         **if** $q = \emptyset$ or $\text{NOCONTEXTCORRECT}(q; \mathcal{V})$ **then**
20:           **continue**
21:         **end if**
22:       **end if**
23:       $Q_i^{ood} \leftarrow Q_i^{ood} \cup \{q\}$
24:     **end for**
25:   **end for**
26: **end for**
27: **return** $Q_i^{ood}$
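The dual blind filtering loop can be sketched in Python as below. The generator and validator calls are replaced by toy callables, and the dictionary question format, the type and level names, and the function name `generate_ood_questions` are assumptions for illustration only.

```python
from typing import Any, Callable, Dict, List, Optional

Question = Dict[str, Any]


def generate_ood_questions(persona, history, types, levels, target_count, *,
                           draft, build_options, balanced,
                           persona_blind_correct, no_context_correct,
                           adversarial_rewrite) -> List[Question]:
    accepted: List[Question] = []
    for t in types:
        for d in levels:
            for _ in range(target_count(t, d)):
                q: Optional[Question] = draft(persona, history, t, d)
                if q is None:
                    continue
                q = build_options(q)
                if not balanced(q):
                    continue
                # Blind filter 1: drop questions the validator answers
                # correctly in the persona-blind check.
                if persona_blind_correct(q, persona):
                    continue
                # Blind filter 2: if answerable with no context, try one rewrite.
                if no_context_correct(q):
                    q = adversarial_rewrite(q)
                    if q is None or no_context_correct(q):
                        continue
                accepted.append(q)
    return accepted


questions = generate_ood_questions(
    persona="persona-1", history=["..."],
    types=["temporal", "generic"], levels=["easy", "hard"],
    target_count=lambda t, d: 1,
    draft=lambda u, h, t, d: {"type": t, "level": d, "guessable": d == "easy"},
    build_options=lambda q: {**q, "options": ["A", "B", "C", "D"]},
    balanced=lambda q: len(q["options"]) == 4,
    persona_blind_correct=lambda q, u: q["type"] == "generic",
    no_context_correct=lambda q: q["guessable"],
    adversarial_rewrite=lambda q: {**q, "guessable": False},
)
```

In the toy run the "generic" questions fail the persona-blind filter, and the guessable "temporal" question survives only because the rewrite makes it unguessable.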

Questions & answers

What are EvoArena and EvoMem, and what is their main contribution?

EvoArena is a benchmark that evaluates LLM agents across chains of evolving environments (terminal workflows, software codebases, and social-preference settings), and EvoMem is a patch-based memory paradigm that augments standard agent memory with an append-only history of versioned updates. Together they address the gap left by static benchmarks that cannot assess whether agents can track what changed, what remains valid, and how to adapt across releases.

What problem does this paper address and why does it matter?

The paper addresses the failure of LLM agents when real-world environments—such as software codebases or terminal workflows—evolve over time, while current evaluations only use static snapshots. Because agents that overwrite memory with the latest state lose the context of prior versions, they cannot reason about rollbacks, version-specific rules, or why a change occurred.

Why is 'latest-state' memory insufficient for agents in dynamic environments?

In dynamic environments, knowledge is version-dependent: overwriting old memory with the newest state erases the rationale behind a change, preventing the agent from recovering prior behaviors that may still be valid for different versions or rollbacks. EvoMem retains this context by storing the pre-update state, post-update state, rationale, and triggering evidence as explicit patches.

How does EvoMem work technically?

EvoMem acts as a non-invasive wrapper around an agent's existing memory updater, monitoring memory transitions and creating an append-only patch history. Each patch records the overwritten fields, the post-update state, the rationale for the change, and supporting evidence, which can later be selectively retrieved when a query depends on a historical version.

How is EvoArena different from existing agent benchmarks?

EvoArena evaluates persistent environment evolution by chaining releases of terminal, software-repo, and social-preference environments, requiring agents to track what changed, what remains valid, and how to adapt to new releases. Existing benchmarks either treat environments as static or only refresh tasks without modeling version-to-version continuity.

What benchmarks and datasets are used in the experiments?

The paper evaluates on five benchmarks: Terminal-Bench-Evo, SWE-Chain-Evo, PersonaMem-Evo, GAIA, and LoCoMo. EvoArena itself comprises three subsets whose statistics (chain lengths, difficulty breakdowns, and test counts) are reported in Table 2 of the paper.

Which agent architectures does EvoMem augment?

EvoMem augments four distinct agents: Terminus2, OpenHands, Memento-Skill, and A-Mem, storing compact sanitized transition patches and retrieving only relevant patches to guide subsequent tasks while preserving each agent's original control loop.

What are the key quantitative results?

On EvoArena, EvoMem yields average gains of 1.5% in step-level accuracy and 3.7% in chain-level accuracy. Per benchmark, the step-level gains are +2.4% (Terminal-Bench-Evo), +0.5% (SWE-Chain-Evo), and +1.8% (PersonaMem-Evo), and the chain-level gains are +6.1%, +2.9%, and +3.0% respectively. On standard benchmarks, EvoMem improves GAIA by 6.1% and LoCoMo by 4.8%.

What do the ablation studies reveal about EvoMem?

Ablations show that EvoMem is most beneficial when retrieved patches are actually incorporated into reasoning or concrete commands, and that preserving coherent versioned evidence translates into measurable gains across diverse tasks. The ablations isolate the effect of each memory-related component.

What are the limitations of EvoArena and EvoMem?

EvoArena currently isolates only three evolution modes—executable workflow changes, software-repo updates, and long-horizon preference shifts—and has not been extended to robotics, scientific pipelines, or multi-agent systems. The paper also notes privacy risks because patch histories may retain sensitive user data if not properly pruned, and recommends strict access controls, data-minimization policies, and audit trails for production deployment.

How does EvoMem compare to prior memory approaches?

Unlike standard memory systems that consolidate updates into a single mutable 'latest' state, EvoMem treats memory as a git-like evolution trace by storing non-additive changes as explicit patches. The paper positions EvoMem as compatible with diverse existing architectures (e.g., Terminus2, OpenHands, A-Mem) rather than replacing them.

What types of environment evolution does EvoArena cover?

EvoArena covers three evolution modes: executable workflow changes (terminal-based tasks), software-repository updates (code-base reasoning), and long-horizon preference shifts (social-preference environments). The paper notes that analogous forms of evolution appear in robotics, scientific pipelines, and multi-agent systems, which are identified as future directions.

What are the broader impacts and risks identified by the paper?

The paper identifies two positive impacts: supplying a realistic stress test for LLM agents and offering EvoMem as a lightweight, inspectable patch-history mechanism. Risks include potential misuse of stronger long-term strategies and privacy concerns from patch histories retaining sensitive user data.

How can EvoMem be reproduced or applied to a new agent?

EvoMem is implemented as a non-invasive wrapper that monitors memory transitions without modifying the base agent's control loop, making it applicable to diverse architectures. The paper specifies experimental settings for five benchmarks—including environment snapshots, memory-context budgets, and retrieval configurations—to ensure fair comparison between baseline agents and EvoMem-enhanced counterparts.
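One way to picture the non-invasive wrapper is a thin class that calls the base agent's own update function unchanged and only observes the state before and after. This is a hedged sketch under strong simplifications: `PatchTrackingMemory`, `latest_state_update`, the dictionary memory, and the example paths are all hypothetical, and a real integration would depend on the base agent's memory interface.

```python
import copy
from typing import Any, Callable, Dict, List

Memory = Dict[str, Any]


class PatchTrackingMemory:
    """Wraps a base updater; the updater itself is called without modification."""

    def __init__(self, base_update: Callable[[Memory, Memory], Memory]) -> None:
        self._base_update = base_update
        self.memory: Memory = {}
        self.patches: List[Dict[str, Any]] = []  # append-only by convention

    def update(self, observation: Memory, rationale: str = "") -> None:
        before = copy.deepcopy(self.memory)
        self.memory = self._base_update(self.memory, observation)
        # Only non-additive transitions (overwrites or removals) become patches.
        overwritten = {k: v for k, v in before.items()
                       if k not in self.memory or self.memory[k] != v}
        if overwritten:
            self.patches.append({
                "before": overwritten,
                "after": {k: self.memory.get(k) for k in overwritten},
                "rationale": rationale,
                "evidence": copy.deepcopy(observation),
            })


def latest_state_update(memory: Memory, observation: Memory) -> Memory:
    """Toy base updater: newest observation simply overwrites older values."""
    merged = dict(memory)
    merged.update(observation)
    return merged


agent_memory = PatchTrackingMemory(latest_state_update)
agent_memory.update({"deploy_path": "/opt/app"})
agent_memory.update({"deploy_path": "/srv/app"},
                    rationale="deployments moved to /srv")
```

After the second update the latest memory holds only `/srv/app`, while the patch list still records that `/opt/app` was the earlier value and why it changed.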

Where was this paper published and who are the authors?

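The authors are Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, and Zhiyuan Hu. No publication venue is stated in the provided content; the paper is available on arXiv at https://arxiv.org/abs/2606.13681.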

Key terms

EvoArena: A benchmark that evaluates LLM agents in environments that persistently evolve across a chain of versions, covering terminal workflows, software repositories, and social-preference settings.
EvoMem: A version-aware memory paradigm that augments standard agent memory with an append-only patch history recording pre- and post-update states, rationales, and supporting evidence for each memory change.
patch history: An append-only log of discrete memory updates, each capturing what information was overwritten, what replaced it, why the change occurred, and what evidence triggered it.
chain-level accuracy: A performance metric that credits an agent for solving full evolution chains (sustained reliability), in contrast to step-level accuracy, which credits each individually evolved task (local adaptation).
latest-state memory: A conventional memory design in which each update overwrites the previous state, retaining only the most recent version of stored information.
non-additive change: A memory update that replaces or removes existing information rather than simply appending new information, making the prior state irrecoverable without explicit versioning.
version-aware memory: A memory system that tracks the history of how stored information has changed over time, allowing an agent to reason about which version of a rule or fact was in effect at a given point.
append-only patch: A record added to memory history that is never deleted or overwritten, preserving a complete audit trail of all memory transitions.
Terminus2: One of the LLM agent architectures augmented by EvoMem in the paper's experiments, operating in terminal-based task environments.
OpenHands: An open platform for AI software developer agents, used as one of the base architectures augmented by EvoMem in the experiments.
A-Mem: An agentic memory system for LLM agents, used as one of the base architectures augmented by EvoMem in the experiments.
Memento-Skill: An agent architecture (also referred to as Memento-Skills in the references) used as one of the four base systems augmented by EvoMem.
SWE-Chain-Evo: An EvoArena subset benchmark derived from software-engineering tasks, where agents must reason about evolving code repositories across a chain of versions.
PersonaMem-Evo: An EvoArena subset benchmark focused on long-horizon shifts in user preferences, requiring agents to track and adapt to evolving personal preferences over time.
Terminal-Bench-Evo: An EvoArena subset benchmark that evaluates agents on hard, realistic command-line interface tasks that change across sequential environment releases.
temporal grounding: The ability of an agent to correctly associate a piece of information or a rule with the specific time period or version in which it was valid.
retrieval configuration: The settings that determine which patches from the memory history are fetched and provided to the agent when answering a query, balancing relevance against memory-context budget constraints.
