OpenSkill: Open-World Self-Evolution for LLM Agents

Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun

OpenSkill bootstraps agent self-evolution by synthesizing skills and verification signals from open-world knowledge.

How can an LLM agent acquire and refine complex skills from open-world resources without any human-provided rewards, demonstrations, or ground-truth verifiers?

LLM agents often fail to improve after deployment because they lack the curated skills, successful trajectories, or feedback signals required to learn from their own mistakes. OpenSkill solves this by treating the open world as a source for both knowledge and supervision: it retrieves external documentation to draft skills and generates deterministic virtual tests to verify them, all without accessing hidden target-task answers. Across three benchmarks, this approach achieves the best automated pass rate, with skills that transfer across models without retraining.

Paper Primer

The pipeline operates in three stages: it first retrieves domain-specific knowledge and verification anchors from the web, then iteratively refines skills against a self-built virtual test suite, and finally deploys the resulting artifacts zero-shot. The core move is the use of an isolated "Independent Verifier" that generates deterministic pytest assertions based on independently verifiable facts, creating a supervision-free practice environment.

OpenSkill achieves the highest automated pass rate across all tested benchmarks and target agents.

On SkillsBench, OpenSkill outperformed the strongest closed-world baseline by 8.9 points with Opus 4.6 (Skill-Creator) and by 8.8 points with GPT 5.2 (CoT).

Generated skills are model-agnostic and transfer effectively to weaker models without adaptation.

Skills produced by Opus 4.6 improved performance on four smaller models (e.g., Haiku 4.5, Mistral Large 3) by 5.5 to 14.8 percentage points, with consistent gains across all four target models.

Why is this approach better than simply asking the model to generate its own skills?

Standard self-generation relies on the model's internal parametric knowledge, which is often insufficient for niche domains or up-to-date APIs. OpenSkill grounds skill construction in external, retrieved evidence and provides a verification loop that prevents the model from hallucinating its own success.

What is the scope of this framework — does it work for any task?

It is designed for tasks where skill quality is the limiting factor. It is less effective for tasks requiring deep semantic quality properties beyond what the task specification provides, or for benchmark-specific meta-validation checks that require access to the hidden test infrastructure.

Introduction to OpenSkill

We introduce Open‑World Self‑Evolution, letting LLM agents learn skills and verification without human supervision.

Existing self‑evolving agents assume a ready learning loop—curated skills, successful trajectories, or verifier signals—but real open‑world deployments often provide only a raw task prompt.

An LLM agent starts from a single task prompt and, using only open‑world resources, builds both the skills it needs and the verification signals that let it improve those skills without any target‑task supervision.

How does Open‑World Self‑Evolution differ from supervised self‑evolution?

Supervised self‑evolution relies on external feedback such as rewards or human labels, while Open‑World Self‑Evolution constructs its own verification tasks from open‑world evidence, removing any need for external supervision during learning.

**Figure 1. Paradigms for self-evolving agent skills.** Unlike human-curated, LLM-generated, or supervised self-evolution paradigms, OpenSkill (ours) acquires skills from the open world and verifies them with self-built virtual tasks, making it simultaneously scalable, grounded, and supervision-free.

The shift from human‑supervised pipelines to autonomous, open‑world driven skill acquisition is the core contribution of OpenSkill.

The OpenSkill Framework

OpenSkill builds skills from open‑world knowledge and self‑generated tests without any ground‑truth supervision.

Without any ground‑truth test suite or human feedback, an LLM agent must bootstrap a skill from only a task prompt and publicly available resources. This supervision‑free setting makes skill construction extremely brittle: the agent sees only the instruction $I_i$ and the environment $E_i$, yet must produce a working solution.

The Virtual Verifier is a self‑contained test generator that turns publicly verifiable facts into deterministic assertions, letting the agent gauge skill quality without ever seeing the hidden ground‑truth suite.

How does a Virtual Verifier differ from a conventional test suite that a developer writes?

A conventional suite is hand‑crafted against a known solution and may encode hidden expectations; the Virtual Verifier automatically derives checks from publicly documented facts, guaranteeing that no target‑specific answer is leaked into the test.

Generate an initial skill set $\hat{S}^{(0)}$ from the Stage 1 plan $p_i$.

Construct the Virtual Verifier suite $\tilde{T}_i$ using verification knowledge $k_i^{v}$.

Evaluate $\hat{S}^{(j)}$ against $\tilde{T}_i$ to obtain the virtual pass rate $\tilde{r}^{(j)}$.

If $\tilde{r}^{(j)} < 1$, collect the failure diagnostic $F^{(j)}$ and feed it back to the agent.

If the diagnostic signals a knowledge gap, retrieve additional documents $k_{\text{gap}} = D(F^{(j)}, K)$ and augment the context.

Update the skill set to $\hat{S}^{(j+1)}$ using the agent $\pi_\theta$ conditioned on $I_i$, $E_i$, $p_i$, $k_i$, and $F^{(j)}$.

Repeat until $\tilde{r}^{(j)} = 1$ or the maximum round $J$ is reached.

Iterative refinement of a skill set using the Virtual Verifier.
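The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: a skill is reduced to a string, each virtual test is a named boolean check, and `update`, `is_knowledge_gap`, and `retrieve` are hypothetical callables standing in for the agent $\pi_\theta$, the diagnostic classifier, and the retriever $D$. The sketch also assumes that $J$ bounds the number of updates, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Skill = str  # a skill artifact, reduced to plain text for this sketch
VirtualTest = Tuple[str, Callable[[Skill], bool]]  # (name, deterministic check)


@dataclass
class RoundResult:
    pass_rate: float     # virtual pass rate for the current round
    failures: List[str]  # failure diagnostic: names of the failing tests


def evaluate(skill: Skill, suite: List[VirtualTest]) -> RoundResult:
    """Run every virtual test and record which ones fail."""
    if not suite:
        raise ValueError("virtual test suite must not be empty")
    failures = [name for name, check in suite if not check(skill)]
    return RoundResult((len(suite) - len(failures)) / len(suite), failures)


def refine(initial_skill, suite, update, is_knowledge_gap, retrieve, max_rounds):
    """Refine until the pass rate reaches 1 or max_rounds updates are used."""
    skill, context = initial_skill, []
    result = evaluate(skill, suite)
    history = [result.pass_rate]
    rounds = 0
    while result.pass_rate < 1.0 and rounds < max_rounds:
        if is_knowledge_gap(result.failures):
            context.extend(retrieve(result.failures))  # extra documents (k_gap)
        skill = update(skill, result.failures, context)  # stand-in for the agent
        result = evaluate(skill, suite)
        history.append(result.pass_rate)
        rounds += 1
    return skill, history


# Toy run: each check looks for a token, and the toy "agent" fixes one failure per round.
toy_suite = [
    ("load_csv", lambda s: "load_csv" in s),
    ("count_rows", lambda s: "count_rows" in s),
]


def toy_update(skill, failures, context):
    return skill + " " + failures[0]


final_skill, history = refine(
    "", toy_suite, toy_update,
    is_knowledge_gap=lambda failures: False,
    retrieve=lambda failures: [],
    max_rounds=5,
)
print(history)
```

In the toy run the pass rate rises by one fixed check per round, and the loop stops as soon as every virtual test passes rather than using all five rounds.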

The verifier reads an independently verifiable fact, here that “iris.csv” contains 150 rows, and emits the corresponding assertion $\tilde{t}$.

The current skill attempts to load “iris.csv” and counts rows, producing $c = 150$.

The test evaluates $c == 150$ → passes (value 1).

The virtual pass rate $\tilde{r}$ becomes 1.0, so the loop terminates.

This example shows how a deterministic, externally anchored check can drive skill refinement without any hidden ground‑truth label.
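A check of this kind might look like the following pytest-style sketch. The function names are hypothetical, and because the real file is not bundled here the sketch writes a 150-row placeholder “iris.csv” with dummy values. Only the row count, which is a publicly documented property of the Iris dataset, is treated as the external anchor.

```python
import csv
import tempfile
from pathlib import Path

EXPECTED_ROWS = 150  # publicly documented number of samples in the Iris dataset


def count_rows(csv_path: Path) -> int:
    """Hypothetical skill output: number of data rows, excluding the header."""
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def make_placeholder_iris(directory: Path) -> Path:
    """Write a 150-row stand-in file; the values are dummies, not real measurements."""
    path = directory / "iris.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sepal_length", "sepal_width", "petal_length", "petal_width", "species"])
        for i in range(EXPECTED_ROWS):
            writer.writerow([0.0, 0.0, 0.0, 0.0, f"placeholder_{i}"])
    return path


def test_iris_row_count(tmp_path: Path) -> None:
    """Virtual test: a deterministic assertion anchored on the external fact."""
    csv_path = make_placeholder_iris(tmp_path)
    assert count_rows(csv_path) == EXPECTED_ROWS


# Direct call outside pytest, using a temporary directory in place of the tmp_path fixture.
with tempfile.TemporaryDirectory() as scratch:
    test_iris_row_count(Path(scratch))
    observed = count_rows(Path(scratch) / "iris.csv")
```

The assertion never consults a hidden answer key: it would pass or fail identically for any skill that loads the file, which is what makes the check deterministic.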

**Figure 2.** Overview of the OpenSkill framework. A base agent acquires open-world knowledge from external resources to build a skill plan, then iteratively generates, executes, and refines the skill in a sandbox, using a virtual-task verifier and diagnostic retriever to fix bugs and knowledge gaps. A leakage barrier keeps target supervision out of skill construction, unlocking it only for final evaluation.

After the loop finishes, the final skill $\hat{S}^*$ is handed to any target agent $\pi_{\theta'}$ and executed zero‑shot, yielding a pass/fail outcome on the hidden ground‑truth tests.

Experimental Results

OpenSkill delivers the highest pass rates across all evaluated benchmarks.

OpenSkill attains the highest overall pass rate on SkillsBench for Opus 4.6, reaching 43.6 %.

Table 1 shows Opus 4.6 achieving 43.6 % overall, surpassing the next‑best baseline (Skill‑Creator) by 8.9 points.

OpenSkill also leads for GPT 5.2, attaining an overall pass rate of 42.1 %.

Table 1 records GPT 5.2 at 42.1 %, outpacing the strongest baseline (CoT) by 8.8 points.

On SocialMaze, OpenSkill reaches 82.7 % for Opus 4.6 and 70.7 % for GPT 5.2.

Table 2 lists these values as the highest among all automated methods.

On ScienceWorld, OpenSkill attains 90.0 % (Opus 4.6) and 85.3 % (GPT 5.2), the top scores across all methods.

Table 2 shows OpenSkill outperforming the next best automated method by 1.3–2.2 points.

SkillsBench aggregates 11 distinct task domains where the quality of a learned skill directly determines success.

SocialMaze tests an agent’s ability to reason about multi‑agent interactions and social conventions.

ScienceWorld evaluates interactive scientific problem solving in a simulated lab environment.

**Table 1.** Main results on SkillsBench (11 domains): average reward (pass rate, %) by domain for two target agents. Best automated method per row in bold, second best underlined; the OPENSKILL column is shaded. Human is a reference upper bound (set off on the right) and is excluded from the best-method comparison. The $\Delta$ vs. No Skill row gives each method's overall pass-rate gain over the No-Skill floor (in points); negative values fall below it.

**Table 2.** Average reward (%) on SocialMaze and ScienceWorld for both target agents. Best automated score per column in bold, second best underlined; “–” marks methods not run on a benchmark.

OpenSkill consistently outperforms parametric‑only baselines across diverse domains.

Performance Analysis and Ablations

Ablation analysis quantifies skill transfer, virtual verifier fidelity, and component impacts.

We address three research questions: (RQ1) whether generated skills transfer across models, (RQ2) how well the Virtual Verifier’s proxy tests align with ground‑truth outcomes, and (RQ3) the contribution of each design component.

OpenSkill transfers skills across four target models, boosting average reward by up to +14.8 pp over the no‑skill baseline.

Skill libraries produced by Opus 4.6 were applied unchanged to Haiku 4.5, Qwen 3 Coder, DeepSeek V3, and Mistral Large 3 and evaluated on SkillsBench.

**Figure 3.** Average reward (%) when transferring Opus 4.6-generated skills to other models on SkillsBench.

The Virtual Verifier’s proxy tests agree with ground‑truth outcomes 60.7 % of the time.

Table 3 reports 56.9 % precision, 80.5 % recall, and a statistically significant association (OR = 2.97, p = 0.035; point‑biserial r = 0.242, p = 0.027).

**Table 3.** Percentage distribution of proxy results and GT rewards ($N = 84$).
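The reported alignment statistics can be reproduced from a single 2×2 confusion table. In the sketch below, the false-positive and false-negative counts (25 and 8) are the ones given in the failure-mode analysis; the true-positive and true-negative counts (33 and 18) are not stated in the text and are inferred here as the only values consistent with $N = 84$ and the reported percentages. The significance tests are not recomputed.

```python
# Confusion counts for the N = 84 proxy / ground-truth pairs.
# FP and FN come from the failure-mode analysis; TP and TN are inferred (assumption).
TP, FP, FN, TN = 33, 25, 8, 18

n = TP + FP + FN + TN
agreement = (TP + TN) / n           # proxy verdict matches ground truth
precision = TP / (TP + FP)          # proxy pass that is also a ground-truth pass
recall = TP / (TP + FN)             # ground-truth pass that the proxy also passes
odds_ratio = (TP * TN) / (FP * FN)  # association between proxy and ground truth

print(f"N = {n}")
print(f"agreement  = {agreement:.1%}")
print(f"precision  = {precision:.1%}")
print(f"recall     = {recall:.1%}")
print(f"odds ratio = {odds_ratio:.2f}")
```

High recall with moderate precision means the proxy rarely rejects a skill that would pass the hidden suite, but it accepts a fair number that would not.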

Virtual Verifier covers 88.9 % of human‑authored test intents while generating 3.4× more test functions per task.

Semantic matching on 15 sampled tasks found that 120 of 135 human‑authored test intents (88.9 %) were addressed, with a median of 3.4× more test functions and 15.3 additional assertions per task.

**Figure 4.** Ablations on SOCIALMAZE (Opus 4.6). (a) Reward peaks at a few refinement iterations and degrades with more, indicating overfitting to virtual feedback. (b) Open-world query (DR) and the virtual verifier (VV) each improve over the parametric-only baseline and are largely complementary, with the combination performing best.

Information Isolation Audit

We audit how OpenSkill keeps the Virtual Verifier from seeing any ground‑truth data.

OpenSkill lets LLM agents acquire new abilities by pulling open‑world information and crafting their own virtual verification tasks, so no human‑provided rewards or ground‑truth labels are needed.

Isolation guarantees that the Virtual Verifier can only use the task description and the environment files; it never sees the true solution or the author’s test suite.

Code‑level isolation is enforced by the agent’s API: the base agent’s function signature accepts only the task instruction and the environment directory, explicitly omitting any solution or test‑path arguments.
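One way to picture this kind of signature-level isolation is sketched below. The function name `run_base_agent`, its body, and the list of forbidden parameter names are all hypothetical; the source states only that the signature accepts the task instruction and the environment directory.

```python
import inspect
from pathlib import Path

# Hypothetical parameter names that would indicate leaked supervision.
FORBIDDEN_PARAMS = {"solution", "solution_dir", "tests_dir", "test_path"}


def run_base_agent(instruction: str, environment_dir: Path) -> str:
    """Hypothetical entry point: only the instruction and environment are visible."""
    files = []
    if environment_dir.is_dir():
        files = sorted(p.name for p in environment_dir.iterdir())
    return f"instruction={instruction!r}; environment files={files}"


def leaked_parameters(func) -> set:
    """Return any forbidden parameter names present in the function signature."""
    return FORBIDDEN_PARAMS & set(inspect.signature(func).parameters)


# The entry point exposes no solution or test arguments.
clean = leaked_parameters(run_base_agent)

# Passing a test path is rejected by Python itself, before any agent code runs.
try:
    run_base_agent("count the rows", Path("."), test_path="tests/")
    rejected = False
except TypeError:
    rejected = True
```

Because the restriction lives in the signature, a caller cannot hand ground-truth paths to the agent by accident; doing so fails immediately with a `TypeError`.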

Container‑level isolation runs each task inside a Docker image built from the provided Dockerfile; the host‑side tests/ and solution/ directories are never mounted into the container, so the verifier cannot read them.

The OpenSkill evolution loop overrides the parent class’s ground‑truth oracle, guaranteeing that the oracle is never called during skill creation; the loop stops as soon as the surrogate test passes.

Log‑level verification scans the `evolution_run_log.json` files and finds zero references to ground‑truth test files; any GT‑related entries appear only in post‑hoc evaluation fields that are never fed back to an agent.
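A log audit of this kind could be sketched as a recursive scan over each JSON log. The file name `evolution_run_log.json` is from the source; the log schema, the marker strings `tests/` and `solution/`, the post-hoc field name `final_evaluation`, and the sample file name `tests/test_outputs.py` are assumptions made for illustration.

```python
import json
from pathlib import Path

GT_MARKERS = ("tests/", "solution/")    # assumed path fragments of ground-truth files
POST_HOC_FIELDS = {"final_evaluation"}  # assumed name of the post-hoc evaluation field


def find_gt_references(node, path=()):
    """Return JSON paths of string values that mention ground-truth files.

    Post-hoc evaluation fields are skipped, since they are never fed back to an agent.
    """
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in POST_HOC_FIELDS:
                continue
            hits += find_gt_references(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_gt_references(value, path + (str(index),))
    elif isinstance(node, str) and any(marker in node for marker in GT_MARKERS):
        hits.append("/".join(path))
    return hits


def scan_logs(root: Path) -> dict:
    """Scan every evolution_run_log.json under root; empty lists mean no leakage found."""
    return {
        str(log_path): find_gt_references(json.loads(log_path.read_text()))
        for log_path in sorted(root.rglob("evolution_run_log.json"))
    }


# Toy log: the only ground-truth mention sits in the post-hoc field, so nothing is flagged.
sample_log = {
    "rounds": [{"diagnostic": "virtual test test_row_count failed"}],
    "final_evaluation": {"gt_tests": "tests/test_outputs.py"},
}
print(find_gt_references(sample_log))
```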

Prompt‑level enforcement adds explicit instructions to both the surrogate‑writer and the independent verifier prompts: “You must ONLY use information from the task instruction and environment files. You have NO access to the solution or ground‑truth tests.”

G.1 Self‑Generated Skills Prompt replicates the baseline condition of SkillsBench: the agent first writes 1–5 modular skill markdown files under environment/skills/ before attempting the main task.

G.2 CoT‑Guided Self‑Generation Prompt adds a five‑step chain‑of‑thought workflow (analysis, architecture design, skill writing with YAML front‑matter, self‑verification, execution) but still provides no external verification, achieving only a 30.7 % pass rate.

Related Work and Conclusion

We place OpenSkill among prior self‑evolving agents, then recap its results and limits.

Prior work on self‑evolving agents has explored tool use, reflective planning, and skill distillation, but most approaches embed skills directly in model weights, making them opaque and hard to transfer. Open‑World Retrieval (OW) has been used to fetch external evidence for single‑task queries, yet OpenSkill treats OW as a continual source of reusable skill material. Moreover, existing self‑verification signals rely on model priors or target‑task feedback, whereas OpenSkill anchors verification in independently retrieved facts.

**Table 4.** Capability comparison of the automated methods. OW retr.: acquires open-world knowledge beyond parametric/experience memory; Refine: iteratively refines skills; SF verif.: builds a supervision-free verification signal (no target-task feedback); Artifact: produces an explicit, model-transferable skill. SKILLNET (SCIENCEWORLD only) is omitted.

On the SocialMaze benchmark, OpenSkill achieves the highest average reward for both target agents (Opus 4.6 and GPT 5.2). The per‑subtask gains reflect the method’s ability to synthesize transferable skills that generalize beyond the specific agent that generated them.

**Table 6.** Per-subtask reward (%) on SocialMaze for the Opus 4.6 and GPT 5.2 target agents. OPENSKILL rows are shaded; best average per target agent in bold.

In summary, OpenSkill demonstrates that agents can acquire open‑world knowledge, construct self‑generated verification tasks, and refine explicit skill artifacts without any target‑task supervision, achieving state‑of‑the‑art pass rates on three benchmarks and transferring skills to weaker models.

Nevertheless, open‑world self‑evolution brings new challenges: web sources can be noisy or outdated, virtual tasks may under‑ or over‑estimate real task difficulty, and the added retrieval and verification steps increase computational cost and latency.

Experimental Configuration

Experimental setup: models, benchmarks, baselines, metrics, and hyperparameters.

The OpenSkill pipeline instantiates all LLM roles with either Anthropic Claude (claude‑opus‑4‑6) or OpenAI GPT (gpt‑5.2). Retrieval components use Google Gemini: the knowledge‑acquisition role $D$ calls the Deep‑Research agent, while verification‑knowledge $D_v$ and diagnostic‑driven retrieval both rely on the gemini‑3.1‑flash‑lite model. Zero‑shot evaluation employs Claude Opus for Opus 4.6, Codex for GPT‑5.2, and the terminus‑2 agent for any remaining LLMs.

We evaluate on three agentic benchmarks. SkillsBench covers eleven domains (Software, Office, Science, Media, Cybersecurity, Finance, Robotics, Energy, Manufacturing, Health, Math) and isolates skill quality as the performance bottleneck. SocialMaze provides six social‑reasoning tasks (FTS, HRD, REFT, RDP, SGA, UPI). ScienceWorld is an interactive science‑experiment environment. For every benchmark the hidden ground‑truth test suite $T_{GT}$ is consulted only at final evaluation.

We compare OpenSkill against seven automated baselines and a human‑curated upper bound. “No Skill” runs the target agent without any skill artifact, establishing a zero‑knowledge floor. “Self‑Gen” lets the agent author 1–5 SKILL.md files in a single forward pass, with no retrieval or verification. “CoT” adds a five‑step chain‑of‑thought prompt to the same self‑generation setting. “Skill‑Creator” (Claude Code) runs a Draft→Test→Review→Improve loop but relies only on parametric knowledge. “AutoSkill” abstracts skills from interaction traces into a hierarchical bank. “Memento” stores reusable markdown skills and refines them via a Read–Write Reflective Learning cycle. “SkillNet” scores skills on safety, completeness, executability, maintainability, and cost‑awareness (evaluated only on ScienceWorld). Human‑curated skills serve as the reference upper bound. All third‑party methods use the same claude‑opus‑4‑6 backbone and hidden‑test protocol to isolate the effect of the skill‑construction mechanism.

Metrics are reported uniformly as “reward”. For SkillsBench we report the average pass rate over the hidden test suite (Eq. 6), averaging per‑domain scores across five zero‑shot evaluation runs. The virtual‑verifier analysis adds alignment statistics (precision, recall, agreement) and statistical tests (Fisher’s exact, point‑biserial correlation) plus coverage of ground‑truth intents. SocialMaze accuracy is the macro‑average percentage of correctly solved sub‑tasks. ScienceWorld uses the simulator’s 0–100 completion score, averaged over all task variations.

**Table 1.** Experimental settings for the research agent framework.

Additional hyperparameters control the refinement loop and retrieval behavior. The verifier is allowed up to 20 episodes and the creator up to 60, while diagnostic‑driven retrieval uses the same gemini‑3.1‑fl classifier with temperature 0.1 and a 200‑token cap. The system permits at most three searches per task, three idle episodes before termination, and a single host intervention per skill‑generation episode.
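For readers who want these limits in one place, they can be collected into a small configuration object. The class and field names below are hypothetical; only the numeric values come from the text, and the model identifier is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefinementConfig:
    """Limits quoted in the text; all field names are illustrative."""
    verifier_max_episodes: int = 20
    creator_max_episodes: int = 60
    classifier_temperature: float = 0.1
    classifier_max_tokens: int = 200
    max_searches_per_task: int = 3
    max_idle_episodes: int = 3
    max_host_interventions: int = 1  # per skill-generation episode


def can_search(config: RefinementConfig, searches_used: int) -> bool:
    """True while the per-task search budget has not been exhausted."""
    return searches_used < config.max_searches_per_task


config = RefinementConfig()
print(can_search(config, 2), can_search(config, 3))
```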

Implementation Details

Implementation details for the Open‑World retrieval, virtual verification, and iterative refinement pipeline.

The pipeline runs on a host machine that launches a per‑task Docker container; inside the container the skill creator and the virtual verifier interact to produce and validate new abilities.

The planner asks an external research engine for API documentation and code examples without revealing the target benchmark.

A second, orthogonal search retrieves concrete, checkable facts that the virtual verifier will later use as ground truth.

An isolated verifier creates a deterministic pytest suite that checks the skill’s outputs without seeing the skill’s reasoning.

Iterative refinement loop for skill creation (high‑level pseudocode).

A lightweight classifier decides whether a failed assertion is due to a simple implementation bug or a missing piece of domain knowledge.

The last refined skill set $\hat{S}^*$ is exported directly; no best‑of‑N snapshot is taken.

Failure‑mode analysis of the virtual verifier (Section C) reveals 33 disagreements with the ground‑truth evaluator: 25 false positives (high‑accuracy near‑misses, partial correctness, and genuine misalignment) and 8 false negatives (near‑pass cases and infrastructure failures).

Table 6 (Section D) breaks down SocialMaze performance per subtask. OpenSkill leads overall (82.7 % on Opus, 70.7 % on GPT) mainly because it excels on the harder reasoning subtasks (REFT, UPI, RDP), while gains on easier subtasks are modest.

Computational Cost Analysis

Quantifies the token and wall‑clock resources required by OpenSkill.

The OpenSkill pipeline splits into a skill‑creation phase (Stages 1–4) and an evaluation phase (Stage 5). Creation generates the skill once; evaluation runs the skill repeatedly on downstream agents.

Because skill creation is performed once, downstream agents incur essentially zero additional creation cost; only the evaluation stage scales with the number of runs.
