LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

Q: What is the main contribution of LatentSkill?

LatentSkill introduces a hypernetwork-based compiler that translates procedural textual skill documents into LoRA adapters, replacing in-context skill injection with in-weight latent skills mounted on a frozen backbone LLM, cutting prefill token overhead by up to 72% and improving agent task success rates.

Q: What problem does LatentSkill address?

LLM agents currently inject textual skill descriptions directly into prompts, which inflates context size, consumes valuable context slots, and exposes proprietary procedures to potential leakage or adversarial attacks. LatentSkill eliminates this by encoding skills into LoRA adapter weights instead.

Q: Why is moving skills from the prompt into LoRA weights beneficial?

Encoding skills as LoRA adapters removes repeated skill tokens from the prompt, reduces exposure risk for proprietary procedures, restores modularity for updating and composing skills, and hardens the agent against both accidental text perturbations and adversarial prompt manipulations.

Q: How does the LatentSkill compiler work technically?

The compiler is a hypernetwork G_φ that takes a skill document as input and outputs LoRA adapter weight deltas, which are then mounted on the frozen backbone LLM. At inference, skills are compiled once, cached, and selected per task without appearing in the prompt.

Q: What is the two-stage training process used by LatentSkill?

First, the hypernetwork is pretrained on approximately 171K deduplicated skill documents (~300M tokens) to learn to map procedural text to usable adapter weights, with the backbone LLM frozen. Second, trajectory-supervised fine-tuning (SFT) refines the compiler using 237 ALFWorld and 500 Search-QA teacher trajectories to align the adapters with agent policy.

Q: What benchmarks and datasets were used to evaluate LatentSkill?

Evaluation used ALFWorld (an embodied task benchmark with a seen split of 140 episodes and an unseen split of 134 episodes, capped at 50 steps each) and Search-QA (a retrieval-augmented QA benchmark sampling 500 examples per dataset, with 125 for Bamboogle), measuring success rate and Exact Match respectively.

Q: What are the key quantitative results reported for LatentSkill?

LatentSkill reduces prefill token overhead by up to 72% and shortens average interaction trajectory length from 35.0 to 28.4 steps on the ALFWorld seen split, indicating more efficient planning; the paper also reports significantly improved task success rates across both benchmarks compared to in-context baselines.

Q: How does LatentSkill differ from standard LoRA fine-tuning?

Standard LoRA trains adapters jointly with a downstream loss, tying each adapter to a specific task. LatentSkill's compiler learns a general function G_φ that can produce adapters for any skill document at inference time, enabling zero-shot reuse of the same adapter generation mechanism across tasks without retraining the backbone.

Q: How does LatentSkill differ from direct fine-tuning of the backbone?

Direct fine-tuning fuses skills irreversibly into the backbone, making them difficult to update, remove, or combine. LatentSkill keeps skills modular as separately loadable LoRA adapters that can be swapped or composed at inference time without retraining the backbone.

Q: How does LatentSkill support composing multiple skills?

Multiple skill adapters can be composed in weight space by summing their LoRA deltas with per-skill injection coefficients (Δ_K = Σ α_k C[k]). For skills with shared subcomponents, LatentSkill supports component-level composition (Skill Arithmetic), which decomposes skills into semantic components, compiles each independently, and adds only unique parts while scaling shared parts to avoid over-amplification.

Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

LatentSkill converts textual agent procedures into modular LoRA adapters, eliminating prompt overhead while enabling weight-space skill composition.

How can we convert textual procedural skills into compact, in-weight model parameters to avoid the context overhead and security risks of in-context prompting?

LLM agents rely on textual skills to handle complex tasks, but repeatedly injecting these instructions into the prompt consumes context and exposes proprietary procedures. LatentSkill replaces textual conditioning with a hypernetwork that compiles skill documents into plug-and-play LoRA adapters, which are then mounted on the frozen backbone LLM. This approach reduces prefill token overhead by up to 72% while significantly improving task success rates across embodied and search-augmented benchmarks.

Paper Primer

The framework uses a two-stage training process: a pretraining phase where the hypernetwork learns to map procedural text to weight updates, followed by trajectory-supervised fine-tuning to align these adapters with agent policy. At inference, the system selects relevant skills, compiles them into LoRA weights, and mounts them on the backbone. In effect, the compiler acts as a translator that converts a natural-language strategy into a geometric weight-space modification.

LatentSkill significantly outperforms in-context skill prompting in both task success and token efficiency.

On ALFWorld unseen tasks, LatentSkill improved success rates by 13.4 points while reducing prefill token overhead by 64.1%. On Search-QA, it reduced skill-token overhead by 72.2%.

Beyond efficiency, the generated LoRA weights exhibit a structured semantic geometry where skills from different domains form separable clusters. These adapters are controllable via a scaling coefficient and composable through parameter-space arithmetic, provided that skill components are semantically aligned before merging.

Why use a hypernetwork to generate LoRA adapters instead of just fine-tuning the model on the skills directly?

Direct fine-tuning fuses skills into the backbone, making them irreversible and difficult to update or combine. LatentSkill’s hypernetwork approach keeps skills modular, allowing them to be loaded, swapped, or composed at inference time without retraining the backbone.

Does this approach work for any arbitrary text, or is it limited to specific skill formats?

The framework is designed for procedural skill documents. While it generalizes to out-of-distribution domains like Code and Finance, its effectiveness relies on the hypernetwork's ability to map procedural text into a structured weight space; it is not intended for general-purpose prompt compression.

By shifting procedural knowledge from context to weight space, LatentSkill enables a modular agent architecture that is more efficient, robust to prompt-level attacks, and capable of dynamic skill composition.

Introduction to LatentSkill

We expose the inefficiency of prompt‑based skill injection and propose moving skills into LoRA adapters.

LLM agents currently rely on inserting textual skill descriptions into the prompt, which inflates context size and reveals proprietary procedures.

Prompt‑based skill injection consumes valuable context bandwidth and leaks the skill’s content as plain text, while also tying skill updates to costly prompt rewrites.

**Figure 2.** Overview of LatentSkill. Left: textual skills are transformed into in-weight latent skills through hypernetwork-based LoRA generation. Middle: the skill compiler is trained by skill document pretraining and trajectory-supervised fine-tuning. Right: the resulting latent skills support structured semantic geometry, controllable injection strength, and composable parameter-space arithmetic at inference time.

This shift from textual context to in‑weight adapters eliminates repeated skill tokens, reduces exposure risk, and restores modularity for updating and composing skills.

The key insight is that moving skills from the prompt into LoRA adapters cuts context overhead while preserving modular control.

Context and Prior Approaches

We situate LatentSkill among prior skill‑injection and hypernetwork approaches.

At inference time the model receives a textual description of a skill and must apply it directly from the prompt, treating the description as part of the context.

Agents interleave reasoning and action, and improve via self‑reflection; external knowledge is retrieved into the context at decision time.

Introduced a growing repository of reusable, executable skills that agents can call during interaction.

Distills trajectories into a multi‑level SkillBank that co‑evolves with the policy via reinforcement learning.

Industry‑adopted standard for describing agent skills as natural‑language snippets; used in Claude Code, Cursor, Gemini CLI.

Trains the model to embed skills directly into its parameters via a curriculum that gradually removes skill context.

Hypernetworks predict the weights of a target LoRA module, enabling rapid generation of adapters without per‑task fine‑tuning.

Uses small MLPs to generate layer‑wise weight segments from textual descriptions, then concatenates them into a full LoRA adapter.

Leverages hidden states of the backbone LLM to produce LoRA weights, but only for a subset of modules due to parameter cost.

Compresses context into a fixed token set before generating LoRA adapters, introducing an information bottleneck.

Extracts memory states at every layer and enables bidirectional cross‑layer flow via attention when generating LoRA adapters.

Employs a Perceiver‑style encoder to generate LoRA adapters from documents, focusing on context internalization.

The LatentSkill Compiler

Defines how textual skills are compiled into compact LoRA adapters for frozen LLMs.

In‑context prompting forces the model to carry the entire skill text in its prompt, consuming valuable context slots and leaking proprietary content. LatentSkill eliminates this waste by moving the skill into the model’s parameters.

The compiler translates a high‑level textual skill (like source code) into a compact LoRA adapter (the machine‑code representation) that can be injected directly into a frozen LLM.

How does this compiler differ from a standard LoRA adapter that is trained end‑to‑end on a downstream task?

Standard LoRA learns adapters jointly with the downstream loss, tying the adapter to a specific task. The LatentSkill compiler, by contrast, learns a *function* $G_{\phi}$ that can produce adapters for *any* skill document, enabling zero‑shot reuse of the same adapter generation mechanism across tasks.

Input token embeddings for the skill are encoded into a 4‑dim vector.

$G_{\phi}$ produces $A_s\in\mathbb{R}^{4\times2}$ and $B_s\in\mathbb{R}^{2\times4}$ with small integer entries (e.g., $A_s=[[1,0],[0,1],[1,1],[0,0]]$, $B_s=[[2,0,0,0],[0,2,0,0]]$).

The low‑rank update $\Delta W = A_s B_s$ yields a $4\times4$ matrix of rank 2, $[[2,0,0,0],[0,2,0,0],[2,2,0,0],[0,0,0,0]]$, whose non‑zero entries are confined to the first two columns.

When added to the frozen weight $W$, the model now has a dedicated “summarize” pathway without altering any other parameters.

The example shows that a tiny rank‑2 adapter can encode a non‑trivial skill, illustrating why the compiler’s low‑rank design keeps the parameter overhead minimal.
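The toy numbers above can be checked directly. The following is a minimal NumPy sketch, not the paper's implementation: it multiplies the 4×2 factor by the 2×4 factor, confirms the rank of the resulting update, and adds it to a frozen weight. The identity matrix used for $W$ is a hypothetical stand-in.

```python
import numpy as np

# Toy factors copied from the worked example (illustrative values only).
A_s = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])  # shape (4, 2)
B_s = np.array([[2, 0, 0, 0], [0, 2, 0, 0]])      # shape (2, 4)

# Product of a 4x2 and a 2x4 factor: a 4x4 update of rank at most 2.
delta_W = A_s @ B_s
rank = np.linalg.matrix_rank(delta_W)

# Hypothetical frozen weight; only the sum changes, W itself is untouched.
W = np.eye(4)
W_adapted = W + delta_W

print(delta_W)
print("rank:", rank)
```

The last two columns of `delta_W` are zero, so the adapter only changes how the first two input coordinates are mapped.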

With the adapter mounted, the agent predicts from the task history alone, freeing the prompt from bulky skill text and enabling fine‑grained control via the injection coefficient.

Skill Document Pretraining

First, we pretrain the skill compiler on a corpus of textual skill documents, denoted by $D_{\text{pre}} = \{s_i\}_{i=1}^N$. The goal is to initialize $G_{\phi}$ to map procedural text into usable adapter weights while keeping the backbone LLM frozen. Given a skill document $s$, we randomly instantiate one of two document-level pretraining tasks. In the reconstruction task, the compiler reads the complete skill document $s$, and the adapted backbone receives a reconstruction instruction as input and is trained to reproduce the original document $s$. In the completion task, we construct a truncated prefix $\tilde{s}$ by randomly removing the latter part of the document; the compiler reads $\tilde{s}$, and the adapted backbone is trained to complete the full skill document.

For each skill, we construct document-level supervision instances $(s_{\text{src},i}, q_i, z_i)$, where $s_{\text{src},i}$ is the text provided to the compiler, $q_i$ is the prompt given to the adapted backbone, and $z_i$ is the target output. Let $\Delta^{\text{pre}}_i = G_{\phi}(s_{\text{src},i})$ denote the adapter generated for instance $i$. The pretraining objective is

$$ L_{\text{pre}} = - \sum_{i}\sum_{j} \log p_{\theta\oplus\alpha\Delta^{\text{pre}}_i}\bigl(z_{i,j}\mid q_i, z_{i,<j}\bigr) \tag{4} $$

where the summation ranges over all document-level supervision instances and target tokens.

Only the compiler parameters $\phi$ are updated. Since the skill document is provided to $G_{\phi}$ rather than directly to the adapted backbone, information useful for predicting $z_i$ must be mediated through the generated adapter.
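The two document-level tasks can be sketched as a small instance builder. This is an illustrative sketch under several assumptions that the text does not specify: the two tasks are chosen with equal probability, truncation happens at word boundaries, the completion target is the full document, and the prompt strings are hypothetical placeholders.

```python
import random

# Hypothetical prompt wordings; the source does not give the actual instructions.
RECONSTRUCT_PROMPT = "Reproduce the skill document."
COMPLETE_PROMPT = "Complete the skill document."

def make_pretraining_instance(skill_doc, rng, min_keep=0.3, max_keep=0.8):
    """Return (s_src, q, z): compiler input, backbone prompt, target output."""
    if rng.random() < 0.5:  # assumed 50/50 task split
        # Reconstruction: the compiler reads the complete document.
        return skill_doc, RECONSTRUCT_PROMPT, skill_doc
    # Completion: the compiler reads only a truncated prefix.
    words = skill_doc.split()
    keep = max(1, int(len(words) * rng.uniform(min_keep, max_keep)))
    prefix = " ".join(words[:keep])
    return prefix, COMPLETE_PROMPT, skill_doc

rng = random.Random(0)
doc = "scan each shelf in order then open drawers and pick the target"
s_src, q, z = make_pretraining_instance(doc, rng)
```

In both branches the backbone prompt `q` never contains the skill text, so the target can only be predicted through whatever the adapter built from `s_src` encodes.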

Trajectory-Supervised Fine-Tuning

After pretraining, we fine‑tune the skill compiler with teacher agent trajectories. Let $D_{\text{sft}} = \{(s_i, \tau_i)\}_{i=1}^M$ denote the supervised dataset. Each example pairs a skill document $s_i$ with a teacher trajectory $\tau_i = \{(h_{i,t}, y^{*}_{i,t})\}_{t=1}^T$, where $h_{i,t}$ is the agent history at step $t$ and $y^{*}_{i,t}$ is the teacher output.

For each pair $(s_i, \tau_i)$, the compiler generates one latent skill, denoted by $\Delta_{\text{sft},i} = G_{\phi}(s_i)$. The same adapter is mounted throughout the entire trajectory. The fine‑tuning objective is

$$ L_{\text{sft}} = - \sum_{i,t,j} \log p_{\theta \oplus \alpha \Delta_{\text{sft},i}}\bigl(y^{*}_{i,t,j}\mid h_{i,t}, y^{*}_{i,t,<j}\bigr), $$

where the summation ranges over all trajectories, decision steps, and target tokens.

The backbone remains frozen, and only $\phi$ is updated. Since $\Delta_{\text{sft},i}$ is generated solely from the skill document $s_i$ and shared across all decision steps in $\tau_i$, the objective encourages the adapter to capture skill‑level, trajectory‑consistent policy information rather than per‑step adaptations. This aligns the compiler to produce latent skills whose effects remain stable across multi‑step interaction.
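The structure of this objective, one adapter per trajectory reused at every decision step, can be sketched in plain Python. The callables `toy_compile` and `toy_logprobs` are hypothetical stand-ins for $G_{\phi}$ and for the adapted backbone's per-token log-probabilities; they are assumptions made only to keep the sketch runnable.

```python
import math

def sft_loss(dataset, compile_skill, token_logprobs, alpha=1.0):
    """Negative log-likelihood summed over trajectories, steps, and tokens."""
    total = 0.0
    for skill_doc, trajectory in dataset:
        # One latent skill per (s_i, tau_i) pair, generated from the document only.
        delta = compile_skill(skill_doc)
        for history, target_tokens in trajectory:
            # The same adapter is mounted at every decision step.
            logps = token_logprobs(delta, alpha, history, target_tokens)
            total -= sum(logps)
    return total

# Toy stand-ins (hypothetical): record compile calls, assign p=0.5 to each token.
compile_calls = []

def toy_compile(skill_doc):
    compile_calls.append(skill_doc)
    return {"source": skill_doc}

def toy_logprobs(delta, alpha, history, target_tokens):
    return [math.log(0.5)] * len(target_tokens)

toy_dataset = [("look skill", [("obs 1", ["go", "desk"]), ("obs 2", ["look"])])]
loss = sft_loss(toy_dataset, toy_compile, toy_logprobs)
```

In a real training loop the gradient of this loss would flow only into the compiler parameters, since the backbone stays frozen.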

3.4 Inference‑Time Skill Control and Composition

At inference time, skill compilation is separated from agent execution. Given a skill library $S = \{s_1, \dots, s_K\}$, each skill can be compiled once and stored in an adapter cache $C[k] = G_{\phi}(s_k)$. After compilation, the skill is not included in the prompt. For a task instance, a skill selector chooses one or more relevant skills. If a single skill $s_k$ is selected, its cached adapter $C[k]$ is mounted on the backbone with injection coefficient $\alpha_k$. The agent then predicts each step from the current history $h_t$ using the adapted model. Setting $\alpha_k = 0$ recovers the frozen backbone, while larger values increase the influence of the latent skill.
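The compile-once, mount-with-coefficient flow can be sketched with NumPy. This is a minimal illustration under stated assumptions: `toy_compile` is a hypothetical stand-in for $G_{\phi}$ that returns a random low-rank matrix, and a single 4×4 weight stands in for the backbone.

```python
import numpy as np

def toy_compile(skill_doc, dim=4, rank=2):
    """Hypothetical stand-in for G_phi: a deterministic low-rank delta."""
    rng = np.random.default_rng(len(skill_doc))
    return rng.normal(size=(dim, rank)) @ rng.normal(size=(rank, dim))

def build_adapter_cache(skill_library, compile_skill):
    """Compile every skill once; the text never enters the prompt afterwards."""
    return {k: compile_skill(s) for k, s in enumerate(skill_library)}

def mount(W_frozen, delta, alpha):
    """Return the adapted weight; the frozen weight itself is not modified."""
    return W_frozen + alpha * delta

skill_library = ["pick up the object", "look at the object under the lamp"]
cache = build_adapter_cache(skill_library, toy_compile)

W = np.eye(4)                           # hypothetical frozen weight
W_off = mount(W, cache[0], alpha=0.0)   # recovers the frozen backbone
W_on = mount(W, cache[0], alpha=0.6)    # moderate injection strength
```

Because mounting is a plain addition, swapping skills only requires choosing a different cache entry, and no recompilation is needed when the coefficient changes.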

When multiple skills are selected, LatentSkill composes their adapters in weight space:

$$ \Delta_{\mathcal{K}} = \sum_{k \in \mathcal{K}} \alpha_k C[k], $$

where $\mathcal{K}$ is the selected skill set. The composed adapter is then mounted on the LLM for inference. For skills with shared subcomponents, direct adapter addition may over‑amplify common behavior. LatentSkill therefore also supports component‑level composition. Specifically, a skill can be decomposed into semantic components $s_k = \{c_{k,1}, \dots, c_{k,L_k}\}$, each component can be compiled independently as $\Delta_{k,\ell} = G_{\phi}(c_{k,\ell})$, and the final adapter can be formed by adding retained shared and skill‑specific components, e.g.,

$$ \Delta_{\text{comp}} = \sum_{c \in U} \gamma_c G_{\phi}(c), $$

where $U$ is the selected component set and $\gamma_c$ is an optional component‑level injection coefficient.
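The difference between whole-skill addition and component-level composition can be sketched numerically. The sketch rests on a simplifying assumption that does not hold for a real nonlinear hypernetwork: each whole-skill adapter is taken to be the exact sum of its component adapters. The component names and random matrices are hypothetical.

```python
import numpy as np

def compose_skills(cache, alphas):
    """Weighted sum of cached whole-skill adapters."""
    return sum(alpha * cache[k] for k, alpha in alphas.items())

def compose_components(component_deltas, selected, gammas=None):
    """Weighted sum over a selected component set (coefficient defaults to 1)."""
    gammas = gammas or {}
    return sum(gammas.get(c, 1.0) * component_deltas[c] for c in selected)

rng = np.random.default_rng(0)
components = {name: rng.normal(size=(4, 4)) for name in ("general", "look", "pick")}
skills = {"look": ["general", "look"], "pick": ["general", "pick"]}

# Simplifying assumption: a whole-skill adapter is the sum of its components.
cache = {k: compose_components(components, parts) for k, parts in skills.items()}

# Whole-skill addition counts the shared "general" component twice.
direct = compose_skills(cache, {"look": 1.0, "pick": 1.0})

# Component-level composition keeps each distinct component once.
unique = sorted(set(skills["look"]) | set(skills["pick"]))
merged = compose_components(components, unique)

extra = direct - merged  # equals one extra copy of the shared component
```

Under this assumption, the gap between the two results is exactly the duplicated shared component, which is the over-amplification that component-level composition is meant to avoid.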

Experimental Results

LLM agents shift from prompt‑injected skills to compact LoRA adapters with LatentSkill.

LatentSkill attains the highest average success on both ALFWorld and Search‑QA benchmarks.

Table 1 shows 74.3 % (seen) and 69.4 % (unseen) average success; Table 2 shows a 35.6 % average exact‑match score.

Beyond raw scores, LatentSkill shortens interaction trajectories, dropping the average step count from 35.0 to 28.4 on the seen split, which indicates more efficient planning.

Controllable Skill Injection

Moderate LoRA injection yields peak performance, but too much harms the backbone.

Performance peaks at moderate injection, reaching 74.3 % average success on the seen split.

Figure 4 shows the rise to 74.29 % at $\alpha$=0.6; Table 11 reports the same peak.

The coefficient $\alpha$ linearly scales the LoRA weight $\Delta W$ before it is added to the frozen backbone, letting us dial skill strength up or down.

How does scaling $\alpha$ differ from stacking additional LoRA adapters?

Stacking adds more parameters and changes the representation space, while scaling $\alpha$ keeps the same low‑rank adapter and merely amplifies or attenuates its contribution, preserving the model’s parameter budget.

The inverted‑U curve shows that moderate injection strengthens skill behavior, but excessive scaling disrupts the backbone—much like turning a speaker up too far distorts the sound. Four of six tasks share the same optimal $\alpha$, yet Pick2, Clean and Heat require slightly higher values because their backbone baselines are weaker. This suggests a stable effective injection range, but task‑specific tuning can recover up to 21 % more success.

**Figure 4.** Scale-performance curves on ALFWorld under varying LoRA injection coefficient $\alpha$. **Top:** Pick vs. Pick2, the same skill but differing in difficulty. Stars mark the per-task optimal $\alpha$. **Bottom:** Clean vs. Cool on the unseen split, using different skills. Shaded regions indicate the performance gain over the $\alpha=0$ baseline.

Skill Composition in Parameter Space

Component Merging reaches 84.6% seen and 77.8% unseen, outperforming other composition methods.

Component Merging attains 84.6% success on seen episodes and 77.8% on unseen episodes, the highest among all composition strategies.

Table 3 reports 84.6% (Seen) and 77.8% (Unseen) for Component Merging, surpassing Direct Merging (72.2% unseen) and Text Merging (61.1% unseen).

Skill Arithmetic treats LoRA adapters as algebraic objects that can be added, scaled, or subtracted, letting multiple skills be merged directly in parameter space.

How does Skill Arithmetic differ from simply averaging the two LoRA adapters?

Naïve averaging treats every weight entry equally, which double‑counts shared components. Skill Arithmetic first aligns components, then adds only the unique parts and scales shared parts to avoid over‑amplification.

**Table 3.** Skill composition results on all 31 Look task episodes under five skill composition configurations. The best result per split is highlighted in bold.

**Table.** Performance comparison across different methods on ALFWorld and Search-QA datasets, evaluating In-context versus Latent approaches.

Direct Merging simply averages whole skill adapters, which over‑amplifies shared components and harms unseen performance. Text Merging concatenates skill texts before compilation, producing an out‑of‑distribution document that the compiler cannot align properly.

Robustness and Limitations

LatentSkill compiles textual skills into compact LoRA adapters, avoiding prompt bloat.

We probe robustness by perturbing the skill text and by launching prompt‑level attacks. Four text perturbations—Paraphrase, Plaintext, Reorder, and Noise—are applied, and two attacks—Hijack and Extract—test the system’s resistance to malicious instructions and content leakage.

**Table 12.** Full per-task sensitivity results under four perturbation types. ALFWorld reports success rate (%) and Search-QA reports exact match (%). Each perturbation is evaluated with both In-context Skill and LatentSkill. The best result per perturbation is highlighted in bold, and LatentSkill rows are shaded in blue.

These findings confirm that moving skills from prompt space into LoRA weights not only saves context but also hardens the agent against both accidental perturbations and adversarial prompt manipulations.

In conclusion, LatentSkill compiles textual agent skills into compact LoRA adapters, eliminating repeated prefill overhead and enabling efficient, modular, and controllable skill deployment. The generated adapters form a structured semantic geometry that can be steered via the injection coefficient and composed in parameter space when properly aligned.

Limitations include evaluation on only two benchmarks (ALFWorld, Search‑QA) and reliance on a single frozen backbone (Qwen3‑8B) with a fixed LoRA configuration. Future work should test broader task domains, larger model families, and varied adapter designs to assess generality.

Ablation Studies

Ablation experiments probe component importance via OOD skills, injection scaling, and skill merging.

We evaluate three orthogonal ablations: (1) the ability of the hypernetwork to generalize to out‑of‑distribution (OOD) skill texts, (2) the sensitivity of performance to the LoRA injection coefficient $\alpha$, and (3) the effect of different skill‑composition merging strategies on per‑episode outcomes.

**Table 10.** Out-of-distribution skill sources collected from public GitHub repositories across three unseen domains.

Across both Pretrain and SFT stages, the Frobenius norm $\|\Delta W\|$ of the generated LoRA weights remains tightly clustered (≈ 2.78 × 10⁻³), indicating that the hypernetwork emits weight updates of stable magnitude regardless of skill‑text complexity.

The table presents performance metrics (Rank-1, Rank-2, and Rank-5 percentages) for various skills across two categories: ALFWorld and Search. Each skill is evaluated under two conditions: Pre. (Pre-training) and SFT (Supervised Fine-Tuning), with SFT results highlighted in blue.

Singular‑value analysis (Table 9) shows that the top‑2 directions already capture ~67 % of the total energy, and the top‑5 capture ~93 %, confirming that skill knowledge is compressed into a very small subspace.

**Table 9.** Cumulative singular value energy ratio (%) of top-k directions at Pretrain and SFT stages. SFT columns are shaded in blue.

Injection‑coefficient ablation (Table 11) reveals that lowering $\alpha$ reduces success rates on both seen and unseen splits, confirming that the scaling factor is essential for injecting the compiled LoRA adapters effectively.

Case 1 illustrates a failure of the Look‑Only strategy: in episode 4 the agent never leaves the desk region, whereas the Complementary Capability Transfer strategy succeeds by leveraging the compiled skill.

Overall, these ablations demonstrate that (i) the hypernetwork’s weight generation is robust to skill‑text variation, (ii) the injection coefficient $\alpha$ is a critical control knob, and (iii) proper skill composition markedly improves per‑episode performance.

Qualitative Case Studies

We examine three case studies and quantitative analyses to illustrate how Component Merging resolves skill‑combination failures.

Component Merging decomposes a skill text into independent components—general, mistakes, and task‑specific—before encoding each through the hypernetwork. By mounting the general and mistakes components only once and superposing task‑specific modules, it avoids interference that plagues naïve merging approaches.

Case 1 highlights a missing systematic search strategy in the Look skill. The model finds the desk lamp (Step 12) but fails to locate the CD; Component Merging adds a pick‑specific search that scans shelves after exhausting drawers, discovering the CD at Step 14 and completing the task in 17 steps.

Case 2 exposes weight redundancy and decision disruption when Direct Merging mounts identical general and mistakes components for Look and Pick. The duplicated components over‑amplify shared behavioral patterns, suppressing the pick‑up trigger despite correct target identification, leading to 18 steps without a successful pick.

Case 3 demonstrates out‑of‑distribution (OOD) encoding failure of Text Merging. Although the model correctly picks up the alarmclock at Step 2, subsequent actions are repeatedly interrupted, causing a fragmented sequence that never completes the intended handoff.

Table 11 reports ALFWorld success rates across LoRA scaling factor $\alpha$ on seen and unseen splits. Increasing $\alpha$ generally improves the Pick task, while the Look and Clean tasks show non‑monotonic trends, illustrating trade‑offs between skill components.

Table 12 evaluates four perturbation types—Paraphrase, Plaintext, Reorder, and Noise—on both ALFWorld and Search‑QA. LatentSkill consistently outperforms the in‑context baseline under each perturbation, confirming robustness of the component‑wise encoding.

The perturbations target semantic content, formatting, ordering, and information density of skill texts, covering realistic degradation scenarios that may arise in deployment.

Overall, Component Merging preserves the strength of general behavioral patterns while cleanly injecting task‑specific capabilities, enabling coherent sequential decision making where Direct and Text Merging fail.

Training Configuration

Details of pretraining, fine‑tuning, skill mapping, and ablation analyses for the LatentSkill system.

We pretrain the hypernetwork on roughly 171K deduplicated skill documents (≈300M tokens) using 8×H100 GPUs for 10 epochs, with batch size 64, learning rate $5\times10^{-5}$, 200 warmup steps, and a maximum sequence length of 4,096 tokens. AdamW with weight decay 0.1 is used for optimization.

Supervised fine‑tuning (SFT) refines the pretrained hypernetwork on 237 ALFWorld and 500 Search‑QA task trajectories, again on 8×H100 GPUs for 10 epochs, with batch size 32, learning rate $1\times10^{-5}$, 400 warmup steps, and a maximum sequence length of 4,096 tokens. The same AdamW settings apply.
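For reference, the two training stages can be collected into one configuration object. The values are taken from the description above; the class and field names are hypothetical, and the text does not say whether the batch size is global or per device.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (field names are illustrative)."""
    data: str
    epochs: int
    batch_size: int
    learning_rate: float
    warmup_steps: int
    max_seq_len: int
    optimizer: str = "AdamW"
    weight_decay: float = 0.1
    gpus: str = "8xH100"

PRETRAIN = StageConfig(
    data="~171K deduplicated skill documents (~300M tokens)",
    epochs=10, batch_size=64, learning_rate=5e-5,
    warmup_steps=200, max_seq_len=4096,
)

SFT = StageConfig(
    data="237 ALFWorld + 500 Search-QA teacher trajectories",
    epochs=10, batch_size=32, learning_rate=1e-5,
    warmup_steps=400, max_seq_len=4096,
)
```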

**Table 6.** Search-QA skill-to-task matching rules.

Evaluation on ALFWorld uses both a seen split (140 episodes) and an unseen split (134 episodes), capping each episode at 50 steps and reporting success rate percentages.

Search‑QA evaluation samples 500 examples per dataset (the full 125 for Bamboogle), employs the E5 retriever with top‑$k=3$ passages, allows up to 4 retrieval steps, and measures Exact Match.
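Exact Match is usually computed after light answer normalization. The sketch below assumes the common SQuAD-style normalization (lowercasing, removing punctuation and English articles, collapsing whitespace). The source does not state which normalization is used, so treat this as one plausible reading rather than the paper's scoring script.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction matches any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))

def exact_match_score(predictions, references):
    """Average Exact Match (in percent) over a list of examples."""
    hits = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(hits) / len(hits)
```

For example, `exact_match("The Eiffel Tower.", ["Eiffel Tower"])` counts as a match under this normalization, while a paraphrased but differently worded answer does not.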

**Figure 5.** Per-module discriminability (within-domain minus cross-domain cosine similarity gap) for the 7 LoRA injection positions in Qwen3-8B, measured before (Pretrain) and after (SFT) instruction fine-tuning. `attn_o` and `mlp_down` exhibit substantially higher gaps, identifying them as the primary carriers of skill-specific knowledge.
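The discriminability gap named in the caption can be sketched as follows. This is a minimal NumPy illustration under assumptions: each skill's adapter at one injection position is flattened into a vector, and the gap is the mean cosine similarity of same-domain pairs minus that of cross-domain pairs. The toy vectors and domain labels are hypothetical.

```python
import numpy as np

def discriminability_gap(vectors, domains):
    """Mean within-domain cosine similarity minus mean cross-domain similarity."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sim = X @ X.T                                      # pairwise cosine similarity
    labels = np.asarray(domains)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)        # exclude self-similarity
    within = sim[same & off_diag].mean()
    cross = sim[~same].mean()
    return float(within - cross)

# Toy flattened adapters for two hypothetical domains.
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
toy_domains = ["alfworld", "alfworld", "search", "search"]
gap = discriminability_gap(toy_vectors, toy_domains)
```

A larger gap means that adapters at this position separate domains more cleanly, which is the property the figure attributes to `attn_o` and `mlp_down`.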

We compare six injection configurations—full, full:o+d, last6, last6:o+d, first30, and first30:o+d—all with injection coefficient $\alpha=1$, to assess how limiting LoRA injection to `attn_o` and `mlp_down` affects performance.

**Table.** Performance of the six injection configurations (full, last6, and first30, each with an o+d variant) on Pick, Look, Clean, Heat, Cool, and Pick2, plus the average, for the seen and unseen splits.
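The six configuration names can be read as a layer range plus an optional module filter. The sketch below is one plausible interpretation, not the paper's code. It assumes a 36-layer backbone, that `last6` and `first30` mean the final six and first thirty layers, and that `o+d` restricts injection to `attn_o` and `mlp_down`. The five other module names are hypothetical labels for the remaining injection positions.

```python
# Seven injection positions per layer; only attn_o and mlp_down are named in the text.
ALL_MODULES = ("attn_q", "attn_k", "attn_v", "attn_o", "mlp_gate", "mlp_up", "mlp_down")
OD_MODULES = ("attn_o", "mlp_down")

def injection_targets(config, num_layers=36):
    """Map a name such as 'last6:o+d' to a list of (layer_index, module) pairs."""
    scope, _, variant = config.partition(":")
    if scope == "full":
        layers = range(num_layers)
    elif scope.startswith("last"):
        n = int(scope[len("last"):])
        layers = range(num_layers - n, num_layers)
    elif scope.startswith("first"):
        n = int(scope[len("first"):])
        layers = range(n)
    else:
        raise ValueError(f"unknown scope: {scope}")
    if variant not in ("", "o+d"):
        raise ValueError(f"unknown variant: {variant}")
    modules = OD_MODULES if variant == "o+d" else ALL_MODULES
    return [(layer, module) for layer in layers for module in modules]

CONFIGS = ["full", "full:o+d", "last6", "last6:o+d", "first30", "first30:o+d"]
target_counts = {name: len(injection_targets(name)) for name in CONFIGS}
```

Under these assumptions the `o+d` variants touch two of the seven positions per layer, which makes the reduction in injected modules easy to compare across configurations.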

Finally, we analyze the low‑rank structure of the generated LoRA weights by measuring Frobenius norm, stable rank, and the cumulative top‑$k$ singular‑value energy of the weight increment $\Delta W$ for all skills at both pretrain and SFT stages.
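These three statistics can be computed from the singular values of a weight increment. The sketch below assumes that energy means squared singular values and that stable rank is the squared Frobenius norm divided by the squared spectral norm. The source does not spell out its definitions, so these are the standard ones rather than confirmed choices.

```python
import numpy as np

def low_rank_stats(delta_W, ks=(1, 2, 5)):
    """Return (Frobenius norm, stable rank, cumulative top-k energy ratios)."""
    s = np.linalg.svd(delta_W, compute_uv=False)   # singular values, descending
    energy = s ** 2
    frobenius = float(np.sqrt(energy.sum()))
    stable_rank = float(energy.sum() / energy[0])
    cumulative = np.cumsum(energy) / energy.sum()
    top_k = {k: float(cumulative[min(k, len(s)) - 1]) for k in ks}
    return frobenius, stable_rank, top_k

# Toy increment with singular values 4, 3, 0.
toy_delta = np.diag([3.0, 4.0, 0.0])
fro, srank, topk = low_rank_stats(toy_delta, ks=(1, 2))
```

A stable rank close to 1 and a top-k ratio that saturates for small k both indicate that most of the update lies in a few directions.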

Questions & answers

What is the injection coefficient α and why does it matter?

The injection coefficient α is a scalar that amplifies or attenuates the contribution of a mounted LoRA adapter without adding parameters or changing the representation space. Ablation results show that lowering α reduces success rates on both seen and unseen splits, and that an inverted-U curve exists where excessive scaling disrupts the backbone; four of six ALFWorld tasks share the same optimal α, while Pick2, Clean, and Heat require slightly higher values.

How robust is LatentSkill to perturbations and adversarial attacks?

The paper evaluates four text perturbation types (Paraphrase, Plaintext, Reorder, Noise) and two adversarial attacks (Hijack and Extract) on both ALFWorld and Search-QA. LatentSkill consistently outperforms the in-context baseline under each perturbation type, which the paper takes as evidence that its component-wise encoding is robust to both accidental degradation and adversarial manipulation.

What are the limitations of LatentSkill as acknowledged by the paper?

The paper acknowledges evaluation on only two benchmarks (ALFWorld and Search-QA), reliance on a single frozen backbone (Qwen3-8B) with a fixed LoRA configuration, and that the framework is designed for procedural skill documents rather than general-purpose prompt compression. Future work is noted as needed for broader task domains, larger model families, and varied adapter designs.

Does LatentSkill generalize to out-of-distribution skill domains?

The paper reports that the hypernetwork generalizes to out-of-distribution domains such as Code and Finance, and that the Frobenius norm of generated LoRA weights remains tightly clustered (~2.78×10⁻³) regardless of skill-text complexity, indicating stable weight generation across varied inputs.
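As a rough illustration of the norm check, the sketch below computes the Frobenius norm of a few weight deltas and a simple relative spread across them. The matrices are made-up toy values, and the spread statistic (range divided by mean) is an assumption; the summary does not say how the paper quantifies "tightly clustered".

```python
import math

def frobenius_norm(m: list[list[float]]) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))

def relative_spread(norms: list[float]) -> float:
    """(max - min) / mean; values near 0 mean the norms are tightly clustered."""
    mean = sum(norms) / len(norms)
    return (max(norms) - min(norms)) / mean

# Toy deltas standing in for adapters generated from different skill texts.
deltas = {
    "short_skill": [[3.0, 4.0]],
    "long_skill": [[1.0, 2.0], [2.0, 4.0]],
}
norms = [frobenius_norm(d) for d in deltas.values()]
print(norms, relative_spread(norms))
```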

What geometric structure do the generated LoRA adapters exhibit?

The generated LoRA weights exhibit a structured semantic geometry in which skills from different domains form separable clusters in weight space. Singular-value analysis shows that the top-2 directions capture ~67% of total energy and the top-5 capture ~93%, indicating that skill knowledge is concentrated in a low-dimensional subspace.
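A top-k energy figure of this kind can be computed from a list of singular values as sketched below. The sketch assumes "energy" means the sum of squared singular values, which is a common convention but is not confirmed by the summary; the input values are illustrative only.

```python
def topk_energy(singular_values: list[float], k: int) -> float:
    """Fraction of total energy (sum of squared singular values) in the k largest."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0.0:
        raise ValueError("all singular values are zero")
    return sum(squared[:k]) / total

toy_singular_values = [3.0, 4.0]  # illustrative, not from the paper
print(topk_energy(toy_singular_values, 1))  # 16 / 25
```

In practice the singular values of a weight delta could be obtained with a routine such as `numpy.linalg.svd(delta, compute_uv=False)` and passed to the same function.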

What backbone model and hardware were used for training?

The backbone LLM is Qwen3-8B, kept frozen throughout training. Pretraining and SFT both used 8×H100 GPUs for 10 epochs, with AdamW optimizer (weight decay 0.1), learning rates of 5×10⁻⁵ (pretraining) and 1×10⁻⁵ (SFT), and a maximum sequence length of 4,096 tokens.

How does LatentSkill handle skill selection at inference time?

At inference time, all skills in the library are compiled once and stored in an adapter cache. A skill selector chooses one or more relevant skills for a given task instance, and the corresponding cached adapter (or composed adapter for multiple skills) is mounted on the backbone LLM without including any skill text in the prompt.
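The compile-once, select-per-task flow can be sketched as follows. The class and function names are hypothetical, the compiler is replaced by a stub, and the word-overlap selector is only a placeholder, since the summary does not describe how the paper's skill selector works.

```python
from typing import Callable

Adapter = list[list[float]]

class AdapterCache:
    """Compiles each skill document once and serves the stored adapter afterwards."""

    def __init__(self, compile_fn: Callable[[str], Adapter]):
        self._compile = compile_fn
        self._store: dict[str, Adapter] = {}
        self.compile_calls = 0

    def build(self, library: dict[str, str]) -> None:
        for name, doc in library.items():
            if name not in self._store:  # skip skills that are already compiled
                self._store[name] = self._compile(doc)
                self.compile_calls += 1

    def get(self, name: str) -> Adapter:
        return self._store[name]

def select_skills(task: str, library: dict[str, str], top_k: int = 1) -> list[str]:
    """Placeholder selector: rank skills by word overlap with the task text."""
    task_words = set(task.lower().split())
    def overlap(name: str) -> int:
        return len(task_words & set(library[name].lower().split()))
    return sorted(library, key=overlap, reverse=True)[:top_k]

library = {
    "heat": "heat the object in the microwave",
    "clean": "clean the object in the sink",
}
cache = AdapterCache(lambda doc: [[float(len(doc))]])  # stub compiler
cache.build(library)
cache.build(library)  # second call compiles nothing new

task = "heat the mug with the microwave"
chosen = select_skills(task, library)
adapter = cache.get(chosen[0])  # would be mounted on the backbone
prompt = f"Task: {task}"        # skill text never enters the prompt
```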

Who authored LatentSkill and where was it published?

The paper does not specify author names or the publication venue; it is available on arXiv at https://arxiv.org/abs/2606.06087.

Key terms

LatentSkill: The proposed framework that compiles textual agent skill documents into LoRA adapters via a hypernetwork, replacing in-context skill injection with in-weight latent skills on a frozen backbone LLM.
hypernetwork: A neural network that generates the weights of another network; here, the compiler G_φ that takes a skill document as input and outputs LoRA adapter weight deltas.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that adds small, low-rank weight matrices (adapters) to a frozen model's layers instead of updating all parameters.
LoRA adapter: A compact set of low-rank weight matrices generated by the LatentSkill compiler and mounted on the frozen backbone LLM to encode a specific skill.
skill document: A procedural text description of an agent skill, such as step-by-step instructions for completing a task, which serves as input to the LatentSkill compiler.
in-context skill: A skill encoded as text inserted directly into the LLM's prompt at inference time, consuming context tokens and potentially exposing proprietary content.
in-weight latent skill: A skill encoded as LoRA adapter weights mounted on the backbone LLM, replacing prompt-level text injection with parameter-space conditioning.
injection coefficient (α): A scalar multiplier applied to a mounted LoRA adapter that controls the strength of the skill's influence on the backbone LLM without adding parameters.
Skill Arithmetic: A component-level skill composition method that decomposes skills into semantic components, compiles each independently, and combines them by adding unique parts and scaling shared parts to avoid over-amplification.
adapter cache: A precomputed store of compiled LoRA adapters for each skill in the library, allowing skills to be retrieved and mounted at inference time without recompilation.
prefill token overhead: The number of input tokens consumed by skill text inserted into the prompt before the model begins generating a response, which LatentSkill reduces by up to 72%.
trajectory-supervised fine-tuning (SFT): A training stage where the hypernetwork compiler is refined using teacher agent trajectories that pair skill documents with sequences of correct agent actions.
ALFWorld: An embodied task benchmark used to evaluate LatentSkill, consisting of seen (140 episodes) and unseen (134 episodes) splits with success rate as the metric.
Search-QA: A retrieval-augmented question-answering benchmark used to evaluate LatentSkill, measuring Exact Match on 500 sampled examples per dataset (125 for Bamboogle).
Frobenius norm: A measure of the overall magnitude of a matrix computed as the square root of the sum of squared entries, used here to assess the stability of generated LoRA weight updates.
stable rank: A measure of the effective dimensionality of a matrix, computed as the squared Frobenius norm divided by the squared largest singular value, used to characterize how compactly skill knowledge is encoded in the LoRA weight increments.
component-level composition: A skill merging strategy that decomposes each skill document into semantic sub-components (e.g., general, mistakes, task-specific), compiles each independently, and superimposes only the relevant components to avoid interference.
Hijack attack: An adversarial prompt-level attack that attempts to override the agent's intended behavior by injecting malicious instructions into the prompt.
Extract attack: An adversarial prompt-level attack that attempts to cause the agent to leak the content of its skill documents through its outputs.
Qwen3-8B: The specific frozen backbone large language model used in LatentSkill experiments, on which compiled LoRA adapters are mounted.
