Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Code2LoRA uses hypernetworks to inject repository-specific knowledge into frozen LLMs via generated LoRA adapters.

How can we inject repository-specific knowledge into a frozen code LLM without increasing inference-time token overhead?

Code language models struggle to maintain repository-level context, often relying on costly retrieval or brittle fine-tuning that fails as codebases evolve. Code2LoRA uses a hypernetwork to map repository context into LoRA adapter weights, injecting project-specific knowledge with zero inference-time token overhead. On the static track, Code2LoRA-Static achieves 63.8% exact match, outperforming retrieval-based methods and matching the per-repository fine-tuning upper bound.

Paper Primer

The framework operates on two axes: how knowledge enters the parameters and when it is refreshed. Code2LoRA-Static maps a single repository snapshot to an adapter, while Code2LoRA-Evo uses a Gated Recurrent Unit (GRU) to aggregate sequential code diffs, allowing the adapter to evolve alongside the codebase.

Code2LoRA-Evo maintains superior performance in evolving codebases.

On the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match, 5.2 percentage points above a shared LoRA baseline.

Parametric injection outperforms context-injection methods like RAG.

Code2LoRA-Static reaches 63.8% cross-repo exact match on the static track, 9.9 percentage points above the strongest baseline (FFT + RAG).

Why use a hypernetwork instead of standard retrieval-augmented generation (RAG)?

RAG stresses the model's context window and incurs per-query retrieval costs. Code2LoRA distills repository knowledge into model parameters, eliminating inference-time token overhead.

What is the primary advantage of the Code2LoRA-Evo variant?

It tracks repository changes commit-by-commit. By using a GRU to aggregate diffs, it refreshes the adapter as the codebase evolves, preventing the "stale adapter" problem inherent to static snapshots.

Introduction

We expose the need for repository context and introduce Code2LoRA to inject it without token overhead.

Code language models must understand imports, APIs, and project‑specific conventions that span an entire repository. Existing approaches either prepend massive retrieved code (RAG) or fine‑tune a separate LoRA adapter per repository, both of which become costly and brittle as code evolves.

Code2LoRA uses a hypernetwork to generate LoRA adapters that embed a repository’s knowledge directly into the model, eliminating any extra tokens at inference time.

The key trade‑off is that larger context windows increase inference latency, whereas Code2LoRA injects repository knowledge without expanding the context.

Related Work

Survey of prior LoRA adapters, hypernetwork generators, and repository-aware code models.

Parameter‑efficient fine‑tuning (PEFT) has become the dominant paradigm for adapting large models, with LoRA as its flagship technique.

LoRA injects a low‑rank update into a frozen model, letting a tiny set of parameters capture task‑specific knowledge.

Quantized LoRA that stores adapters in 4‑bit precision, further cutting memory while preserving accuracy.

Dynamic‑rank LoRA that adapts its rank per layer based on task difficulty.

Technique for combining multiple LoRA adapters into a single set of weights.

Routing mechanism that selects among several LoRA modules per input token.

Application of LoRA to code generation tasks, fine‑tuning only the code‑specific layers.

Mixture‑of‑Language‑Experts LoRA, training a separate adapter per programming language.

Uses a hypernetwork to generate task‑specific LoRA adapters from a task description.

Hypernetwork‑driven LoRA generation for cross‑task generalization.

Single‑pass generation of adapters conditioned on the input context.

Factorized hypernetwork that produces LoRA parameters with reduced computational cost.

Maps a short textual task description to a LoRA adapter in a single forward pass.

Generates LoRA adapters from an entire document using per‑layer activations of a frozen LLM.

Injects cross‑file repository context directly into the model input.

Iteratively retrieves relevant repository snippets and generates code conditioned on them.

Selective retrieval architecture that attends over a repository index.

Jointly models in‑file and cross‑file context for code generation.

Repository‑aware completion that augments the decoder with retrieved code snippets.

Uses a semantic graph of the repository to retrieve relevant nodes for generation.

Benchmark evaluating code generation across multiple repositories.

Evaluation suite focusing on repository‑level code completion.

Base code language model used for experiments in this work.

Recent open‑source code language model.

Large‑scale code generation model trained on diverse repositories.

State‑of‑the‑art code model with extensive pretraining on open‑source code.

The Code2LoRA Framework

We describe how a hypernetwork turns repository embeddings into LoRA adapters for a frozen code model.

The method replaces costly full‑repository encoding at inference with a lightweight hypernetwork that produces LoRA adapters on the fly.

The hypernetwork is a neural network that consumes a fixed-size repository embedding and outputs the low-rank matrices ($A_m$, $B_m$) used to modify a frozen language model.

Why not fine‑tune the whole model instead of using a hypernetwork?

Fine-tuning would require back-propagating through the entire backbone for every repository, which is infeasible at scale. The hypernetwork confines learning to its own parameters (≈ 720 M for Code2LoRA-Static), which are shared across repositories, while still injecting repository-specific knowledge via LoRA.

Each file $f_i$ is split into 4096‑token chunks with 512‑token overlap.

Every chunk is embedded by the frozen Qwen3-Embedding-0.6B model; the chunk embeddings are mean-pooled to obtain a file vector $f_i \in \mathbb{R}^{1024}$.

For the whole repository, each $f_i$ receives a weight $w_i$ based on distinctiveness, size, and path importance.

The repository embedding $e = [\text{weighted\_mean}(f_i),\; \max_i f_i] \in \mathbb{R}^{2048}$ concatenates a weighted mean and a max‑pool.

The embedding $e$ is pre‑computed and stored; gradients never flow back through the encoder.
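The aggregation steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the embedder is left out (file vectors are passed in directly), and the weights $w_i$ are treated as given inputs because the source does not spell out how distinctiveness, size, and path importance are combined.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], size: int = 4096, overlap: int = 512) -> List[List[int]]:
    """Split a token sequence into windows of `size` tokens overlapping by `overlap`."""
    if not tokens:
        return []
    stride = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += stride
    return chunks


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average chunk embeddings into one file vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def repo_embedding(file_vectors: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Concatenate the weighted mean and the element-wise max of the file vectors."""
    total = sum(weights)
    dim = len(file_vectors[0])
    weighted_mean = [
        sum(w * v[j] for w, v in zip(weights, file_vectors)) / total for j in range(dim)
    ]
    max_pool = [max(v[j] for v in file_vectors) for j in range(dim)]
    return weighted_mean + max_pool  # length 2 * dim (2048 when dim = 1024)
```

With 1024-dimensional file vectors the result has 2048 entries, matching the size of $e$ stated above.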

Given a single repository embedding $e$, the static hypernetwork predicts all LoRA adapters in one forward pass via a shared MLP trunk and per‑module heads.

Toy example (4-dimensional; the $\sqrt{H}$ rescaling is omitted for readability). Suppose the trunk output before normalization is $u = \text{MLP}(e) \approx [0.84, -0.84, 0, 1.76]$.

Its L2 norm is $\sqrt{0.84^2+(-0.84)^2+0^2+1.76^2}\approx 2.12$, so $h = u / \lVert u \rVert \approx [0.40, -0.40, 0, 0.83]$.

Head$^A_{\text{q}}$ is a linear map with $W_A = \begin{bmatrix}0.5&0\\0.5&0\\0&0.5\\0&0.5\end{bmatrix}$, applied as $W_A^\top h$ so that the dimensions agree: $A_{\text{q}} = \tanh(W_A^\top h) = \tanh([0, 0.41]) \approx [0, 0.39]$.

Head$^B_{\text{q}}$ uses $W_B = I$; $B_{\text{q}} = \tanh(h) \approx [0.38, -0.38, 0, 0.68]$.

Injecting into $W$ yields $W' = W + \frac{\alpha}{r} B_{\text{q}} A_{\text{q}}^\top$; in this toy case the outer product adds a small rank-1 correction of shape $4 \times 2$ (the experiments use $r = 16$).

The static hypernetwork turns a single high‑dimensional code summary into a set of consistent adapters, avoiding per‑layer re‑encoding.
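A compact sketch of the static forward pass may help make the trunk-and-heads structure concrete. It follows the description in the Figure 4 caption (two GELU layers, L2 normalization rescaled by $\sqrt{H}$, per-module heads with $\tanh \cdot \exp(s_m)$ scaling), but everything else is an assumption: the class name, random initialization, omitted biases, toy dimensions, and the clamp range for $s_m$ are illustrative only.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(rng, out_dim, in_dim):
    """Random (out_dim x in_dim) weight matrix; biases omitted for brevity."""
    bound = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-bound, bound) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class StaticHypernet:
    """Hypothetical toy version: shared trunk plus one (A, B) head pair per module type."""

    def __init__(self, embed_dim, hidden, modules, rank, seed=0, s_clamp=(-3.0, 3.0)):
        rng = random.Random(seed)
        self.hidden, self.rank, self.modules = hidden, rank, modules
        self.s_clamp = s_clamp  # assumed clamp range for the learnable log-scale
        self.trunk = [make_linear(rng, hidden, embed_dim), make_linear(rng, hidden, hidden)]
        self.heads = {
            name: {
                "A": make_linear(rng, rank * d_in, hidden),
                "B": make_linear(rng, d_out * rank, hidden),
                "log_scale": 0.0,
            }
            for name, (d_in, d_out) in modules.items()
        }

    def trunk_forward(self, e):
        h = e
        for W in self.trunk:
            h = [gelu(v) for v in matvec(W, h)]
        norm = math.sqrt(sum(v * v for v in h)) or 1.0
        return [v / norm * math.sqrt(self.hidden) for v in h]  # L2-normalize, rescale by sqrt(H)

    def forward(self, e):
        h = self.trunk_forward(e)
        lo, hi = self.s_clamp
        adapters = {}
        for name, (d_in, d_out) in self.modules.items():
            head = self.heads[name]
            scale = math.exp(min(max(head["log_scale"], lo), hi))
            a = [scale * math.tanh(v) for v in matvec(head["A"], h)]
            b = [scale * math.tanh(v) for v in matvec(head["B"], h)]
            A = [a[i * d_in:(i + 1) * d_in] for i in range(self.rank)]        # r x d_in
            B = [b[i * self.rank:(i + 1) * self.rank] for i in range(d_out)]  # d_out x r
            adapters[name] = (A, B)
        return adapters


def lora_delta(A, B, alpha):
    """Low-rank update (alpha / r) * B @ A with shape d_out x d_in."""
    r = len(A)
    return [
        [alpha / r * sum(B[i][k] * A[k][j] for k in range(r)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]
```

In this sketch one call to `forward` yields every adapter pair at once, and the same pair for a given module type would be reused in every transformer layer.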

How does this differ from applying a separate LoRA per transformer layer?

Instead of learning independent adapters for each layer (which would multiply parameters), the static hypernetwork shares one adapter pair per projection type across all 28 transformer layers, keeping the adaptation consistent and sharply reducing the number of weights it has to generate.
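The saving from sharing can be illustrated with simple arithmetic. The rank (16), layer count (28), and number of projection types (7) come from the text; the projection shapes below are assumptions about the backbone chosen for illustration, not figures reported in the source. The count covers only the generated adapter entries, not the hypernetwork's own trainable parameters.

```python
RANK = 16
NUM_LAYERS = 28

# (d_in, d_out) per projection type. ASSUMED shapes, for illustration only.
PROJECTION_SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}


def lora_elements(d_in: int, d_out: int, rank: int) -> int:
    """Entries in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


shared_across_layers = sum(lora_elements(i, o, RANK) for i, o in PROJECTION_SHAPES.values())
independent_per_layer = shared_across_layers * NUM_LAYERS

print(f"shared across layers:  {shared_across_layers:,}")
print(f"independent per layer: {independent_per_layer:,}")
```

Under these assumed shapes, sharing reduces the number of values the heads must emit by a factor equal to the layer count.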

Code2LoRA-Evo processes a chronological stream of diff embeddings with a GRU, updating a hidden state that feeds the same MLP-head pipeline, so the generated adapters evolve as the repository changes.

Step 1: $z_0$ is initialized by a linear projector from the static snapshot (here $[0,0]$).

Step 2: Feed $e_1$ to the GRU → $z_1=[0.5,0]$.

Step 3: Use $z_1$ in the static head to produce adapters $A_m(z_1), B_m(z_1)$.

Step 4: Feed $e_2$ to the GRU together with $z_1$ → $z_2=[0.5,0.5]$.

Step 5: Generate a new adapter pair from $z_2$, reflecting the repository’s evolution.

The recurrent hypernetwork updates adapters incrementally, so each commit only incurs a cheap GRU step instead of a full re‑encoding.
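The per-commit update can be sketched with a plain GRU cell. This is a hedged illustration using the standard GRU gate equations; the class and function names are hypothetical, biases and the input projection with LayerNorm are omitted, and the weights are random rather than trained, so the numbers will not match the toy steps above.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class GRUCell:
    """Minimal GRU cell without biases (illustrative, not the paper's implementation)."""

    def __init__(self, input_dim, hidden_dim, seed=0, zero_init=False):
        rng = random.Random(seed)
        bound = 1.0 / math.sqrt(hidden_dim)

        def mat(rows, cols):
            if zero_init:
                return [[0.0] * cols for _ in range(rows)]
            return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]

        # One input matrix and one recurrent matrix per gate:
        # r = reset gate, u = update gate, n = candidate state.
        self.W_x = {g: mat(hidden_dim, input_dim) for g in "run"}
        self.W_h = {g: mat(hidden_dim, hidden_dim) for g in "run"}

    def step(self, x, state):
        ax = {g: matvec(self.W_x[g], x) for g in "run"}
        ah = {g: matvec(self.W_h[g], state) for g in "run"}
        r = [sigmoid(a + b) for a, b in zip(ax["r"], ah["r"])]
        u = [sigmoid(a + b) for a, b in zip(ax["u"], ah["u"])]
        n = [math.tanh(a + ri * b) for a, ri, b in zip(ax["n"], r, ah["n"])]
        # Convex blend of the candidate and the previous state.
        return [(1.0 - ui) * ni + ui * si for ui, ni, si in zip(u, n, state)]


def evolve(cell, initial_state, diff_embeddings):
    """Fold a chronological list of diff embeddings into successive hidden states."""
    states, state = [], list(initial_state)
    for e_t in diff_embeddings:
        state = cell.step(e_t, state)
        states.append(state)
    return states
```

Each returned state has the same fixed size, so any of them can be handed to the adapter heads. In an autograd framework, the Figure 5 caption indicates the state would additionally be detached every $K=16$ steps (truncated BPTT); that detail is not modeled here.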

Why not simply concatenate all diff embeddings and feed them to the static MLP?

Concatenating would produce a very high‑dimensional input, exploding the MLP’s parameter count. The GRU compresses the sequence into a fixed‑size hidden state, preserving temporal order while keeping the hypernetwork lightweight.

**Figure 1.** Code2LoRA architecture. (a) Overall pipeline: repository context is encoded and mapped to LoRA adapters, which are injected into a frozen LLM to support inference (example task: assertion completion). (b) Code2LoRA-Static's static hypernetwork. (c) Code2LoRA-Evo's recurrent hypernetwork.

**Figure 4.** Detailed Code2LoRA-Static architecture. (1) Repository-level context is encoded by a frozen embedding model (Qwen3-Embedding-0.6B) and aggregated into a 2048-dim repository embedding $e_{repo}$; the result is stored in the dataset and consumed verbatim at training time—gradients never flow back through the embedder. (2) A shared MLP trunk (2-layer GELU, hidden $H=512$) maps $e_{repo}$ to a hidden representation $h$ (L2-normalized, rescaled by $\sqrt{H}$); separate $Head^A_m, Head^B_m$ heads emit $A_m, B_m$ for each of the 7 projection types via $\tanh \cdot \exp(s_m)$ scaling with a clamped learnable log-scale $s_m$. The same $(A_m, B_m)$ pair is shared across all 28 transformer layers. (3) Generated LoRA weights are injected into the frozen LLM via $W' = W + \frac{\alpha}{r} B_m A_m$. Only the hypernetwork parameters $\theta$ are trained via the language-modeling loss (dashed red); the LLM and embedder stay frozen.

**Figure 5.** Detailed Code2LoRA-Evo architecture and training procedure. (1) Per-commit production-code diffs $\Delta_t$ and the initial repository snapshot are encoded by the shared frozen embedder into 2048-dim vectors $\{e_t\}_{t=1}^T$ and $e_{\text{repo}}^{(0)}$; the resulting embeddings are stored in the dataset. (2) A small repo-state initializer (Linear $\rightarrow$ GELU $\rightarrow$ LayerNorm) maps the static snapshot $e_{\text{repo}}^{(0)}$ to the initial hidden state $h_0 \in \mathbb{R}^{2048}$. (3) A 1-layer GRU walks the chronological diff sequence; each step projects $e_t$ with a Linear + LayerNorm and applies the GRU recurrence to produce $h_t$. Truncated BPTT detaches the hidden state every $K=16$ steps. (4) The final state $h_T$ is fed (after LayerNorm) into Code2LoRA-Evo's LoRA-generation projection head (analogous in design to Code2LoRA-Static's; Figure 4): a 2-layer GELU trunk with L2-norm rescaling, plus per-module-type $\text{Head}_t^A/\text{Head}_t^B$ output heads with $\tanh \cdot \exp(s_m)$ scaling. The resulting $(A_m, B_m)$ are shared across all 28 transformer layers per type. (5) Generated LoRAs are injected into the frozen LLM ($W' = W + \frac{\alpha}{r} B_m A_m$); training minimizes the cross-entropy loss on the assertion target. Gradients (dashed red) flow through the projection head, GRU, and repo-state initializer; the LLM and embedder stay frozen.

RepoPeftBench Benchmark

RepoPeftBench defines repository‑level PEFT tasks and two tracks to assess static and evolving code.

Existing code‑language‑model benchmarks either expose only isolated snippets or require per‑instance retrieval, making it impossible to evaluate how a model adapts to the continual evolution of a real codebase.

RepoPeftBench supplies a full‑repository, parameter‑efficient fine‑tuning benchmark so models can be tested on realistic, evolving Python projects.

How does RepoPeftBench differ from prior code‑completion benchmarks?

Prior benchmarks either give isolated functions (LiveCodeBench) or require a retrieval slice per instance (RepoBench, CrossCodeEval). RepoPeftBench instead releases the entire repository, letting a PEFT method see all code and adapt continuously as the repo evolves.

The two tracks expose opposite extremes: a frozen snapshot versus a stream of commits, testing whether a model can stay current as code changes.

Why isn’t a single snapshot sufficient for evaluating models on evolving repositories?

A snapshot ignores the bursty commit patterns that introduce new assertions and change surrounding code. Without seeing those changes, a model cannot be judged on its ability to adapt its LoRA adapters over time, which is the core capability RepoPeftBench is designed to measure.

**Figure 2.** Bursty commit pattern, illustrated with 5 randomly selected repositories out of the 604 RepoPeftBench repositories. Test-touching commits arrive irregularly; the median repository accumulates over 100 such commits, motivating per-commit (rather than one-shot) adaptation under software evolution.

**Figure 3.** Token length distributions for prefix-only (left) and DRC+prefix (right) input formats across all splits. Vertical dashed lines mark common context window sizes. Prefix-only inputs are compact (median 224 tokens), while DRC+prefix inputs have a heavy right tail requiring larger context windows.

The benchmark’s evolution track makes clear why evaluating models on continuously changing codebases is essential.

Experimental Setup and Results

We evaluate Code2LoRA on RepoPeftBench using Qwen models and report EM, EditSim, and CodeBLEU.

All experiments share the Qwen2.5-Coder-1.5B backbone (bfloat16) and the Qwen3-Embedding-0.6B repository encoder. Code2LoRA generates rank-16 LoRA adapters with $\alpha = 32$, yielding ≈ 720 M (static) or ≈ 745 M (evolution) trainable parameters, and is trained for three epochs on a single H100 GPU. We compare against seven baselines (Pretrained, RAG, DRC, FFT, Single LoRA, Per-repo LoRA, and Text2LoRA) and report exact match (EM), EditSim, and CodeBLEU.
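For readers unfamiliar with the string-level metrics, the sketch below shows one common way to compute exact match and an edit-distance-based similarity. The exact normalization used in the paper is not given here, so the whitespace stripping and the Levenshtein-based EditSim formula are assumptions; CodeBLEU is omitted because it needs a language-specific parser.

```python
from typing import Iterable, Tuple


def exact_match(pred: str, target: str) -> bool:
    """True when prediction and target agree after trimming outer whitespace (assumed rule)."""
    return pred.strip() == target.strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(pred: str, target: str) -> float:
    """1 - normalized edit distance, in [0, 1]; one plausible EditSim definition."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, target) / longest


def em_percent(pairs: Iterable[Tuple[str, str]]) -> float:
    """Exact-match rate over (prediction, target) pairs, as a percentage."""
    pairs = list(pairs)
    return 100.0 * sum(exact_match(p, t) for p, t in pairs) / len(pairs)
```

Because assertion targets are short, a single wrong character flips EM to zero while EditSim stays high, which is why the two metrics are reported together.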

**Table.** Performance comparison of different methods on Cross-Repo (CR) and In-Repo (IR) test sets using EM (%), EditSim, and CodeBLEU metrics.

**Table 2.** Results on RepoPeftBench static track.

**Figure 6.** Per-repository EM distribution on the IR-test split of RepoPeftBench (Table 3 checkpoints; $n=389$ repositories common to all methods). Each violin shows the full distribution of per-repository EM for one method; the inner box reports the IQR and the white dot marks the median. Code2LoRA-Static (median 62.5%, $\sigma=16.8$) and Code2LoRA-Evo (median 66.7%, $\sigma=15.8$) achieve consistently high performance with substantially lower variance than per-repo LoRA (median 62.5%, $\sigma=20.9$); per-repo LoRA falls below the pretrained baseline on 10.5% of repositories versus only 1.3% and 1.8% for Code2LoRA-Static and Code2LoRA-Evo, demonstrating the regularizing effect of cross-repository knowledge transfer.

Performance Analysis

Code2LoRA variants beat baselines across static, evolution, and OOD evaluations on RepoPeftBench.

Code2LoRA‑Evo attains the highest exact‑match score on the RepoPeftBench OOD set, surpassing the next‑best fine‑tuned adapter by +1.8 pp.

Table 4 reports 74.1 % EM for Code2LoRA‑Evo versus 72.3 % for Single LoRA.

On the evolution track, Code2LoRA‑Evo reaches 60.3 % CR EM and 64.5 % IR EM, outpacing the strongest fine‑tuned baseline (Single LoRA) by +5.2 pp on CR and exceeding the per‑repo LoRA upper bound on IR without any per‑repository training.

**Table 4.** Results on RepoPeftBench OOD set.

**Table 1.** Comparison of different methods across EM (%), EditSim, and CodeBLEU metrics.

**Figure 7.** Per-repo LoRA EM vs. training-set size on IR test. Repositories with fewer than 50 training pairs frequently underperform the IR-test pretrained baseline (46.8%), while Code2LoRA-Static maintains stable performance regardless of per-repo data availability.

**Figure 8.** CR-test EM as a function of training repository count. Code2LoRA-Static benefits from repository diversity, with performance improving log-linearly.

**Figure 9.** CR-test exact-match vs. normalized commit position (51 held-out repositories, commit-derived prefixes). Each repository's timeline is scaled to 0–100%; points are qna-weighted means per 5% bin.
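The binning behind this kind of plot can be sketched as follows. The record layout, the function names, and the reading of "qna-weighted" as weighting each point by its number of QnA pairs are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def normalized_position(commit_index: int, n_commits: int) -> float:
    """Map a 0-based commit index to a 0-100% position on the repository timeline."""
    if n_commits <= 1:
        return 0.0
    return 100.0 * commit_index / (n_commits - 1)


def binned_weighted_em(
    records: Iterable[Tuple[float, float, int]], bin_width: float = 5.0
) -> Dict[float, float]:
    """Weighted mean EM per bin; each record is (position_percent, em_percent, n_qnas)."""
    n_bins = int(round(100.0 / bin_width))
    sums = [0.0] * n_bins
    weights = [0] * n_bins
    for position, em, n_qnas in records:
        idx = min(int(position // bin_width), n_bins - 1)  # 100% falls in the last bin
        sums[idx] += em * n_qnas
        weights[idx] += n_qnas
    return {
        idx * bin_width: sums[idx] / weights[idx] for idx in range(n_bins) if weights[idx] > 0
    }
```

Scaling every repository to the same 0-100% axis lets short and long histories contribute to the same bins.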

**Figure 10.** t-SNE of generated LoRA adapters for 52 CR-test repositories (PCA pre-reduction to 50 dims, then t-SNE). Color indicates per-repo Exact Match (%). Repositories with similar codebases tend to cluster together, and clusters show coherent EM ranges, demonstrating that the hypernetwork learns a smooth, semantically meaningful adapter manifold.

Limitations

We outline the current limits of Code2LoRA’s evaluation scope, OOD behavior, and safety considerations.

We evaluate only Python repositories, using a single frozen backbone (Qwen2.5‑Coder‑1.5B) and one downstream task—assertion completion from pytest/unittest suites. Although the architecture is designed to be language‑ and task‑agnostic, extending empirical evidence to other languages, backbones, and tasks is left for future work.

The OOD exact‑match (EM) score of 74.1 % may be partially inflated because assertion targets in post‑cutoff OOD repositories are systematically shorter (median 7 characters) than in the CR/IR test set (median 12–13). Consequently, we emphasize the within‑OOD comparison, where Code2LoRA‑Evo leads the next‑best fine‑tuned adapter by roughly 1.8 pp EM.

Exact‑match accuracy misses functional equivalence, so we also report EditSim, CodeBLEU, and a pytest‑based execution probe on a runnable CR‑test slice. A more semantic evaluation—executing every generated assertion against the project’s test runtime—is a natural extension but was out of scope for this submission’s compute budget.

The LoRA‑generation hypernetwork dominates the trainable parameter count (≈720 M for Code2LoRA‑Static and ≈745 M for Code2LoRA‑Evo), scaling with the backbone’s projection dimensions. While the evolution‑track finding is directly supported at the 1.5 B‑parameter scale, it remains an open question whether recurrent aggregation over commit diffs stays necessary—or sufficient—once the backbone grows substantially.

All code, the RepoPeftBench benchmark, and hyperparameters (see Appendix D) will be released upon acceptance; experiments were run on a single H100 80 GB GPU. This reproducibility statement underscores that the reported results are tied to a specific hardware configuration.

RepoPeftBench is built exclusively from publicly available, permissively licensed Python repositories, so the dataset does not introduce new personal data, harmful content, or proprietary code. Nevertheless, the downstream artifact—a repository‑conditioned LoRA—inherits the well‑understood risks of code LLMs: insecure, incorrect, or license‑resembling completions, with amplified attribution risk if a private repository is used. We make no safety claims for production deployment without standard mitigations such as license‑aware filtering, human review of generated assertions, and rejection of verbatim training spans.

Dependency-Resolved Context

Ablation studies quantify the effect of Dependency‑Resolved Context on Code2LoRA.

Code2LoRA relies on repository‑level context; this appendix measures how the Dependency‑Resolved Context (DRC) component contributes to that capability.

DRC is like a librarian who, before you start reading a chapter, fetches every definition that the chapter’s imports reference, so the model sees the exact code it will need.

How does DRC differ from simply concatenating the whole repository?

Concatenating the entire repo would far exceed the $8\text{K}$‑token limit and introduce irrelevant noise. DRC selects only the definitions reachable from the imports used in the prefix and compresses them, guaranteeing both relevance and budget compliance.
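A rough sketch of the selection step, under several assumptions: the prefix parses as valid Python, `definition_index` is a hypothetical precomputed map from symbol name to its source text, and tokens are approximated by whitespace splitting. The compression step mentioned above is not modeled; definitions that would exceed the budget are simply skipped.

```python
import ast
from typing import Callable, Dict, List


def imported_names(tree: ast.AST) -> List[str]:
    """Names bound by `import x` / `from m import x` statements, in traversal order."""
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            names.extend(alias.asname or alias.name for alias in node.names)
        elif isinstance(node, ast.Import):
            names.extend(alias.asname or alias.name.split(".")[0] for alias in node.names)
    return names


def build_drc(
    prefix_source: str,
    definition_index: Dict[str, str],
    budget_tokens: int = 4096,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Collect definitions of imported names that the prefix actually uses, within a budget."""
    tree = ast.parse(prefix_source)
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    selected, spent = [], 0
    for name in imported_names(tree):
        if name not in used or name not in definition_index:
            continue
        snippet = definition_index[name]
        cost = count_tokens(snippet)
        if spent + cost > budget_tokens:
            continue  # skip definitions that would overflow the budget
        selected.append(snippet)
        spent += cost
    return "\n\n".join(selected)
```

The default budget mirrors the 4K-token DRC budget mentioned in the Table 9 caption; a real pipeline would count tokens with the model tokenizer rather than by whitespace.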

The token overhead means that models using DRC must operate with longer sequences, which modestly increases compute but yields substantially richer context for the majority of test cases.

**Table 9.** Training hyperparameters. The “+DRC” column shares all settings with Code2LoRA-Static and adds a 4K-token dependency-resolved context budget injected ahead of the prefix. The commit-derived results in Tables 3–4 use analogous V2 trainers (1 epoch, batch 1, grad-accum 16, max seq 4,096); see §D.5 and the released code for full details.

Compute-resource accounting shows that adding DRC raises the training cost of Code2LoRA-Static from 17 h to 18 h on a single H100 80 GB GPU, a modest increase relative to the baseline.

Overall, the ablations confirm that DRC supplies useful repository definitions for roughly two‑thirds of examples while incurring only a small token and compute overhead, validating its design as a lightweight context‑augmentation mechanism.

Conclusion

The conclusion synthesizes the comparative findings and outlines future directions.

Across both the static and evolution tracks, parametric adaptation consistently beats context injection, with the smallest performance drift among fine-tuned methods. On the out-of-distribution set of 92 post-cutoff repositories, Code2LoRA-Evo attains the highest exact match score (74.1 %), surpassing Code2LoRA-Static (72.2 %) and Single LoRA (72.3 %). Because the OOD assertions are notably shorter than in-distribution ones, the within-OOD comparison is more informative than the absolute score.

We introduced the Code2LoRA hypernetwork framework and the RepoPeftBench benchmark, offering two scenarios: a static adapter yielding 63.8 % CR / 66.2 % IR EM, and an evolutionary adapter achieving 60.3 % CR / 64.5 % IR EM. These experiments confirm that injecting repository knowledge parametrically and updating it to follow software evolution outperforms long‑input context approaches. We envision Code2LoRA becoming a modular component for richer, cheaper AI code assistants.

Appendix: Dataset Details

Dataset construction, splits, and analysis for RepoPeftBench.

Our assertion‑completion task mirrors the code‑execution probe of LiveCodeBench, but replaces hand‑curated snippets with real assertions extracted from test suites. This preserves the multi‑step, type‑aware reasoning demand while tying each prediction to a full repository’s API, naming conventions, and fixtures.

We collected repositories via the GitHub search API using the query `language:python license:mit stars:>=300 pushed:>=2023-01-01`, then filtered for pytest/unittest usage. An OOD holdout was built from repositories created after 2025-04-01, without the star threshold, yielding 92 additional repositories, some of which may carry Apache-2.0 licenses.

Test files are recognized by common naming patterns (e.g., `test_*.py`) or placement in `tests/` directories and are moved to a dedicated `TEST_HYPERNET/` folder. For each QnA we prepend a structured prefix: all imports, the enclosing class (if any), helper methods, and the test function header up to the assertion point.
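The two steps described here, spotting test files and cutting a test at its assertion point, can be approximated as below. This is a simplified sketch: the function names are hypothetical, only the plain `assert` form is handled, and the real prefix (imports, enclosing class, helpers) is more structured than the raw source slice used here.

```python
import ast
import fnmatch
from pathlib import PurePosixPath
from typing import Optional, Tuple


def is_test_file(path: str) -> bool:
    """Heuristic: a Python file named test_*.py or located under a tests/ directory."""
    p = PurePosixPath(path)
    if p.suffix != ".py":
        return False
    if fnmatch.fnmatch(p.name, "test_*.py"):
        return True
    return "tests" in p.parts[:-1]


def split_at_assertion(test_source: str) -> Optional[Tuple[str, str]]:
    """Return (prefix ending with 'assert ', asserted expression) for the first assert."""
    tree = ast.parse(test_source)
    asserts = sorted(
        (n for n in ast.walk(tree) if isinstance(n, ast.Assert)),
        key=lambda n: (n.lineno, n.col_offset),
    )
    if not asserts:
        return None
    node = asserts[0]
    lines = test_source.splitlines()
    line = lines[node.lineno - 1]
    cut = node.col_offset + len("assert ")
    prefix = "\n".join(lines[: node.lineno - 1] + [line[:cut]])
    target = ast.get_source_segment(test_source, node.test)
    return prefix, target
```

The model would then be asked to continue `prefix`, and its output compared against `target`.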

The experimental splits follow the configurations listed in Tables 2–4 of the main paper; they are reproduced here for completeness.

Table 6 shows that the simple assert keyword dominates (≈ 82–86 % of pairs) and that target‑type frequencies differ by at most two percentage points across train, CR, and IR splits, ruling out distribution shift as the cause of cross‑repo performance gaps.

**Table 5.** Fine-grained statistics for every split actually consumed by the main tables. Static track: one anchor snapshot per repository (rows feed Table 2). Evolution track: multi-commit prefixes (rows feed Tables 3 and 4); the smart cap ($\le 4$ QnAs per test file, $\le 8$ per commit) is applied to Code2LoRA-Evo training rows so that no commit can dominate a backprop window.
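A possible implementation of the cap is sketched below. The caption gives only the limits, so two points are assumptions: the per-file limit is applied within each commit, and pairs are kept in first-come order (the actual selection rule may be smarter). The record keys `commit` and `test_file` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def apply_smart_cap(
    qnas: Iterable[Dict[str, str]], per_file: int = 4, per_commit: int = 8
) -> List[Dict[str, str]]:
    """Keep at most `per_file` QnAs per (commit, test file) and `per_commit` per commit."""
    file_counts = defaultdict(int)
    commit_counts = defaultdict(int)
    kept = []
    for qna in qnas:
        commit = qna["commit"]
        file_key = (commit, qna["test_file"])
        if file_counts[file_key] >= per_file or commit_counts[commit] >= per_commit:
            continue
        file_counts[file_key] += 1
        commit_counts[commit] += 1
        kept.append(qna)
    return kept
```

Capping per commit bounds how many training rows a single large commit can contribute to one backpropagation window.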

Token‑length statistics are detailed in Table 7: repositories are large (median 165 K tokens), DRC contexts are moderate but heavy‑tailed (median 517 tokens), prefixes are compact (median 224 tokens), and targets are very short (median 3 tokens).

**Table 7.** Token length statistics across the 62,294 static-track QnAs (Qwen2.5-Coder-1.5B tokenizer). Repo size is the total token count of all Python source files per repository (repeated per pair). DRC statistics are over the 64.1% of pairs with resolvable dependency context.

For per‑repository analysis we provide a full table (supplementary) covering 409 IR‑test repositories with EM, EditSim, CodeBLEU, and example counts for each method, summarized in Figures 6 and 7.

**Table 6.** Distribution of assertion types and target types across Train, CR Test, and IR Test sets.

**Table 8.** RAG ablation over chunk size and k on CR and IR test. Top: pretrained + RAG; bottom: trained models + RAG at inference.

All non‑test source files and test files are redistributed under their original permissive licenses (MIT or Apache‑2.0); no private code, commit messages, or issue discussions are included, and we rely on the upstream licenses to permit research‑use redistribution.

Supporting Analysis

Additional analyses and tables that support the main results.

This appendix expands the core experiments with deeper diagnostics, scaling studies, and efficiency measurements.

**Table 10.** Effect of training-repository count on CR-test EM.

**Figure 11.** Comparison of per-module weight norms. Top: Code2LoRA-Static generates repo-specific LoRA adapters with varying weight distributions across module types. Bottom: FFT+DRC applies a uniform weight delta. Code2LoRA-Static's structured, repo-specific adaptations explain its stronger cross-repo performance.
