How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

A power law for LLM memory capacity reveals that token-level probability thresholds govern verbatim recall.

How do LoRA rank and memory length interact to govern an LLM's ability to memorize exact, verbatim information?

Standard fine-tuning treats all tokens equally, often failing to achieve the exact, verbatim recall required for reliable knowledge updates. The authors use Low-Rank Adaptation (LoRA) as a controlled probe to establish a power law linking parameter budget and sequence length to memory gain, while identifying a critical probability threshold of $p > 0.5$ for deterministic token recall. They introduce Memorization-oriented Fine-Tuning (MemFT), which dynamically reallocates the training budget to prioritize "stubborn" tokens that fall below this threshold. This approach significantly improves memory fidelity and efficiency compared to standard fine-tuning, achieving perfect recall in constrained parameter regimes.

Paper Primer

The paper frames exact parametric memory as a "pure parameter writing" task, decoupling it from downstream comprehension. By treating LoRA rank as a monotone capacity knob, the authors isolate the relationship between parameter budget, sequence length, and loss reduction.

The Parametric Memory Law accurately predicts memory capacity across diverse semantic and random-token distributions.

The empirical model $\Delta L = C \cdot r^{\alpha} \cdot \ell^{-\beta} + b$ achieves $R^2 > 0.98$ across multiple LLM architectures and memory benchmarks. The law holds robustly regardless of semantic density, providing a unified scaling characterization for both long-context and short-text memory.

MemFT consistently outperforms standard fine-tuning in exact recall fidelity.

In the Long-Context Memorization Stress Test, MemFT variants achieve 100% accuracy in high-rank settings where standard fine-tuning fails to saturate. MemFT-SW maintains a stable lead in factual recall benchmarks, reaching perfect exact-match accuracy significantly faster than standard methods.

Why is average cross-entropy loss an unreliable metric for evaluating exact memory?

Average loss masks local bottlenecks where token probabilities remain below the $p=0.5$ threshold. Because autoregressive generation is sensitive to single-token errors, these "stubborn" positions trigger cascading failures that collapse the entire sequence despite low global loss.

What is the significance of the $p=0.5$ threshold in this framework?

It represents a deterministic phase transition: when the target token probability exceeds 0.5, it is guaranteed to be the most probable candidate under greedy decoding. Tokens below this threshold exist in a "disordered phase" where they are susceptible to being outcompeted by incorrect candidates.

Researchers should shift from uniform fine-tuning to threshold-guided optimization when the goal is exact, verbatim recall. Prioritizing tokens that have not yet crossed the deterministic phase transition is a more efficient use of parameter budget than global loss minimization.

Introduction

We expose why standard fine‑tuning fails at verbatim recall and introduce a scaling law for LoRA‑based memory.

Standard fine‑tuning spreads gradient updates evenly over every token, so the model never allocates extra capacity to the few tokens that must be memorized verbatim. Consequently, large language models struggle to store exact text despite their huge parameter counts.

Parametric Memory refers to the portion of the model’s latent representation that can be deliberately written to and later retrieved by a low‑rank adaptation (LoRA) module. It acts as an internal “scratchpad” where factual snippets can be encoded directly into weights.

Increasing the LoRA rank $r$ expands the size of the internal scratchpad, while longer memory sequences $\ell$ dilute each token’s share of that capacity; the loss reduction $\Delta L$ follows a simple power‑law trade‑off.

At the token level, a prediction probability $p>0.5$ is sufficient to lock a token into a stable state under greedy decoding; below this threshold the token competes with alternatives, leading to high‑entropy cascades. It is like a lock that clicks only when the key is turned past the halfway point—half‑turns leave it wobbling and prone to slip.

**Figure 1.** LoRA as a pluggable memory unit in the LLM's latent space. The LoRA module (rank $r$) encodes contextual knowledge into the residual stream at layer $k$, enabling faithful recall of memorized text. The Parametric Memory Law quantifies the capacity-parameter trade-off.

LoRA functions as a lightweight, plug‑in memory unit whose capacity obeys a simple power law, enabling precise budgeting of parametric memory.

Task Setup and Definitions

Defines the exact parametric memory task and the LoRA injection mechanism.

The memorization problem is cast as pure parameter writing: a frozen base model receives a low‑rank update that must encode each answer verbatim. By stripping away any retrieval or contextual reasoning, the task isolates the memory capacity of the parameter budget.

Given a frozen model, we learn a tiny additive weight block that stores each answer directly in the parameters, so that feeding the corresponding key query reproduces the answer exactly.

Initialize $\Delta\theta$ to zero; the model outputs garbage for both answers.

During training, gradient updates modify the $1\times d_{\text{in}}$ matrix $A$ and the $d_{\text{out}}\times 1$ matrix $B$ so that $f_{\theta}(q^{(1)})$ matches its three target tokens.

After a few epochs, feeding $q^{(1)}$ yields the exact three‑token sequence $a^{(1)}$, while $q^{(2)}$ still fails.

Continuing training adjusts the same low‑rank branch to also store $a^{(2)}$, achieving perfect recall for both pairs.

Even a single‑rank update can memorize multiple short answers because the residual branch is applied at every layer, effectively distributing the stored information throughout the network.

Standard fine‑tuning trains all model parameters with a uniform cross‑entropy loss over every token, treating each token equally regardless of its role as key or answer.

Metrics are computed only on answer tokens using greedy decoding, which yields a deterministic output $\hat a_t=\arg\max_{v\in V}p_{\theta}(v\mid q,a_{<t})$. We track three complementary measures: the sequence‑averaged cross‑entropy loss $L$, token‑level accuracy $\text{Acc}_{\text{tok}}$, and exact‑match accuracy $\text{Acc}_{\text{EM}}$.
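The three measures can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the input format (one token-to-probability dictionary per answer position, obtained under teacher forcing) and the function name `answer_metrics` are assumptions made for the example.

```python
import math

def answer_metrics(step_probs, targets):
    """Return (loss, token accuracy, exact match) for one answer.

    step_probs: list of dicts {token: probability}, one per answer position
                (hypothetical input format, teacher-forced distributions).
    targets:    list of target tokens, same length as step_probs.
    """
    losses, correct = [], 0
    for probs, target in zip(step_probs, targets):
        p_target = probs.get(target, 0.0)
        # Per-token cross-entropy; clamp to avoid log(0).
        losses.append(-math.log(max(p_target, 1e-12)))
        # Greedy decoding picks the most probable token at this position.
        greedy = max(probs, key=probs.get)
        correct += int(greedy == target)
    n = len(targets)
    loss = sum(losses) / n                  # sequence-averaged cross-entropy
    acc_tok = correct / n                   # fraction of correct positions
    acc_em = 1.0 if correct == n else 0.0   # all positions must be correct
    return loss, acc_tok, acc_em

# Toy answer of three tokens; the second position is predicted wrongly.
step_probs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.4, "b": 0.6},
    {"c": 0.8, "a": 0.2},
]
loss, acc_tok, acc_em = answer_metrics(step_probs, ["a", "a", "c"])
```

In this toy case two of three positions are correct, so token accuracy is 2/3 while exact match is 0: a single wrong position is enough to fail the whole answer.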

LoRA adds a low‑rank residual $BAx$ to the output of a frozen linear layer; the rank $r$ sets the number of trainable parameters and therefore how much new information can be written into the model.
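A minimal sketch of the low-rank residual, written with plain Python lists so it runs without dependencies. The helper names (`matvec`, `lora_forward`, `lora_param_count`) and the toy matrices are assumptions for illustration, and the optional scaling factor that many LoRA implementations apply is left out.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """y = W x + B (A x).

    W: d_out x d_in frozen weight, A: r x d_in, B: d_out x r.
    Only A and B would be trained; W stays fixed.
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one LoRA branch of rank r."""
    return r * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight (toy values)
A = [[1.0, 1.0]]               # rank 1: shape 1 x d_in
B = [[0.0], [0.0]]             # zero update: shape d_out x 1
x = [2.0, 3.0]

y_before = lora_forward(W, A, B, x)   # identical to W x while B is zero
B = [[0.5], [-0.5]]                   # pretend training changed B
y_after = lora_forward(W, A, B, x)    # frozen output plus low-rank residual
```

The parameter count grows linearly in $r$, which is why the rank can serve as a monotone capacity knob.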

The Parametric Memory Law

Derives a power‑law linking LoRA rank, sequence length, and loss reduction.

Standard fine‑tuning spreads gradients uniformly, so practitioners lack a principled way to predict how increasing LoRA rank $r$ or shortening the context $\ell$ will improve memorization. Our experiments on Qwen3‑8B‑IT and Llama3.1‑8B‑IT expose a systematic scaling pattern that resolves this uncertainty.

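As a worked example with illustrative values $C=1$, $r=2$, $\alpha=1$, $\ell=4$, $\beta=1$, and $b=0.5$, first compute the rank term: $r^{\alpha}=2^{1}=2$.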

Compute the length term: $\ell^{-\beta}=4^{-1}=0.25$.

Multiply: $C \cdot r^{\alpha} \cdot \ell^{-\beta}=1 \times 2 \times 0.25 = 0.5$.

Add the baseline: $0.5 + b = 0.5 + 0.5 = 1.0$.

Thus $\Delta L = 1.0$, meaning the loss drops by one full unit under these settings.

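With these illustrative exponents ($\alpha=\beta=1$), doubling the rank exactly offsets doubling the length: $C \cdot (2r)^{\alpha} \cdot (2\ell)^{-\beta} = C \cdot r^{\alpha} \cdot \ell^{-\beta}$. In general, the exponents $\alpha$ and $\beta$ set the exchange rate between rank and length.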
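The arithmetic above can be reproduced with a small helper. The function name `memory_gain` is an assumption for this sketch, and the constants are the illustrative values from the worked example rather than fitted coefficients.

```python
def memory_gain(r, length, C, alpha, beta, b):
    """Evaluate Delta L = C * r**alpha * length**(-beta) + b."""
    return C * (r ** alpha) * (length ** -beta) + b

# Illustrative values from the worked example (not fitted coefficients).
gain = memory_gain(r=2, length=4, C=1.0, alpha=1.0, beta=1.0, b=0.5)
print(gain)  # 1.0

# The same helper can tabulate any rank/length trade-off of interest.
for r in (1, 2, 4):
    for length in (2, 4, 8):
        print(r, length, memory_gain(r, length, 1.0, 1.0, 1.0, 0.5))
```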

We gathered data by sweeping $r$ and $\ell$ across the two benchmark families: a Long‑Context Memorization Stress Test that mixes random tokens at varying ratios, and a Short‑Context Dense Memory Test using Phone‑Book key‑value pairs. Samples where the final loss fell below $0.69$ were discarded to avoid saturated regimes that would mask scaling effects.

**Figure 2.** Empirical validation of the Parametric Memory Capacity Law (LoRA on Qwen3-8B). (a) $\Delta\mathcal{L}$ exhibits approximate log-linear decay with respect to rank $r$ and length $\ell$, forming a nearly planar structure in log-space; (b) The scatter plot compares predicted $\Delta\mathcal{L}$ from Eq. (6) against true values, showing high fidelity ($R^2 = 0.996$); (c) Heatmaps plot the final loss and token-level accuracy (correct tokens / total length) across various $(r, \ell)$ settings, revealing numerous cases where loss approaches zero while accuracy remains near zero.

Table 1 quantifies the fit quality: both models achieve $R^{2}>0.98$ and low MAPE across all random‑token mixtures and the Phone‑Book benchmark, confirming that the Parametric Memory Law holds uniformly across diverse data regimes.
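One way such a fit could be obtained is ordinary least squares in log space. The sketch below is an assumption about procedure, not the authors' method: it treats the offset $b$ as known in advance, fits $\log(\Delta L - b) = \log C + \alpha \log r - \beta \log \ell$, and reports $R^2$ in log space. The data are synthetic and noise-free, generated from made-up coefficients.

```python
import math

def solve3(M, v):
    """Solve a 3x3 system M x = v by Gaussian elimination with pivoting."""
    A = [row[:] + [rhs] for row, rhs in zip(M, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(A[i][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for i in range(col + 1, n):
            f = A[i][col] / A[col][col]
            for j in range(col, n + 1):
                A[i][j] -= f * A[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (A[i][n] - s) / A[i][i]
    return x

def fit_power_law(samples, b):
    """samples: list of (r, length, delta_L). Returns (C, alpha, beta, r2).

    Assumes b is known and delta_L > b for every sample.
    """
    # Design matrix columns: intercept, log r, -log length.
    X = [[1.0, math.log(r), -math.log(l)] for r, l, _ in samples]
    y = [math.log(d - b) for _, _, d in samples]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)]
           for i in range(3)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]
    c0, alpha, beta = solve3(XtX, Xty)
    pred = [c0 + alpha * row[1] + beta * row[2] for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return math.exp(c0), alpha, beta, 1.0 - ss_res / ss_tot

# Synthetic, noise-free data from made-up coefficients.
true_C, true_alpha, true_beta, b = 2.0, 0.5, 0.8, 0.1
samples = [(r, l, true_C * r ** true_alpha * l ** -true_beta + b)
           for r in (2, 4, 8, 16) for l in (50, 100, 200)]
C, alpha, beta, r2 = fit_power_law(samples, b)
```

If $b$ is not known, it would need to be estimated jointly, for example with a nonlinear least-squares routine; that choice is outside what the summary specifies.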

Token-Level Memory Dynamics

Identifies a loss threshold that separates random from reliable memorization.

The average cross‑entropy loss can be near zero while token‑level accuracy remains poor because loss smooths over hard tokens. Persistent low‑probability tokens ($p < 0.5$) create bottlenecks that trigger catastrophic autoregressive collapse.
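A toy calculation shows how a low average can hide a single stubborn position. The probabilities below are invented for illustration, and `stubborn_report` is a hypothetical helper name.

```python
import math

def stubborn_report(target_probs, threshold=0.5):
    """Return (average loss, stubborn positions, first stubborn position).

    target_probs: teacher-forced probability of the correct token at each
                  position. Positions below the threshold are not guaranteed
                  to win under greedy decoding.
    """
    losses = [-math.log(p) for p in target_probs]
    avg_loss = sum(losses) / len(losses)
    stubborn = [i for i, p in enumerate(target_probs) if p < threshold]
    first = stubborn[0] if stubborn else None
    return avg_loss, stubborn, first

# 99 confidently memorized tokens and one stubborn token at index 40.
probs = [0.99] * 99
probs.insert(40, 0.30)
avg_loss, stubborn, first = stubborn_report(probs)
```

Here the average loss is roughly 0.02, far below the 0.693 barrier, yet position 40 can still be lost to a competing token, after which every later token is conditioned on a wrong prefix.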

Greedy decoding picks the highest‑probability token, so a target token that holds more than half the probability mass always wins; crossing the 0.5 probability line flips memory from unreliable to guaranteed.

Why not use a higher confidence margin (e.g., p > 0.7) instead of the 0.5 threshold?

Because greedy decoding only requires the target to be the most probable token. Any probability above 0.5 already guarantees uniqueness; raising the margin would be a stricter heuristic that discards many configurations that already succeed deterministically.

Crossing the per‑token loss barrier $L_t < \ln 2 \approx 0.693$ guarantees deterministic memory success at that position under greedy decoding.

Derived from the probability dominance condition $p > 0.5$ and the identity $L = -\log p$ (natural logarithm).
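The two steps can be written out explicitly. Here $p_t$ denotes the probability assigned to the correct token $a_t$ at position $t$ and $L_t$ its cross-entropy; both symbols are shorthand introduced for this derivation. Since probabilities over the vocabulary sum to one,

$$
p_t > \tfrac{1}{2}
\;\Longrightarrow\;
\max_{v \neq a_t} p_{\theta}(v \mid q, a_{<t}) \;\le\; 1 - p_t \;<\; \tfrac{1}{2} \;<\; p_t
\;\Longrightarrow\;
\hat a_t = a_t ,
$$

and because $-\ln$ is strictly decreasing,

$$
p_t > \tfrac{1}{2}
\;\Longleftrightarrow\;
L_t = -\ln p_t \;<\; -\ln \tfrac{1}{2} = \ln 2 \approx 0.693 .
$$

The condition is sufficient rather than necessary: a token with $p_t \le 0.5$ can still be the argmax if no competitor is larger, but that outcome is no longer guaranteed. The guarantee also applies position by position, so a sequence-averaged loss below $0.693$ does not by itself imply that every token has crossed the threshold.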

**Figure 4.** Training convergence of Qwen3-8B on the Random / Long-Context Memorization Stress Test. Each subplot corresponds to a fixed memory length and LoRA rank. The overlaid curves represent different random-token mixture settings. The consistent decrease and stabilization of training loss indicate that the LoRA adapters are sufficiently optimized across the full sweep.

**Figure 5.** Training convergence of Llama3.1-8B on the Random / Long-Context Memorization Stress Test. The figure follows the same layout as Figure 4, with each subplot corresponding to a fixed length-rank configuration. The curves show stable optimization behavior across random-token mixture settings, supporting the reliability of the subsequent accuracy comparisons.

**Figure 10.** Per-position probability grid for the Long-Context Memorization Stress Test Random 100% scenario with $r \in \{8, 10, 12, 14, 16\}$. This rank range aligns with the LongBench-mixed scenarios for direct comparison.

**Figure 11.** Per-position probability grid for the Random (100%) scenario with $r \in \{48, 64, 128, 256, 512\}$.

**Figure 12.** Per-position probability grid for the Long-Context Memorization Stress Test Random 20% scenario. With 80% semantically coherent tokens from LongBench, the model memorizes more easily and stubborn positions appear only at longer lengths.

**Figure 13.** Per-position probability grid for the Long-Context Memorization Stress Test Random 60% scenario. With 40% semantically coherent tokens, the difficulty is intermediate between the Random 100% and Random 20% settings, and stubborn positions emerge at shorter lengths compared to Figure 12.

**Figure (c).** Localization of failure positions

MemFT Methodology

MemFT reshapes fine‑tuning by weighting loss toward tokens that have not yet been memorized.

Standard SFT spreads gradient updates uniformly, wasting capacity on tokens that are already memorized. The core trick is to concentrate learning on tokens that remain in the uncertain regime.

MemFT replaces the uniform cross‑entropy loss with a token‑weighted objective that amplifies gradients for hard‑to‑memorize tokens while suppressing updates on already‑memorized ones.

How does MemFT differ from standard SFT’s uniform loss weighting?

Standard SFT treats every token equally, multiplying each cross‑entropy loss by the same scalar. MemFT introduces per‑token weights $w_t$ that zero‑out already‑memorized tokens and amplify the loss on uncertain tokens, so gradients are directed where they are needed.

Compute the threshold mask $w^{\text{TH}}_t = \mathbf{1}[L_t > 0.6]$, yielding $w = [0,\,1,\,0,\,1,\,0]$.

Weighted numerator $\sum_t w_t L_t = 0.8 + 1.2 = 2.0$.

Weighted denominator $\sum_t w_t + \varepsilon = 2 + 0.01 = 2.01$.

MemFT loss $L_{\text{MemFT}} = 2.0 / 2.01 \approx 0.995$.

By discarding the easy tokens (loss < $L_{\text{crit}}$), MemFT concentrates the gradient on the two hard tokens, yielding a sharper training signal without changing the overall scale.
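The worked example can be checked with a short sketch of the threshold-masked loss. The function name and the losses of the three easy tokens (0.1, 0.3, 0.2) are assumptions; only the two hard-token losses, the threshold of 0.6, and $\varepsilon = 0.01$ come from the example above.

```python
def memft_threshold_loss(token_losses, l_crit, eps=0.01):
    """Threshold-masked average: sum(w_t * L_t) / (sum(w_t) + eps).

    w_t is 1 for tokens whose loss exceeds l_crit and 0 otherwise, so
    already-memorized tokens contribute nothing to the objective.
    """
    weights = [1.0 if lt > l_crit else 0.0 for lt in token_losses]
    numerator = sum(w * lt for w, lt in zip(weights, token_losses))
    denominator = sum(weights) + eps
    return numerator / denominator, weights

# Easy-token losses are assumed; hard-token losses match the example.
token_losses = [0.1, 0.8, 0.3, 1.2, 0.2]
loss, weights = memft_threshold_loss(token_losses, l_crit=0.6)
print(round(loss, 3))  # 0.995
```

For comparison, the uniform average over the same five tokens is 0.52, so the masked objective roughly doubles the signal attributed to the two hard tokens.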

**Table 4.** Representative exact-memory scenarios across 8 domains. All tasks require verbatim recall because even minor deviations, such as a single-character, punctuation, or formatting error, can invalidate the target, alter its operational meaning, or introduce legal/security risks.

Experimental Results

We recap the memory bottleneck and then show how MemFT closes the gap.

Standard fine‑tuning spreads gradient updates uniformly, so large models cannot fully exploit the scaling laws that limit memory capacity.

The Long‑Context Memorization Stress Test is a synthetic benchmark that fills a sequence with random tokens, removing linguistic cues, to probe the raw parametric memory limit of a model.

The PhoneBook benchmark is a short‑text lookup task where the model must memorize explicit key‑value pairs (e.g., name‑to‑phone‑number) and reproduce them exactly.

MemFT‑SW attains perfect exact‑match recall (100 % EM) at the smallest LoRA rank where SFT still lags, namely rank 6 for Qwen3‑8B and rank 7 for Llama3.1‑8B.

Table 2 shows MemFT‑SW reaches 100 % EM at $p_6$ (Qwen) and $p_7$ (Llama) while SFT peaks at 92 % under the same budgets.

**Figure 7.** Exact-match accuracy of Qwen3-8B on the Random / Long-Context Memorization Stress Test. Each subplot corresponds to one LoRA rank, the x-axis denotes memory length, and the y-axis denotes exact-match accuracy. Curves compare SFT, MemFT-OT, and MemFT-SW, showing how each method scales with increasing memory length under a fixed rank.

**Figure 8.** Exact-match accuracy of Llama3.1-8B on the Random / Long-Context Memorization Stress Test. The layout matches Figure 7: each subplot fixes one LoRA rank and compares SFT, MemFT-OT, and MemFT-SW across memory lengths.

**Figure 9.** Exact-match accuracy on the PhoneBook benchmark for Qwen3-8B and Llama3.1-8B. The upper panels correspond to Qwen3-8B and the lower panels correspond to Llama3.1-8B. Each subplot fixes one LoRA rank and plots exact-match accuracy as a function of answer-token length. Curves compare SFT, MemFT-OT, and MemFT-SW, providing a full view of how model, rank, length, and training method jointly affect short key-value memorization.

Table 3 further demonstrates that MemFT improves both memory retention and downstream generalization on the Linear Rule Learning benchmark, with gains of up to +15 % in generalization at the smallest rank.

Analysis and Generalization

Limits of exact recall and broader effects of MemFT.

LLMs struggle with exact memorization because standard fine‑tuning spreads gradients uniformly, ignoring the scaling laws that bind LoRA rank and sequence length. MemFT addresses this by shifting gradient budget from tokens that are already reliably recalled to those that sit below the deterministic recall threshold, thereby expanding the model’s exact‑memory capacity under limited LoRA rank.

**Figure 6.** Training convergence of Llama3.1-8B on the PhoneBook benchmark. Each subplot corresponds to a fixed answer-token length and LoRA rank. The curves compare SFT and MemFT-OT under the same configuration, showing that the PhoneBook runs are sufficiently optimized before evaluating exact-match recall.

To test whether focusing on exact recall harms broader learning, we introduced a Linear Rule Learning benchmark. MemFT maintains or improves performance on this task, demonstrating that the reallocation strategy does not sacrifice generalization while boosting verbatim recall.

Related Work

We situate our work among non‑parametric and parametric memory approaches, highlighting LoRA‑based probing.

Related work splits LLM memory strategies into non‑parametric approaches that retrieve external information at inference time, and parametric approaches that embed knowledge directly in model weights. The former are limited by fixed context windows and attention dilution, while the latter lack quantitative analysis of capacity. Below we catalogue representative methods in each class.

In‑context learning generates responses by conditioning on a prompt that contains examples, effectively using the model’s context as a non‑parametric memory.

Retrieval‑augmented generation augments generation with an external datastore that is queried at inference time, retrieving relevant documents to condition the model.

Memory‑augmented models are systems that attach dedicated memory modules (e.g., key‑value stores, differentiable neural computers) to language models, allowing read/write operations during inference.

Long‑context methods are architectural or algorithmic modifications (e.g., sparse attention, sliding‑window attention) that extend the effective context length of transformers.

Parametric memory encodes knowledge directly in model weights or modular parameter structures, enabling retrieval‑free reasoning.

LoRA‑based memory uses low‑rank adaptation modules as plug‑in memory slots that can be trained to encode new facts while keeping the base model unchanged.

Appendix

Appendix details dataset construction, LoRA setup, and curriculum hyperparameters.

We evaluate exact parametric memory with two controlled benchmarks. The Long‑Context Memorization Stress Test injects synthetic key‑value pairs into random LongBench sequences and replaces 0 %–100 % of tokens with random vocabulary items to vary semantic coherence. The PhoneBook benchmark is reduced to pure key‑value pairs by dropping the context field and deduplicating keys so each query maps to a unique phone number.
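The random-token mixing step could look roughly like the sketch below. It is an assumption about the construction, not the released pipeline: the function name, the uniform sampling of positions and replacement ids, and the seed handling are all illustrative.

```python
import random

def mix_random_tokens(token_ids, ratio, vocab_size, seed=0):
    """Replace a fraction `ratio` of positions with random vocabulary ids.

    Returns the mixed sequence and the sorted list of replaced positions.
    A replacement may coincide with the original id by chance.
    """
    rng = random.Random(seed)
    n_replace = round(ratio * len(token_ids))
    positions = rng.sample(range(len(token_ids)), n_replace)
    mixed = list(token_ids)
    for pos in positions:
        mixed[pos] = rng.randrange(vocab_size)
    return mixed, sorted(positions)

original = list(range(100, 110))   # stand-in for a tokenized passage
mixed, replaced = mix_random_tokens(original, ratio=0.6, vocab_size=50)
```

A ratio of 0.0 leaves the passage untouched, while 1.0 corresponds to the fully random setting.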

All experiments freeze the base model and train only LoRA adapters, treating LoRA rank as the primary parameter budget. For the stress test we attach LoRA to the MLP down‑proj module, while for PhoneBook we attach a single LoRA block to the entire MLP. For the stress test, the selected layers are 20/24 for Qwen‑3‑8B‑Instruct and 18/20 for Llama‑3.1‑8B‑Instruct; for PhoneBook, they are layer 24 and layer 18, respectively.

Rank‑wise results are reported as averages over predefined answer‑only length buckets. For the stress test the bucket set is $L_{\text{Long}}=\{50,100,200,500,1000,2000,3000,4000,5000,6000,7000,8000,10000\}$; for PhoneBook the set is $L_{\text{PB}}=\{1\text{k},2\text{k},4\text{k},8\text{k},12\text{k},16\text{k},24\text{k},32\text{k}\}$.

PhoneBook training uses an Inter‑Batch Temporal Curriculum with fixed exposure ratios $[0.2,0.4,0.6,0.8,1.0]$; the epoch boundaries listed in Tables 5 and 6 dictate when the curriculum advances to the next ratio. Because Qwen and Llama tokenizers produce different answer‑only token counts, the schedules are reported separately for each model.
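A schedule of this kind can be sketched as a lookup from epoch to exposure ratio. Two assumptions are made here: that "exposure ratio" means the fraction of key-value pairs visible at a given stage, and the epoch boundaries `[10, 20, 30, 40]`, which are placeholders rather than values from Tables 5 and 6.

```python
from bisect import bisect_right

EXPOSURE_RATIOS = [0.2, 0.4, 0.6, 0.8, 1.0]

def exposure_ratio(epoch, boundaries):
    """Return the exposure ratio active at `epoch`.

    boundaries: increasing epochs at which the curriculum advances;
                one fewer entry than EXPOSURE_RATIOS.
    """
    stage = bisect_right(boundaries, epoch)
    return EXPOSURE_RATIOS[min(stage, len(EXPOSURE_RATIOS) - 1)]

def exposed_subset(pairs, ratio):
    """Assumed interpretation: train on the first `ratio` share of pairs."""
    n = max(1, round(ratio * len(pairs)))
    return pairs[:n]

boundaries = [10, 20, 30, 40]            # placeholder epoch boundaries
pairs = [(f"name_{i}", f"555-{i:04d}") for i in range(10)]
for epoch in (0, 10, 25, 45):
    ratio = exposure_ratio(epoch, boundaries)
    print(epoch, ratio, len(exposed_subset(pairs, ratio)))
```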

**Table 5.** Length-dependent curriculum hyperparameters for Qwen3-8B-Instruct on the PhoneBook benchmark using MemFT-SW. The approximate sample count is computed from the answer-only PhoneBook tokenization under the Qwen tokenizer. All schedules use curriculum exposure ratios [0.2, 0.4, 0.6, 0.8, 1.0].

**Table 6.** Length-dependent curriculum hyperparameters for Llama3.1-8B-Instruct on the PhoneBook benchmark using MemFT-SW. The approximate sample count is computed from the answer-only PhoneBook tokenization under the Llama tokenizer. Since PhoneBook targets are shorter under the Llama tokenizer, the same answer-only length corresponds to more training pairs than in Qwen. All schedules use curriculum exposure ratios [0.2, 0.4, 0.6, 0.8, 1.0].

Additional loss‑curve figures show that every LoRA adapter, across the full rank‑length sweep, reaches convergence; each subplot corresponds to a fixed length‑rank pair and confirms that under‑training does not explain performance differences.

Performance‑landscape visualizations plot memorization accuracy over LoRA rank, memory length, model, and training method; subplots grouped by rank reveal where each method gains or loses as length increases, complementing the averaged tables.

Token‑level teacher‑forcing probability grids for Qwen‑3‑8B (layer 24) display $p(t_i\mid t_{<i})$ across lengths and ranks; blue curves trace probabilities, red dots mark positions where $p<0.5$ (stubborn tokens), and black dotted lines indicate the free‑run first‑failure position $i^\ast$.
