Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni

A sleep-inspired paradigm for LLMs that uses periodic memory consolidation and self-generated dreaming to enable lifelong continual learning.

How can LLMs mimic human sleep to consolidate fragile, high-frequency memories into stable, long-term knowledge without catastrophic forgetting?

Large Language Models are effectively static after pre-training, forcing them to rely on fragile in-context learning that vanishes when the context window resets. The authors introduce a "Sleep" paradigm that periodically pauses active input processing to distill short-term memories into stable, long-term parametric knowledge through self-distillation and synthetic data generation. This approach enables models to acquire new skills sequentially without catastrophic forgetting, outperforming standard fine-tuning and in-context learning baselines on long-horizon and continual learning tasks.

Paper Primer

The method hinges on a two-stage "Sleep" cycle: Memory Consolidation and Dreaming. During consolidation, the model uses "Knowledge Seeding" to distill information from high-frequency, volatile memory blocks into expanded, stable low-frequency parameters. Dreaming then acts as a self-improvement phase, where the model generates synthetic data to refine its grasp of recently acquired knowledge without human supervision.

The Sleep paradigm enables robust sequential learning of new languages without catastrophic forgetting.

In sequential translation tasks (Manchu and Kalamang), standard in-context learning performance drops sharply, while the Sleep-enabled model recovers nearly all of its single-language performance.

Why is this approach more effective than standard fine-tuning or simple in-context learning?

Standard fine-tuning suffers from catastrophic forgetting, while in-context learning is limited by the context window. The Sleep paradigm avoids these by using explicit, lossy distillation to move knowledge into stable parameters, effectively "compressing" experience into long-term memory.

What is the role of the "Dreaming" phase compared to consolidation?

While consolidation stabilizes knowledge, Dreaming is a self-modifying process that uses reinforcement learning to generate synthetic data, allowing the model to explore and strengthen its newly formed connections, similar to REM sleep in humans.

For researchers in continual learning, this paper shifts the focus from static training/test splits to a lifecycle model where "sleep-time compute" is essential for maintaining and expanding a model's knowledge base over time.

Introduction

LLMs cannot separate fleeting inputs from lasting knowledge, so we propose a sleep‑inspired memory hierarchy to enable continual learning.

Large Language Models achieve impressive performance on many tasks, yet after their initial training they become static. When presented with new information they either rely on the limited context window (in‑context learning) or remain unchanged, so their knowledge quickly becomes outdated.

Updating a model with fresh data typically triggers catastrophic forgetting: the model’s proficiency on earlier tasks degrades sharply. This tension—between knowledge obsolescence and destructive updates—highlights a core instability in how LLMs handle high‑frequency inputs versus stable, long‑term knowledge.

To address this, we introduce the “Sleep” paradigm, a two‑stage process that mirrors biological sleep. First, Memory Consolidation (called Knowledge Seeding) distills the fragile memories of a smaller model into a larger one, expanding capacity while preserving what has been learned. Second, the Dreaming phase lets the model rehearse and refine its abilities autonomously via reinforcement‑learning‑driven synthetic data generation.

The sleep paradigm draws inspiration from biological sleep, where short‑term memories are replayed and integrated into stable long‑term storage.

Continual Learning Setup

Continual learners need alternating active and sleep phases to keep knowledge stable.

Continual learning removes the clean split between training and test phases; the model is always either ingesting new data or operating in isolation. Without a dedicated mechanism to separate these modes, high‑frequency inputs overwrite older knowledge, leading to catastrophic forgetting.

We therefore distinguish two operational states. In the active (wake) state the learner processes incoming tokens, while in the sleep state it receives little or no external input and focuses on reorganizing its internal representations.

**Figure 1.** (Conventional Machine Learning vs. Continual Learning) In conventional machine learning the lifespan of a model is often divided into training and test time, whereas the continual learning setup has no such phases. We suggest that a continual learner needs different stages of activeness in learning, which we refer to as (i) Active or Wake Time and (ii) Sleep Time. Sleep time is not a passive state; rather, the model internally processes data to consolidate memories from fast, unstable modules into more stable, low-frequency (slow) components.

Building on this observation, we propose a “sleep” paradigm for LLMs that interleaves two complementary phases: (i) Memory Consolidation, which transfers knowledge from high‑frequency layers to newly unlocked low‑frequency parameters, and (ii) Self‑Improvement via Dreaming, where the model generates synthetic data to refine its own performance.

Our technical contributions are: (1) a periodic parameter (de)activation schedule that preserves plasticity while protecting older weights; (2) Knowledge Seeding, an upward distillation where a smaller, inactive sub‑model teaches a larger, active one; (3) a generalized knowledge‑distillation objective that blends on‑policy distillation with imitation learning; and (4) an empirical suite covering factual knowledge, few‑shot learning, long‑context understanding, and continual learning.

Biological Inspiration

Human consolidation stages illustrate why offline sleep‑based memory is essential.

Human memory consolidates both while awake and during rest, but the mechanisms differ markedly. Online consolidation strengthens a memory through active recall, making it more stable and semantic‑like. Offline consolidation occurs after learning, during sleep or quiet rest, gradually transforming the memory into a distributed neocortical representation.

The NL paradigm (Behrouz et al., 2025) targets this active stage with a Hope architecture that reactivates memories in a continuum memory system. Although knowledge flows from fast, unstable components to slower, low‑frequency modules, the process retains the same abstraction level and adds no lossy compression. Consequently, online consolidation remains selective, retrieval‑dependent, and tied to the current context, missing higher‑level integration.

Offline consolidation, inspired by human sleep, provides the missing lossy compression that stabilizes knowledge over long periods. During sleep the brain engages structured activity that reshapes memories beyond mere retrieval. This stage is essential for preventing catastrophic forgetting in continual learning.

Non‑REM (slow‑wave) sleep features synchronized high‑amplitude, low‑frequency waves that drive two core processes. First, synaptic homeostasis globally downscales synaptic strengths, preserving metabolic balance. Second, memory consolidation transfers fragile experiences from the hippocampus to the neocortex, extracting abstractions into a semantic network.

Rapid‑eye‑movement (REM) sleep exhibits high‑frequency, low‑amplitude activity resembling wakefulness and is closely linked to dreaming. It selectively strengthens newly formed synapses and weaves new information into existing emotional and semantic frameworks. The stage is also hypothesized to simulate future scenarios, enhancing adaptive behavior.

The nightly alternation of NREM and REM thus first prunes and consolidates daily experiences, then explores novel connections and reinforces salient pathways.

The Continuum Memory System

CMS organizes model weights by update frequency to protect stable knowledge from rapid overwrites.

Without a way to separate fleeting inputs from durable knowledge, LLMs overwrite recent information and suffer catastrophic forgetting.

CMS arranges parameters into a hierarchy of “notebooks”: the top notebook is rewritten every few steps (high‑frequency memory), while deeper notebooks are updated only after many steps (stable, low‑frequency knowledge). This keeps rapidly changing signals from erasing long‑term facts.

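Setup: a fast layer $f_1$ with update period $C(1)=2$ and a slow layer $f_2$ with update period $C(2)=4$, both initialized to $0$.

Step 0: both layers idle (no update yet).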

Step 1: $i=1$ not divisible by $2$ or $4$ → no updates.

Step 2: $i=2$ divisible by $C(1)=2$ → fast layer receives error $e_{2,1}= \eta(1)\sum_{t=0}^{2} f(\cdot) = 1$ (assume unit error), so $\theta^{(3)}(f_1)= -1$; slow layer unchanged.

Step 3: $i=3$ no update.

Step 4: $i=4$ divisible by both $C(1)$ and $C(2)$ → fast layer updates again ($e_{4,1}=1$) and slow layer updates ($e_{4,2}= \eta(2)\sum_{t=0}^{4} f(\cdot)=2$). New parameters: $\theta^{(5)}(f_1)= -2$, $\theta^{(5)}(f_2)= -2$.

Steps 5‑8 repeat the pattern, yielding fast‑layer updates every 2 steps and slow‑layer updates every 4 steps.

The fast layer reacts to every short burst of data, while the slow layer only incorporates a summary after several bursts, preventing the rapid fluctuations from overwriting the accumulated knowledge.
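The schedule in the worked example can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `run_cms`, the constant per-step gradient of 1, and the learning rate of 0.5 for both layers are assumptions chosen so that the numbers match the steps above.

```python
def run_cms(num_steps, periods, lrs, grad=1.0):
    """Simulate layers that apply accumulated gradients every `period` steps.

    periods[l] plays the role of C(l); lrs[l] plays the role of eta(l).
    The constant gradient `grad` is an illustrative assumption.
    """
    params = [0.0] * len(periods)   # one scalar parameter per layer
    accum = [0.0] * len(periods)    # gradient accumulated since last update
    history = []
    for step in range(1, num_steps + 1):
        for level, period in enumerate(periods):
            accum[level] += grad
            if step % period == 0:
                # Apply the aggregated error, then clear the accumulator.
                params[level] -= lrs[level] * accum[level]
                accum[level] = 0.0
        history.append((step, tuple(params)))
    return history


history = run_cms(num_steps=8, periods=[2, 4], lrs=[0.5, 0.5])
for step, (fast, slow) in history:
    print(f"step {step}: fast={fast:+.1f} slow={slow:+.1f}")
```

Under these assumptions the fast parameter moves at steps 2, 4, 6, and 8, while the slow parameter moves only at steps 4 and 8, each time absorbing the error aggregated over its whole window.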

How does CMS differ from a hierarchical RNN that also processes inputs at multiple time scales?

In a hierarchical RNN the hidden states are passed upward each step, so every layer still receives a fresh signal every time step. CMS, by contrast, freezes the parameters of higher layers for many steps; they only see aggregated gradients after $C(\ell)$ steps, which isolates long‑term knowledge from short‑term noise.

Memory Consolidation Mechanism

Memory Consolidation turns fleeting updates into stable knowledge before each block refresh.

During wake time the model’s high‑frequency modules overwrite their weights at every step, which quickly erases older information and leads to catastrophic forgetting. The Sleep framework therefore inserts a consolidation step right before a block is updated, moving the distilled knowledge into a slower‑frequency module that retains it longer.

Think of a notebook that you write on all day (high‑frequency memory); before the page fills, you copy the important points into a bound archive (low‑frequency memory) so they survive the next day's scribbles.

Update 1: Block A stores $[1,0,0,0]$, Block B stores $[0,1,0,0]$.

Update 2: New inputs overwrite Block A with $[0,0,1,0]$, Block B with $[0,0,0,1]$.

Window $f_w$ expires → the four vectors written during the window are summed to $[1,1,1,1]$ (abstracted knowledge).

The summed vector is fed to the mid‑frequency block (size 2) via a linear projection, producing $[2,2]$.

Mid‑frequency parameters are updated with $[2,2]$ and then frozen; high‑frequency blocks are re‑initialized to zeros.

Consolidation preserves the cumulative signal from multiple rapid updates while freeing the high‑frequency slots for fresh information.
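A toy version of this example, written in plain Python, is sketched below. The helper name `consolidate` and the specific projection matrix (which sums adjacent pairs of coordinates) are assumptions picked to reproduce the $[2,2]$ result; the source only says that a linear projection is used.

```python
def consolidate(snapshots, projection):
    """Sum every vector written during the window, then project it down.

    snapshots: list of equal-length vectors stored by the fast blocks.
    projection: matrix (list of rows) mapping the summed vector to the
                smaller mid-frequency block. Its values are hypothetical.
    """
    dim = len(snapshots[0])
    summed = [sum(vec[i] for vec in snapshots) for i in range(dim)]
    projected = [sum(w * x for w, x in zip(row, summed)) for row in projection]
    return summed, projected


# Vectors written to Block A and Block B across Update 1 and Update 2.
window = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Assumed 2x4 projection: each output sums one pair of inputs.
projection = [[1, 1, 0, 0], [0, 0, 1, 1]]

summed, mid_update = consolidate(window, projection)
# After consolidation the fast blocks are reset for the next wake cycle.
fast_blocks = [[0, 0, 0, 0], [0, 0, 0, 0]]
print(summed, mid_update)
```

The point of the sketch is that the mid-frequency block receives a compressed summary of the whole window rather than whichever vector happened to be written last.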

Detect that the current update count has reached the window $f_w$ for the target high‑frequency block.

Run a lightweight encoder to compress the block’s current activations into a knowledge vector $k$.

Apply the parameter‑expansion module to temporarily augment the block’s weight matrix, allowing $k$ to be injected without overwriting existing parameters.

Distill $k$ into the next lower‑frequency block using the shared distillation loss (see §6).

Reset the high‑frequency block’s parameters to their initial state for the next wake cycle.

**Figure 2.** An overview of Memory Consolidation. The model increases its own number of parameters to enhance its capacity (Section 3.2), and then using the knowledge seeding, it transfers the knowledge abstractions from the higher to a lower-frequency memory (Section 3.3).

**Figure 7.** Multi-frequency memory hierarchy. Updates enter the High-Frequency FFN via repeated Parameter Expansion; when the window $f_w$ expires, knowledge is Consolidated to the Mid- and then Low-Frequency FFNs (1k$\rightarrow$5k$\rightarrow$10k).

How does Memory Consolidation differ from standard replay buffers used in continual learning?

Replay buffers store raw past examples and replay them later; Memory Consolidation instead compresses the high‑frequency block’s internal state into a distilled knowledge vector and injects it directly into a slower‑frequency block, avoiding the need to retain raw data.

Knowledge Seeding

Gradual expansion seeds larger blocks with distilled knowledge from faster ones.

High‑frequency memories are updated many times and therefore overwrite their own traces before slower memories can absorb them, leading to catastrophic forgetting. To keep the fast knowledge alive we must seed the slower block **before** the fast block’s next update.

Think of a garden: tiny seedlings (fast memories) are periodically transplanted into a larger pot (slow memory) so they keep growing instead of being trampled.

Step 1: $F$ holds parameters $\theta^{F}$; $S$ holds $\theta^{S}$.

Step 2: Add expert $A\in\mathbb{R}^{4\times2}$, $B\in\mathbb{R}^{2\times4}$ to $S$, forming expanded parameters $\theta^{S}_{\text{exp}}$.

Step 3: Sample $D$ from $\text{LM}_{\theta}$, compute teacher logits for each token.

Step 4: Perform token‑wise distillation: update $A,B$ so that $S$ reproduces teacher logits on $D$.

Step 5: Update $F$ once (fast update). Because $S$ already contains $F$’s knowledge, the update does not erase it.

Step 6: Repeat steps 3–5 nine more times before $S$ itself updates at the 10 K‑step mark.

Seeding lets the slower block accumulate the fast block’s knowledge incrementally, so when $S$ finally updates it already holds a compressed version of ten fast updates, dramatically reducing forgetting.
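Step 2 of the example, adding a low-rank expert $A B$ to the slow block, can be illustrated with a small pure-Python sketch. The identity weight matrix, the values of `a`, and the zero initialization of `b` (a convention borrowed from LoRA-style adapters) are all assumptions for illustration; the source specifies only the shapes $4\times2$ and $2\times4$.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def expand(weight, a, b):
    """Return weight + a @ b without modifying the original weight."""
    delta = matmul(a, b)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


# A 4x4 identity stands in for the slow block's existing weights (assumption).
weight = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0]]   # expert factor, 4x2
b = [[0.0] * 4 for _ in range(2)]                      # expert factor, 2x4, zero-init
x = [1.0, 2.0, 3.0, 4.0]

before = matvec(expand(weight, a, b), x)   # expert is a no-op at initialization
b[0][0] = 1.0                              # stand-in for one distillation update
after = matvec(expand(weight, a, b), x)
print(before, after)
```

With a zero-initialized factor the expanded block starts out behaving exactly like the original, and seeding then changes only the expert factors while the original weights stay untouched.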

How does Knowledge Seeding differ from ordinary knowledge distillation, where the teacher is larger?

Standard distillation assumes a **larger** teacher and a **smaller** student that simply absorbs the teacher’s logits. In KS the direction is reversed: the teacher is the smaller, faster block and the student is the larger, slower one. The student also **receives a dedicated low‑rank expert** that is explicitly allocated to store the teacher’s knowledge, and the process happens during the model’s “sleep” phase when no external data are available.

Identify the fastest block $\text{MLP}^{(\ell^{*}-1)}$ whose update frequency divides the current step count.

Before the fast block’s next update, add a low‑rank expert to the next slower block $\text{MLP}^{(\ell^{*})}$.

Run a sleep interval: sample $D$ from $\text{LM}_{\theta}$, perform Knowledge Seeding from $\text{MLP}^{(\ell^{*}-1)}$ into the new expert of $\text{MLP}^{(\ell^{*})}$.

Update $\text{MLP}^{(\ell^{*}-1)}$ using its own gradient step.

Repeat the above until the slower block’s own update step arrives, then move the pointer $\ell^{*}\leftarrow\ell^{*}+1$.
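The ordering of these steps can be sketched as a simple event schedule. This is only a sketch of the control flow: the function name `seeding_schedule`, the event labels, and the 1K fast period (inferred from ten fast updates fitting before the 10K-step mark) are assumptions, and no actual distillation is performed.

```python
def seeding_schedule(fast_period, slow_period, total_steps):
    """List (step, event) pairs in the order the procedure describes.

    At every fast-block boundary the slow block's expert is seeded first,
    and only then is the fast block allowed to overwrite itself.
    """
    events = []
    for step in range(fast_period, total_steps + 1, fast_period):
        events.append((step, "seed slow expert from fast block"))
        events.append((step, "update fast block"))
        if step % slow_period == 0:
            events.append((step, "update slow block"))
    return events


events = seeding_schedule(fast_period=1_000, slow_period=10_000, total_steps=10_000)
num_seeds = sum(1 for _, name in events if name.startswith("seed"))
print(num_seeds, events[-1])
```

The design choice the schedule highlights is that seeding always precedes the fast update at the same step, so the slow block has already captured the fast block's knowledge before it is overwritten.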

Distillation Objective

Defines the loss that consolidates high‑frequency memory into stable parameters via distillation.

The model must retain fleeting, high‑frequency knowledge without overwriting its long‑term core, yet naïve fine‑tuning erases what was already stable.

The loss forces the expanded (stable) parameters to absorb what the high‑frequency memory has learned while keeping the original knowledge intact.

Take $\lambda = 0.3$, two off‑policy loss terms of 0.5 and 0.6, and an on‑policy loss of 0.2. Compute the off‑policy contribution: (1 − $\lambda$) × (0.5 + 0.6)/2 = 0.7 × 0.55 = 0.385.

Compute the on‑policy contribution: $\lambda$ × 0.2 = 0.3 × 0.2 = 0.06.

Sum both contributions: 0.385 + 0.06 = 0.445. This is the scalar loss for the batch.

Only the expanded head’s weights receive gradients proportional to 0.445; the frozen backbone stays unchanged.

The loss shows how a modest $\lambda$ lets the model learn from its own generations without letting noisy self‑samples dominate the update.
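The arithmetic above can be checked with a short Python sketch. The function name `mixed_distillation_loss` and its argument names are hypothetical; the sketch only reproduces the weighted sum from the example and does not compute the underlying distillation terms themselves.

```python
def mixed_distillation_loss(off_policy_losses, on_policy_loss, lam):
    """Blend the mean off-policy loss with the on-policy loss.

    lam is the mixing weight (lambda in the text): 0 uses only the
    off-policy term, 1 uses only the on-policy term.
    """
    off_policy = sum(off_policy_losses) / len(off_policy_losses)
    return (1.0 - lam) * off_policy + lam * on_policy_loss


loss = mixed_distillation_loss([0.5, 0.6], on_policy_loss=0.2, lam=0.3)
print(round(loss, 3))
```

Raising `lam` toward 1 would shift weight onto the model's own generations, which is the trade-off the example describes.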

Why freeze the student’s original parameters instead of fine‑tuning them together with the expanded head?

Fine‑tuning the whole model would let the new on‑policy signal overwrite the stable knowledge encoded in the backbone, re‑introducing catastrophic forgetting. By freezing the backbone we guarantee that only the dedicated expansion adapts, preserving the long‑term core while still acquiring new high‑frequency patterns.

LTI turns the distillation into a reinforcement signal that rewards the student for reproducing teacher continuations, closing the gap between knowledge storage and usage.

Learning‑to‑Imitate loop – sample prefix, generate continuation, compute reward, update expanded head.

How does LTI differ from standard reinforcement‑learning fine‑tuning (e.g., GRPO) that also uses a reward signal?

Standard RL fine‑tuning treats the whole model as a policy and updates all parameters, which can quickly erase existing knowledge. LTI, by contrast, freezes the backbone and updates only the expanded head, and its reward is explicitly tied to reproducing teacher continuations rather than an external task metric. This makes LTI a targeted imitation step that preserves stability.

The Dreaming Phase

Dreaming Phase generates selective synthetic data to refine the model while avoiding forgetting.

After consolidating high‑frequency memory into lower‑frequency blocks, the model still needs a way to keep learning without erasing what it just saved. The paper therefore adds a “Dreaming” stage that creates synthetic data to fine‑tune the model while protecting existing knowledge.

The model generates “dreams” – synthetic examples conditioned on a task context – filters them for usefulness, and then fine‑tunes on the selected dreams, thereby expanding its knowledge without destabilising the consolidated memory.

How does Dreaming differ from ordinary synthetic‑data augmentation?

Ordinary augmentation simply adds random perturbations to existing data, assuming all generated samples are equally useful. Dreaming first injects random experts during generation, then scores each sample by the gradient of the SFT loss, keeping only the most informative ones (top‑$k$ plus a few random for diversity) before fine‑tuning. This selective loop makes the synthetic data actively shape the model rather than passively expanding the dataset.

Generate two dreams: $\text{DREAM}(1)$ = “The cat climbs a tree.”, $\text{DREAM}(2)$ = “A robot paints a portrait.”.

During generation each MoE router randomly activates an extra expert, introducing a hidden pattern (e.g., a rare verb tense).

Compute importance scores: $g(1)=0.42$, $g(2)=0.07$ (larger gradient magnitude means the dream would change the loss more).

Select the top‑$k$ dream ($\text{DREAM}(1)$) and add one random dream ($\text{DREAM}(2)$) to form $D=\{\text{DREAM}(1),\text{DREAM}(2)\}$.

Fine‑tune a copy of the model on each dream using LoRA SFT, obtaining $\theta'(1)$ and $\theta'(2)$.

Evaluate: $LM_{\theta'(1)}$ improves the downstream metric, so $r=1$ for dream 1; $LM_{\theta'(2)}$ shows no gain, so $r=0$ for dream 2.

The gradient‑based score filters out dreams that would waste compute, while keeping a random sample preserves diversity and prevents the model from over‑fitting to a narrow pattern.
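The selection step in this example (top‑$k$ by gradient score plus a few random picks) can be sketched as follows. The function name `select_dreams`, the `seed` argument, and the dictionary layout are assumptions for illustration; the scores are the ones from the example.

```python
import random


def select_dreams(scores, k, n_random, seed=0):
    """Keep the k highest-scoring dreams plus n_random others for diversity.

    scores maps a dream identifier to its gradient-based importance score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:k]
    rest = ranked[k:]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    extra = rng.sample(rest, min(n_random, len(rest)))
    return top + extra


scores = {"DREAM(1)": 0.42, "DREAM(2)": 0.07}
selected = select_dreams(scores, k=1, n_random=1)
print(selected)
```

With only two dreams the random pick is forced to be the remaining one; with a larger pool the random picks would vary with the seed while the top‑$k$ set stays fixed.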

**Figure 8.** Memory consolidation by routed expert updates. Across Sleep cycles (left→right), a router selects and updates a small set of experts (solid), leaving others inactive (hatched), expanding capacity while limiting interference.

Empirical Evaluation

Empirical evaluation of the Sleep framework’s consolidation stages on continual and long‑context tasks.

The Sleep framework distinguishes fleeting high‑frequency inputs from stable long‑term knowledge, consolidating the former into durable parameters to curb forgetting.

HOPE, equipped with the memory‑consolidation phase, attains the highest accuracy across all three class‑incremental benchmarks.

Figure 3 shows HOPE outperforming ICL, EWC, InCA, and the baseline HOPE variant on CLINC, Banking, and DBpedia for both Llama‑3B and Llama‑3‑8B models.

Increasing the number of consolidation stages consistently improves in‑context learning and long‑context performance.

Figure 4 reports higher accuracy (or lower perplexity) for variants with more stages across MK‑NIAH, LongHealth, and QASPER.

Reducing the lowest consolidation frequency (making the most persistent memory more adaptive) degrades performance.

Figure 4 shows a clear drop when the lowest frequency is lowered, even with the same number of stages; the drop grows in proportion to the frequency reduction.

**Figure 3.** Class-incremental learning for text classification is evaluated on the (Left) CLINC dataset (Larson et al. 2019), (Middle) Banking dataset (Casanueva et al. 2020), and (Right) DBpedia dataset (Auer et al. 2007). The HOPE architecture consistently outperforms other continual learning approaches, achieving the highest accuracy.

**Figure 4.** Effect of memory levels on in-context learning performance for (Left) MK-NIAH from RULER (Hsieh et al. 2024), (Middle) LongHealth (Adams et al. 2025), and (Right) QASPER (Dasigi et al. 2021). Lower values indicate better performance for QASPER.

**Figure 5.** Continual Translation of a Novel Language (CTNL) task. Red points show performance when training on a single language, whereas blue points show performance under continual learning.

Performance Benchmarks

Key numbers show the Sleep framework beats all baselines on reasoning and language tasks.

The following tables summarize how the Sleep framework performs on a suite of mathematical‑reasoning and continual‑learning benchmarks.

Sleep attains the top scores on all three reasoning benchmarks, reaching 79.2 on AIME‑24, 69.0 on AIME‑25, and 46.1 on HMMT‑25, thereby surpassing every baseline.

Table 1 reports these values for Sleep and lower numbers for all ablations, OPSD, and OPSD + Expansion.

**Table 1.** Performance comparison of different methods on AIME-24, AIME-25, and HMMT-25 benchmarks using the Qwen3-8B model.

Beyond static reasoning, the authors evaluate the Sleep (Hope) framework on continual language learning, comparing three consolidation stages (Hope‑1, Hope‑2, Hope‑3) against standard in‑context learning (ICL).

Hope‑3 almost fully recovers the single‑language performance after sequential exposure, whereas ICL collapses to pre‑trained levels, highlighting the benefit of sleep‑time consolidation.

**Table 2.** Performance of different methods on mathematical reasoning benchmarks. We use different variants of Qwen models and report average@16.

The BABILong benchmark further demonstrates Sleep’s scalability: with a 10 M‑token context the method attains an almost perfect score, far surpassing prior long‑context models.

Reasoning Performance

Sleep boosts high‑frequency reasoning performance across model sizes.

Sleep achieves the highest HMMT‑25 score for the Qwen3‑8B model.

Sleep: 46.1 % vs. GRPO 45.1 % and OPSD 44.9 % on HMMT‑25.

SFT fine‑tunes a pretrained language model on a curated instruction dataset, aligning its outputs with human‑written responses.

How does SFT differ from standard pre‑training?

Pre‑training learns from raw text without explicit instruction signals; SFT adds a supervised layer of human‑written prompts and desired outputs, steering the model toward task‑specific behavior.

GRPO optimizes a policy by comparing groups of sampled actions, encouraging improvements relative to the group’s average performance.

Why not use a simple REINFORCE baseline instead of GRPO?

GRPO’s group‑wise baseline adapts to the current batch’s reward distribution, whereas a fixed baseline can be mismatched and increase gradient variance, slowing convergence.
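The group-wise baseline can be illustrated with a short Python sketch. It follows the commonly used formulation in which each reward is centered on the group mean and scaled by the group standard deviation; the scaling step, the function name `group_relative_advantages`, and the `eps` term are assumptions not stated in the text above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    The group mean acts as the baseline; dividing by the group standard
    deviation (plus eps for stability) normalizes the scale.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


# Four responses to the same prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Because the baseline is recomputed for every group, responses are only rewarded for beating their siblings, which is what lets the baseline track the current reward distribution.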

Ablations and Conclusion

We assess how each Sleep component affects performance and recap the overall contribution.

LLMs struggle to keep fleeting inputs separate from durable knowledge; Sleep adds a multi‑frequency memory hierarchy that consolidates high‑frequency memory into stable parameters.

This table compares the success rates of four different methods: ICL, TTT, SEAL, and Sleep.

We follow the SQuAD integration protocol of Zweiger et al. (2025), comparing a base model, a fine‑tuned model without dreaming, SEAL, and our Transformer‑based two‑ and four‑level memory systems. Across both evaluation regimes, every added component raises accuracy, confirming the value of the consolidation steps.

Sleep attains an 80 % success rate on the few‑shot ARC benchmark, the highest among all compared methods.

Table 4 shows Sleep at 80 % versus SEAL at 72.5 % and the next best method below that.

For the few‑shot experiment we use the ARC protocol, filter out unsolvable tasks, and train on 11 tasks while holding out 8 for evaluation. The backbone is Llama‑3.2‑1B, and the Sleep pipeline generates synthetic “dream” examples that augment the limited training data.

Ablations on the memory‑consolidation pipeline (Figure 1) reveal that each design element—gradient‑based selection, a random expert, and the Dreaming phase—contributes positively; removing any of them reduces performance on mathematical reasoning benchmarks.
