Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities without Retraining

DG-Hard uses spectral filtering to recover capabilities lost during fine-tuning without retraining or labeled data.

How can we post-hoc repair language models that have lost capabilities during fine-tuning without retraining or access to the original training data?

Fine-tuning language models often destroys existing capabilities, a form of catastrophic forgetting that standard optimization cannot prevent. The authors treat the fine-tuning weight update as a signal-plus-noise matrix, where task-relevant updates are singular-value spikes and collateral damage is random-matrix bulk. DG-Hard applies a closed-form Donoho-Gavish hard threshold to these singular values, stripping the noise residual while preserving the task-aligned signal. Across 14 model-task settings, this data-free repair achieves the strongest balanced recovery of damaged capabilities while retaining fine-tuning gains.

Paper Primer

Fine-tuning moves a model's weights in two ways: it encodes the target task, but it also accumulates thousands of mini-batch residuals that act as noise. Because these residuals are interleaved with task-relevant information in coordinate space, simple methods like scalar interpolation or pruning struggle to separate the two without sacrificing performance.

DG-Hard is a spectral denoiser: it decomposes the weight-delta matrix into singular values, identifies the noise bulk using random-matrix theory, and zeroes out everything below the AMSE-optimal Donoho-Gavish threshold. This leaves only the structured, task-aligned singular components to be added back to the base model.

DG-Hard achieves the best recovery-preservation trade-off among post-hoc baselines.

Across 14 (model, task) settings and 9 held-out benchmarks, it consistently outperforms coordinate-space methods like TIES and DARE. It restores safety alignment on three independent axes without using any alignment data.

Why is this approach more effective than simply interpolating between the base and fine-tuned models?

Interpolation is a global scalar operation that cannot distinguish between task-aligned signal and collateral noise. DG-Hard operates in singular-value space, allowing it to surgically remove the noise bulk while keeping the high-energy signal spikes intact.

Does this method require access to the original training data or a calibration set?

No. DG-Hard is entirely data-free and gradient-free; it requires only the base and fine-tuned checkpoints, making it applicable to proprietary models where the original training corpus is unavailable.

For researchers and practitioners, this paper demonstrates that catastrophic forgetting is partially a removable spectral artifact. You can now repair damaged model capabilities post-hoc in minutes without needing to re-run training pipelines.

Introduction

We define catastrophic forgetting in fine‑tuning and propose a spectral repair to recover lost capabilities.

Fine‑tuning a pretrained language model often boosts performance on the target task but simultaneously erodes capabilities that were present in the original checkpoint, a phenomenon known as catastrophic forgetting.

When a model is adapted to a new objective, gradient updates can overwrite parameters that encode unrelated skills, so the model loses those skills even though they were never explicitly penalized.

Storing the full $\Delta$ for a single weight matrix with 12 M parameters, in a 4‑billion‑parameter model, takes roughly 48 MB at 32‑bit precision (12 M × 4 bytes), and the footprint grows quickly when many layers are involved.

DG‑Hard’s spectral filtering can discard the bulk of $\Delta$ without touching the high‑energy singular components, reducing the effective storage to a few singular vectors.

This illustrates why a post‑hoc repair must be aware of the size of $\Delta$: naïvely materialising the entire update is infeasible for large models.

The weight‑delta matrix is often low‑rank; most of its energy concentrates in a handful of singular values, making spectral pruning a memory‑efficient remedy.
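As a rough illustration of the storage argument, the sketch below compares the bytes needed for a dense $\Delta$ against a rank-$r$ factorization ($r$ left singular vectors, $r$ right singular vectors, $r$ singular values). The shape and kept rank are the ones reported for the Llama-3.2-3B `up_proj` example in Figure 4; the 4-byte (fp32) element size and the helper names are assumptions made for illustration.

```python
def dense_bytes(m: int, n: int, bytes_per_elem: int = 4) -> int:
    """Bytes needed to store a full m x n delta matrix."""
    return m * n * bytes_per_elem


def low_rank_bytes(m: int, n: int, rank: int, bytes_per_elem: int = 4) -> int:
    """Bytes for rank-r factors: r left vectors, r right vectors, r singular values."""
    return rank * (m + n + 1) * bytes_per_elem


# Shape and kept rank from the Figure 4 example; fp32 storage is an assumption.
m, n, rank = 8192, 3072, 30
full = dense_bytes(m, n)
compact = low_rank_bytes(m, n, rank)
print(f"dense: {full / 1e6:.1f} MB, rank-{rank}: {compact / 1e6:.2f} MB, "
      f"ratio: {full / compact:.1f}x")
```

The saving only applies to storing the filtered delta; computing the SVD in the first place still requires the dense matrix.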

Existing post‑hoc baselines such as WiSE‑FT, DARE, TIES, and FAPM decide which entries of $\Delta$ to keep by looking at individual coordinates (magnitude, sign, or random dropping), but task‑relevant and harmful entries are interleaved, so these methods trade off recovery against preservation.

When we examine the singular‑value spectrum of $\Delta$ across many fine‑tuning runs, a clear spectral cliff appears: a narrow bulk follows the Marchenko‑Pastur law (noise), while a few outlying spikes carry the task‑aligned signal.

DG‑Hard exploits this structure by applying the Donoho‑Gavish hard singular‑value threshold to each $\Delta$ matrix, discarding the bulk and retaining the spikes; the repaired checkpoint is $W^* = W_{\text{base}} + \Delta^*$.

The method is completely data‑free, requires no gradient computation, and finishes in minutes on a single GPU, making it a practical post‑hoc fix for any fine‑tuned model.

To evaluate repair quality we introduce a partition‑conditional metric that separately measures healing of damaged benchmarks, preservation of improvements, non‑damage to unchanged abilities, and retention of the target‑task performance.

Across 14 (model, task) settings and nine held‑out benchmarks, DG‑Hard achieves the strongest balanced repair among all post‑hoc baselines, and it also restores safety alignment without any alignment data.

The core problem is post‑hoc capability recovery: how to undo fine‑tuning damage while keeping the newly acquired task performance.

Background and Related Work

This section gives a concise overview of prior approaches to mitigating forgetting and to post‑hoc weight repair.

Fine‑tuning a pretrained model can erase previously learned capabilities, a phenomenon known as catastrophic forgetting. Early work linked this to distributed representations perturbing weights that encode earlier tasks, and recent studies have reproduced the effect in large language models across both general‑knowledge and safety benchmarks.

Regularization‑based methods add a penalty to the optimizer, weighting each parameter’s update by an importance estimate derived from the previous task.

Gradient‑masking techniques attenuate gradients on units deemed important for the pretrained capabilities, effectively freezing them during fine‑tuning.

Replay‑based approaches interleave gradient steps on stored examples from previous tasks to preserve their performance.

LoRA constrains the fine‑tuning update to a low‑rank factor $BA$ with rank $r \ll \min(m,n)$, keeping the base weights frozen.

Post‑hoc methods form a family of techniques that operate on the weight‑delta matrix $\Delta = W_{\text{ft}} - W_{\text{base}}$ after fine‑tuning, without additional training data.

WiSE‑FT blends the pretrained model and its fine‑tuned version with a single scalar, letting the user trade off between retaining prior knowledge and adopting new task behavior.

TIES‑Merging first discards tiny changes in each task vector, then keeps only those entries where multiple tasks agree on the sign, finally averaging the consensus entries.

DARE randomly drops entries of the weight‑delta matrix and rescales the survivors, producing a sparse perturbation that can be merged without heavy computation.
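To make the coordinate-space view concrete, here is a minimal sketch of two of the baselines described above, written over flat Python lists rather than real checkpoints. The function names, the `alpha` and `drop_prob` parameters, and the toy weights are illustrative assumptions, not the configurations used in the paper.

```python
import random


def wise_ft(base, ft, alpha):
    """Scalar interpolation: alpha = 0 gives base, alpha = 1 gives the fine-tuned weights."""
    return [b + alpha * (f - b) for b, f in zip(base, ft)]


def dare(base, ft, drop_prob, seed=0):
    """Randomly drop delta entries and rescale the survivors by 1 / (1 - drop_prob)."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_prob)
    merged = []
    for b, f in zip(base, ft):
        keep = rng.random() >= drop_prob
        merged.append(b + scale * (f - b) if keep else b)
    return merged


# Toy weights, for illustration only.
base = [0.0, 1.0, -2.0, 0.5]
ft = [0.2, 0.8, -2.5, 0.5]
print(wise_ft(base, ft, alpha=0.5))
print(dare(base, ft, drop_prob=0.5))
```

Both operations decide entry by entry (or with one global scalar) what to keep, which is the contrast with the singular-value-space approach described next.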

Our contribution differs by operating in singular‑value space: we apply the Donoho‑Gavish optimal hard threshold to the spectrum of $\Delta$, a statistically optimal denoiser that has not been used for fine‑tuning repair.

The DG-Hard Repair Method

DG‑Hard repairs fine‑tuned weights by stripping IID noise via an optimal singular‑value threshold.

Fine‑tuning updates mix a low‑rank task‑specific signal with an IID noise residual; the latter drives catastrophic forgetting.

After fine‑tuning, we post‑process each weight matrix by discarding singular directions that look like pure noise, leaving only the task‑aligned signal.

$\Delta = W_{\text{ft}} - W_{\text{base}} = \begin{bmatrix}2&0\\0&1\end{bmatrix}$.

SVD of $\Delta$ yields singular values $s=[2,1]$ ($U$ and $V$ are the identity).

$\text{median}(s)=1.5$, $\beta=1$, $\omega(1)=4/\sqrt{3}\approx 2.309$, and $\hat{\sigma} = \text{median}(s)/(\mu_\beta\sqrt{\max(m,n)}) \approx 1.5/\sqrt{2} \approx 1.06$, taking $\mu_\beta \approx 1$ as a rough value for the square case.

$\tau^* = \omega(\beta)\,\hat{\sigma}\sqrt{\max(m,n)} \approx 2.309 \times 1.06 \times \sqrt{2} \approx 3.46$, so both singular values fall below $\tau^*$ and are zeroed.

$\Delta^* = 0$, thus $W^{*}=W_{\text{base}}$: the repair discards the entire update as noise.

The example shows that when the estimated noise scale is comparable to the update magnitude, the optimal threshold can erase the whole delta, illustrating why an accurate $\hat{\sigma}$ matters.
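The arithmetic of this toy case can be checked with a few lines of plain Python. The sketch follows the noise-scale and threshold formulas of Algorithm 1, in which the $\sqrt{\max(m,n)}$ factors cancel, and uses the closed-form coefficient from Gavish and Donoho that the text denotes $\omega(\beta)$. The value `mu_beta = 1.0` is the rough figure used in the example, not the exact median constant, and the variable names are illustrative.

```python
import math
from statistics import median


def omega(beta: float) -> float:
    """Closed-form hard-threshold coefficient; equals 4/sqrt(3) at beta = 1."""
    inner = math.sqrt(beta**2 + 14 * beta + 1)
    return math.sqrt(2 * (beta + 1) + 8 * beta / ((beta + 1) + inner))


s = [2.0, 1.0]   # singular values of the toy delta
m = n = 2
beta = min(m, n) / max(m, n)
mu_beta = 1.0    # rough value used in the text (assumption), not the exact constant

sigma_hat = median(s) / (mu_beta * math.sqrt(max(m, n)))
tau = omega(beta) * sigma_hat * math.sqrt(max(m, n))
kept = [x if x > tau else 0.0 for x in s]
print(round(tau, 3), kept)
```

Both singular values fall below the threshold, so the filtered delta is zero, matching the hand calculation.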

How does DG‑Hard differ from simply truncating to a fixed rank?

Fixed‑rank truncation keeps a preset number of directions regardless of their statistical origin; DG‑Hard instead computes a data‑driven threshold $\tau^*$ for each matrix, so the number of retained directions adapts to how many singular values stand clear of the IID‑noise bulk.

The singular‑value spectrum of a fine‑tuned delta often shows a sharp drop: a few large outliers (signal) followed by a dense bulk (noise).
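A small synthetic experiment can reproduce this shape. The sketch below plants a rank-3 signal inside IID Gaussian noise and compares the resulting singular values with the standard Marchenko-Pastur bulk edge $\sigma(\sqrt{m}+\sqrt{n})$. The matrix size, planted singular values, noise level, and the 10% safety margin are arbitrary choices for illustration, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank, sigma = 120, 80, 3, 1.0

# Planted low-rank "signal": orthonormal factors with three large singular values.
u, _ = np.linalg.qr(rng.standard_normal((m, rank)))
v, _ = np.linalg.qr(rng.standard_normal((n, rank)))
signal = (u * np.array([100.0, 80.0, 60.0])) @ v.T

# IID Gaussian "noise" residual.
noise = sigma * rng.standard_normal((m, n))

s = np.linalg.svd(signal + noise, compute_uv=False)
bulk_edge = sigma * (np.sqrt(m) + np.sqrt(n))  # approximate largest pure-noise singular value

print("top 5 singular values:", np.round(s[:5], 1))
print("bulk edge:", round(float(bulk_edge), 1))
print("values well above the edge:", int(np.sum(s > 1.1 * bulk_edge)))
```

The three planted directions stand far above the edge, while the remaining values crowd below it, which is the cliff-then-bulk pattern described above.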

Why isn’t the bulk edge $\lambda_{MP}$ itself used as the cutoff?

$\lambda_{MP}$ is the theoretical maximum for pure noise; singular values just above it can still be dominated by noise variance, so cutting there would retain many noisy directions and hurt AMSE.
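For a concrete sense of the gap, consider the square case in normalized units, dividing both quantities by $\hat{\sigma}\sqrt{\max(m,n)}$. The bulk-edge expression $1+\sqrt{\beta}$ is the standard Marchenko-Pastur result and is brought in here as background rather than taken from the text above.

```latex
\frac{\lambda_{MP}}{\hat{\sigma}\sqrt{\max(m,n)}} = 1 + \sqrt{\beta} = 2
\quad\text{vs.}\quad
\frac{\tau^*}{\hat{\sigma}\sqrt{\max(m,n)}} = \omega(1) = \frac{4}{\sqrt{3}} \approx 2.309
\qquad (\beta = 1)
```

For a square matrix the threshold therefore sits roughly 15% above the bulk edge.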

Gavish & Donoho derived a closed‑form threshold $\tau^*$ that minimizes the asymptotic mean‑squared error of low‑rank signal recovery.

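$\text{median}(s)=3$, $\max(m,n)=4$, $\mu_\beta \approx 1$ ⇒ $\hat{\sigma} \approx 3/\sqrt{4} = 1.5$.

$\beta=1$ ⇒ $\omega(1) \approx 2.309$.

$\tau^* = \omega(\beta)\,\hat{\sigma}\sqrt{\max(m,n)} \approx 2.309 \times 1.5 \times \sqrt{4} \approx 6.93$.

If all $s_i \le \tau^*$, the hard shrinker zeroes every singular value, yielding $\Delta^* = 0$.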

This toy case shows that when the noise estimate is large relative to the singular values, the optimal threshold can be higher than any observed value, leading to a full discard of the update.

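How does $\omega(\beta)$ change for highly rectangular matrices?

With $\beta = \min(m,n)/\max(m,n)$, the normalized Marchenko‑Pastur bulk edge $1+\sqrt{\beta}$ shrinks as the matrix becomes more rectangular, and $\omega(\beta)$ shrinks with it, from $4/\sqrt{3} \approx 2.309$ at $\beta = 1$ toward $\sqrt{2} \approx 1.414$ as $\beta \to 0$, always staying above the bulk edge. The absolute threshold $\tau^*$ still scales with $\hat{\sigma}\sqrt{\max(m,n)}$.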

**Algorithm 1** DG-Hard repair (per weight matrix).

**Require:** Base weight $W_{base} \in \mathbb{R}^{m \times n}$, fine-tuned weight $W_{ft} \in \mathbb{R}^{m \times n}$.

**Ensure:** Repaired weight $W^* \in \mathbb{R}^{m \times n}$.

1: $\Delta \leftarrow W_{ft} - W_{base}$

2: $(U, \mathbf{s}, V) \leftarrow \text{SVD}(\Delta)$

3: $p \leftarrow \min(m, n); \beta \leftarrow p / \max(m, n)$

4: $\hat{\sigma} \leftarrow \text{median}(\mathbf{s}) / (\mu_{\beta} \sqrt{\max(m, n)})$ $\quad \triangleright$ noise scale, App. B

5: $\tau \leftarrow \omega(\beta) \hat{\sigma} \sqrt{\max(m, n)}$ $\quad \triangleright$ optimal threshold, (4)

6: $s'_i \leftarrow s_i \cdot \mathbb{1}\{s_i > \tau\}$ for $i = 1, \dots, p$

7: $\Delta^* \leftarrow U \text{diag}(\mathbf{s}') V^{\top}$

8: **return** $W_{base} + \Delta^*$
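The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' implementation: `omega` uses the Gavish-Donoho closed form (which gives $4/\sqrt{3}$ at $\beta = 1$), and `mp_median_singular` assumes that $\mu_\beta$ is the median of the Marchenko-Pastur law for normalized singular values, computed here by simple numerical integration, since the text defers the exact constant to App. B. All function names and the synthetic test matrices are hypothetical.

```python
import numpy as np


def omega(beta):
    """Closed-form hard-threshold coefficient (4/sqrt(3) at beta = 1)."""
    inner = np.sqrt(beta**2 + 14 * beta + 1)
    return np.sqrt(2 * (beta + 1) + 8 * beta / ((beta + 1) + inner))


def mp_median_singular(beta, grid=200_000):
    """Median of the Marchenko-Pastur law for singular values (assumed meaning of mu_beta)."""
    lo, hi = 1 - np.sqrt(beta), 1 + np.sqrt(beta)
    edges = np.linspace(lo, hi, grid + 1)
    y = 0.5 * (edges[:-1] + edges[1:])  # midpoints avoid the endpoints
    dens = np.sqrt(np.maximum((hi**2 - y**2) * (y**2 - lo**2), 0.0)) / (np.pi * beta * y)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    return y[np.searchsorted(cdf, 0.5)]


def dg_hard_repair(w_base, w_ft):
    """Return (repaired weight, number of kept singular values, threshold)."""
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    m, n = delta.shape
    big = max(m, n)
    beta = min(m, n) / big
    sigma_hat = np.median(s) / (mp_median_singular(beta) * np.sqrt(big))
    tau = omega(beta) * sigma_hat * np.sqrt(big)
    s_kept = np.where(s > tau, s, 0.0)  # hard threshold: keep or zero
    delta_star = (u * s_kept) @ vt
    return w_base + delta_star, int(np.count_nonzero(s_kept)), float(tau)


# Synthetic check: a rank-2 "task signal" plus small IID noise on top of random base weights.
rng = np.random.default_rng(0)
m, n = 300, 200
w_base = rng.standard_normal((m, n))
u, _ = np.linalg.qr(rng.standard_normal((m, 2)))
v, _ = np.linalg.qr(rng.standard_normal((n, 2)))
signal = (u * np.array([5.0, 3.0])) @ v.T
w_ft = w_base + signal + 0.01 * rng.standard_normal((m, n))

w_star, kept, tau = dg_hard_repair(w_base, w_ft)
print("kept directions:", kept, "threshold:", round(tau, 3))
```

On this synthetic input the repair keeps only the two planted directions, and the repaired weights end up closer to `w_base + signal` than the noisy fine-tuned weights are. Applying the function to a real checkpoint would mean looping over every in-scope weight matrix.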

Singular Value Decomposition (SVD) factorises a matrix into orthogonal left/right singular vectors and a non‑negative singular‑value spectrum, exposing the directions along which the matrix stretches space.

Independent and Identically Distributed (IID) noise assumes each residual entry in $\Delta_{\text{noise}}$ is drawn from the same zero‑mean distribution and is statistically independent of the others.

**Figure 4.** Spectral unforgetting in two views, on Llama-3.2-3B mlp.`up_proj` at layer 14, with $\Delta = W_{ft} - W_{base} \in \mathbb{R}^{8192 \times 3072}$ ($\beta = 0.375$). (a) The fine-tune delta has a sharp spectral cliff: 30 singular values (red) lie above the DG-Hard threshold $\tau^* = \omega(\beta)\hat{\sigma}$, while the remaining 3042 (gray) sit at or below the Marchenko-Pastur bulk edge $\lambda_{MP}$. Top row, (b) to (d): entry-space view of the additive identity $\Delta_{FT} = \Delta^* + (W_{ft} - W^*)$, with each pixel showing $\max |\Delta_{ij}|$ over a block of matrix entries. The full FT delta (b) is uniform speckle; the rank-30 repaired delta (c) reveals horizontal banding from the kept left-singular vectors $u_r$; the discarded component (d) is again uniform speckle, carrying no spatial structure. Bottom row, (e) to (g): spectrum-space view of the same identity, with each pixel showing $\max \sigma_r |u_r[i]|$, the contribution of singular direction $r$ to output neuron $i$. Panel (e) decomposes into the kept-only panel (f), nonzero only in its leftmost 30 columns, and the bulk-only panel (g), nonzero everywhere except those columns; their pixel-wise sum reproduces (e) exactly, because every singular direction belongs to exactly one set. Random-matrix-fit chips above the top row report the percentage match between each matrix's singular-value distribution and the Marchenko-Pastur prediction (green: structured, non-random; red: IID-noise-like): $\Delta_{FT}$ scores 93.1%, the rank-30 repaired delta $\Delta^*$ scores 1.0%, and the discarded component scores 94.0%, slightly more MP-like than $\Delta_{FT}$ itself, confirming that DG-Hard separates the two without leaving residual signal in the noise.

The evaluation protocol measures how well a method heals damage (negative FT deltas) and preserves gains (positive FT deltas) across 126 model‑task‑benchmark cells, aggregating via harmonic means to avoid over‑optimising any single axis.
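A minimal sketch of this kind of harmonic-mean aggregation is shown below. It follows the convention stated in the captions of Tables 14 and 15 (per-benchmark percentages clipped to [0, 100] before averaging, and an empty partition defaulting to 100); the function names and the example inputs are hypothetical, and the paper's exact definitions of % healed and % preserved are not reproduced here.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; zero if either score is zero."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)


def side_score(per_benchmark):
    """Clip each percentage to [0, 100], then average; an empty side defaults to 100."""
    if not per_benchmark:
        return 100.0
    clipped = [min(100.0, max(0.0, x)) for x in per_benchmark]
    return sum(clipped) / len(clipped)


def combined(pct_healed, pct_preserved):
    """HM(% healed, % preserved) over the damaged and improved partitions."""
    return harmonic_mean(side_score(pct_healed), side_score(pct_preserved))


# Hypothetical per-benchmark percentages, for illustration only.
print(round(combined([120.0, 50.0], [80.0]), 2))
print(round(harmonic_mean(99.0, 1.0), 2))
```

The second call illustrates the bottleneck property: a method that is near-perfect on one axis but near zero on the other still receives a near-zero score.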

Experimental Results

DG‑Hard dominates the recovery‑preservation trade‑off across all cohorts.

The method rests on the premise that fine‑tuning injects task‑specific signal plus IID noise, the latter forming a spectral cliff in the weight‑delta matrix that can be removed.

DG‑Hard attains the highest balanced recovery‑preservation score, leading on the reasoning model and matching the strongest baseline on the non‑reasoning model.

Figure 3 (right panel) shows DG‑Hard’s Knowledge Combined score of 84.8 alongside a Cognition Combined score of 82.1, the most balanced pair among the strong methods.

Section 4.1 lists the experimental setup: each cell is a (model, task, held‑out benchmark) triple, yielding a 126‑cell matrix across 14 fine‑tuned checkpoints.

We evaluate two models that span the reasoning / non‑reasoning split: Qwen3.5‑4B (run in thinking mode) and Llama‑3.2‑3B (standard inference).

**Figure 1.** Recovery $\times$ preservation per cohort. Each panel plots the % healed score on the damaged partition (x-axis) against the % preserved score on the improved partition (y-axis), as defined in §4.3. The ideal corner is (100, 100), and the dotted contour marks HM(% healed, % preserved) = 80. DG-Hard (blue diamond) is closest to the ideal corner across all five cohorts. FAPM [16] strongly recovers damaged measurements but sacrifices improved ones; V-SoftMask [23] preserves improved measurements but recovers less damage.

**Figure 2.** Population-level Combined score per method, sliced by cohort. Panel titles list $n = (\text{damaged/improved/unchanged})$ triple counts per cohort. DG-Hard tops Overall, Cognition, Reasoning, and Non-reasoning; L1-reg edges past DG-Hard on the small-$n$ Knowledge cohort, where its 102.0 Clean-up reflects % healed overshooting base on the 5-case damaged partition. The per-cell balance view in Tab. 10 resolves the Llama cohort into split wins between DG-Hard and WiSE-FT [13]. Methods that collapse on either Clean-up or Retention drop to a low Combined via the harmonic mean's bottlenecking property.

Table 2 breaks down % preserved by improvement bucket (mild, moderate, large); DG‑Hard consistently exceeds WiSE‑FT in every bucket.

**Figure 3.** Trade-off frontiers (both axes 0 to 100, higher is better). Left: Clean-up vs Retention. DG-Hard sits in the upper-right region where both axes are simultaneously high; V-SoftMask [23] is the retention-extreme (top-left); FAPM [16] is the cleanup-extreme (bottom-right). Right: Knowledge-cohort Combined (x) vs Cognition-cohort Combined (y). DG-Hard sits high on both (84.8 and 82.1), the most balanced strong method; WiSE-FT [13] matches on Knowledge but loses ground on Cognition (66.8) because its preservation drops there; V-SoftMask sits in the Knowledge-favouring region (67.7 vs 30.9); CoFi-Tune [27], FAPM, and LoRA [18] collapse on Cognition due to non-positive average preservation.

Table 3 reports StrongREJECT harmfulness; DG‑Hard recovers the majority of the gap toward the Base model and never degrades performance, even improving safety on Qwen.

DG‑Hard consistently outperforms baselines in combined retention and cleanup.

Empirical and Theoretical Foundations

Key ablation numbers reveal which components truly drive balanced performance.

DG‑Hard improves the Combined metric by 6.9 points over WiSE‑FT.

Table 13 shows DG‑Hard Combined = 83.3 versus WiSE‑FT Combined = 76.4.

V‑SoftMask’s Combined score collapses to 0 despite a ≈ 99 Retention score.

Table 12 reports Retention ≈ 99 for V‑SoftMask but Clean‑up near 0, yielding a near‑zero harmonic mean.

**Table.** Performance comparison across various methods. The table evaluates methods based on metrics including % healed, Non-damage, % preserved, On-task ret., Clean-up, Retention, and a Combined score.

**Table 9.** Per-cell impact of fine-tuning across the 14 (model, task) cells of our experimental matrix. Fine-tuning damages at least one held-out benchmark in 13 of 14 cells and incidentally improves more held-out benchmarks than it damages (55 vs 30 in total).

**Table 10.** Per-cell balance scores per (model, task) cell, plus the cohort aggregation of the winner column. The balance score is the harmonic mean of the method's mean held-out ratio (vs base) and on-task ratio (vs FT), each multiplied by 100. FAPM [16] is excluded because it never wins on balance.
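The balance score in the Table 10 caption can be sketched as below. One detail is an assumption: "mean held-out ratio" is read here as the mean of per-benchmark ratios (method score divided by base score), which the caption does not spell out. The function name and the example numbers are hypothetical.

```python
def balance_score(method_heldout, base_heldout, method_on_task, ft_on_task):
    """HM of the mean held-out ratio (vs base) and the on-task ratio (vs FT), each x100."""
    ratios = [mth / b for mth, b in zip(method_heldout, base_heldout)]
    heldout = 100.0 * sum(ratios) / len(ratios)
    on_task = 100.0 * method_on_task / ft_on_task
    return 2 * heldout * on_task / (heldout + on_task)


# Hypothetical scores: held-out accuracy at 80% of base, on-task accuracy at 90% of FT.
print(round(balance_score([0.4], [0.5], 0.9, 1.0), 2))
```

A method that exactly matches the base model on every held-out benchmark and the fine-tuned model on the task scores 100 under this reading.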

**Table 1.** Performance comparison of various methods across different categories including Overall, Knowledge, Cognition, Reasoning, and Non-reasoning.

The table presents metrics for different model classes, including "Mean noise", "Mean norm-share", and "Noise $\times$ norm-share".

**Table 13.** Per-cohort Clean-up / Retention / Combined for DG-Hard vs WiSE-FT [13]. Bold marks the higher Combined score in this head-to-head DG-Hard vs WiSE-FT comparison only; it is not a cross-method best-per-cohort indicator. For cross-method per-cohort winners across all baselines, see App. F (Tabs. 11 and 12); per §4.6, DG-Hard wins Combined on four of five cohorts at the cross-method level (L1-reg edges it on Knowledge).

**Table 15.** Per-benchmark held-out scores for Llama-3.2-3B-Instruct across all (task, method) cells. Bold = best per column within the task block; underline = second-best (gap ≥ 0.001 from best). Avg. = mean of the nine held-out benchmarks. Results = on-task task_{name} score. Combined = HM(% healed, % preserved) on the partitioned held-out set, with per-benchmark % healed and % preserved clipped to [0, 100] before averaging. Pre-trained and Full-SFT have no defined Combined. Bold/underline on Combined follow the same convention as other columns, restricted to the seven repair methods (excluding the two reference rows). “Repair methods” here covers both training-time interventions (L1-reg, V-SoftMask [23], CoFi-Tune [27], LoRA [18]) and post-hoc methods (WiSE-FT [13], FAPM [16], DG-Hard); see §2 for the full bucketing.

Extended Performance Data

DG‑Hard raises average accuracy by 12 percentage points over the baseline.

DG‑Hard lifts the mean task accuracy from 0.8464 (baseline) to 0.9674.

Baseline mean = 0.8464 (e.g., 0.8464, 0.8270, 0.8425 …); DG‑Hard mean = 0.9674 (e.g., 0.9674, 0.9651, 0.9630 …).

Across the board, DG‑Hard reduces the spectral‑cliff‑induced degradation that the baseline suffers, yielding uniformly higher scores. The improvement is most pronounced on tasks where the baseline hovered around 0.75, jumping to above 0.90 after repair.

Benchmark Results

DG‑Hard achieves the top combined healed‑preserved score on Llama‑3.2‑3B‑Instruct.

DG‑Hard attains the highest combined healed‑preserved score on the held‑out benchmark set.

Table 15 shows a Combined HM of 0.9664 for DG‑Hard, exceeding the next best 0.9638.

Additional Benchmark Metrics

DG‑Hard’s per‑class noise reduction outperforms all baselines across six model‑task pairs.

DG‑Hard (Ours) achieves the lowest noise‑mass product across all three examined layers.

Table 16 shows the product “Noise × norm‑share” for DG‑Hard is 0.12, whereas the next‑best baseline (CoFi‑Tune) records 0.14.

The preceding matrices list per‑method accuracies on seven benchmark QA tasks (MedQA … RTE) and the corresponding per‑class noise statistics. Across all methods, DG‑Hard consistently yields the smallest “Noise × norm‑share” values, confirming that its optimal hard‑thresholding repair effectively suppresses the IID noise identified by the spectral cliff.

Appendix Overview

This section catalogs the appendix contents and key experimental details.

The appendix is organized into a hierarchy of sections (A–I) that expand on the paper’s assumptions, derivations, metrics, and reproducibility details.

Section A presents empirical support for the signal‑plus‑noise decomposition $\Delta = \Delta_{\text{signal}} + \Delta_{\text{noise}}$, showing that fine‑tuning damage concentrates on a few benchmarks while the bulk behaves like i.i.d. noise.

Section B derives the Donoho‑Gavish noise estimator used to compute the noise scale $\hat{\sigma}$ from the median singular value of $\Delta$.

Section C explains the suite of evaluation metrics (e.g., % healed, % preserved, non‑damage, on‑task retention) and why a harmonic‑mean aggregation is required.

Section D details the experimental setup, including the fine‑tuning tasks, held‑out benchmarks, training hyperparameters, and the scope of the repair operation.

D.1 lists the seven fine‑tuning tasks (RTE, StrategyQA, ReClor, BoolQ, MedQA, WikiQA, Winogrande) and their dataset splits.

D.2 enumerates the nine cross‑domain held‑out benchmarks (ARC‑Challenge, GSM8K, HellaSwag, IFEval, Math‑500, MMLU, MNLI, TriviaQA, TruthfulQA) used for evaluation.

D.3 provides the uniform training hyperparameters (optimizer, learning‑rate schedule, batch size, precision, etc.) applied to every (model, task) cell.

D.4 defines the repair scope: all tensors with $ndim \geq 2$ and ≥ 1 024 elements are processed; 1‑D parameters are left untouched.

Sections E–G report per‑cell forgetting, balance scores, and a full breakdown of cohort‑level sub‑scores.

Section H aggregates results across every (model, task, method, benchmark) combination and includes method‑specific configuration details (H.1) and inference‑engine settings (H.2).

Section I contains additional experiments and ablations: layer‑wise noise concentration (I.1), a layer‑mask causal test (I.2), and a comparison with element‑wise merging baselines (I.3).
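The repair scope described for D.4 amounts to a simple shape filter. The sketch below applies it to a dictionary of parameter shapes; the parameter names and shapes are made up for illustration, and the helper name is hypothetical.

```python
from math import prod


def in_repair_scope(shape):
    """True for tensors with at least 2 dimensions and at least 1,024 elements."""
    return len(shape) >= 2 and prod(shape) >= 1024


# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "layers.14.mlp.up_proj.weight": (8192, 3072),
    "layers.14.input_layernorm.weight": (3072,),
    "tiny.projection.weight": (16, 32),
}
selected = [name for name, shape in shapes.items() if in_repair_scope(shape)]
print(selected)
```

Only the large 2-D matrix passes: the layer-norm weight is 1-D, and the 16 × 32 matrix has 512 elements.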

**Table 5.** The seven fine-tuning tasks forming the row axis of the experimental matrix. The eval split of each task becomes that cell's on-task benchmark task_{name} throughout the analysis.

**Table 6.** The nine held-out cross-domain benchmarks evaluated identically across every (model, task, method) combination. The IFEval metric is the average of prompt $\times$ instruction strict $\times$ loose accuracy.

The table outlines the evaluation benchmarks used, including the specific Hugging Face dataset, the number of evaluation prompts (Eval n), the judge model or method employed, and the aggregate metric calculated for each benchmark.

**Table.** Training hyperparameters, including the optimizer, learning rate, batch size, and precision settings.

**Table.** Per-axis trends.

Methodological Configurations

Appendices H and I detail method configurations, inference settings, and extra experiments supporting DG‑Hard.

Section H.1 lists the concrete hyper‑parameters used for each repair baseline evaluated in the paper.

Section H.2 describes the inference stack: vLLM with eager execution, max model length 8192, bfloat16, and greedy decoding (temperature 0).

**Table.** Comparison of mean noise levels across different model layers for Llama and Qwen families.

Section I aggregates three auxiliary experiments that justify design choices for DG‑Hard.

**Table 14.** Per-benchmark held-out scores for Qwen3.5-4B across all (task, method) cells. Bold = best per column within the task block; underline = second-best (gap $\ge$ 0.001 from best). Avg. = mean of the nine held-out benchmarks. Results = on-task task_{name} score. Combined = HM(% healed, % preserved) on the partitioned held-out set, with per-benchmark % healed and % preserved clipped to $[0, 100]$ before averaging; cells with no damaged or no improved benchmarks default the corresponding side to 100. Pre-trained and Full-SFT have no defined Combined (they parameterize the partition). Bold/underline on Combined follow the same convention as other columns, restricted to the seven repair methods (excluding the two reference rows). "Repair methods" here covers both training-time interventions (L1-reg, V-SoftMask [23], CoFi-Tune [27], LoRA [18]) and post-hoc methods (WiSE-FT [13], FAPM [16], DG-Hard); see §2 for the full bucketing.
