Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

Q: What is the main contribution of this paper?

The paper introduces Alternating Token-Weighted Unlearning (ATWU), a method that treats token-level forget-specificity as a latent variable and uses a lightweight linear scorer over LLM hidden states to identify which tokens to suppress during unlearning, achieving state-of-the-art forget-retain trade-offs without requiring external annotations.

Q: What problem does ATWU address?

ATWU addresses the problem that standard LLM unlearning methods apply a uniform forget loss to all tokens in a forget sample, including structural tokens like function words that do not encode targeted knowledge, which degrades the model's general linguistic capabilities while failing to precisely remove the unwanted information.

Q: Why is token-level weighting necessary for LLM unlearning?

Forget samples contain both targeted knowledge-bearing tokens and structural linguistic patterns; applying a uniform forget loss to the entire sample harms general language utility, so selective weighting allows the model to target only the tokens that actually encode the unwanted knowledge.

Q: How does ATWU work technically?

ATWU parameterizes token forget-specificity using a linear projection over the LLM's hidden representations, producing a continuous score per token via a sigmoid function. It then alternates between updating the language model parameters using the token-weighted forget objective and updating the scorer parameters, with entropy and quadratic penalty regularizers to encourage binary, budget-constrained selections.

Q: What is the 'retain conflict' concept used in ATWU?

Retain conflict is the key criterion for identifying forget-specific tokens: a token is deemed forget-specific if suppressing it does not force the model away from its retain-optimal state, meaning its removal targets only unwanted knowledge rather than general linguistic structure.

Q: What forget loss does ATWU use and why?

ATWU uses a saturated negative cross-entropy loss (called SATGA+), where the learned token score is injected into both the weighting coefficient and the saturation exponent; this bounds the loss below, controls the target suppression probability via hyperparameter β, and smoothly attenuates uncertain tokens during early optimization before the scorer converges.

Q: What benchmarks and models are used to evaluate ATWU?

ATWU is evaluated on TOFU (a synthetic QA benchmark with fictitious author biographies, using Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct across forget01, forget05, and forget10 splits) and RWKU (a real-world knowledge benchmark targeting public figures, using Phi-3-Mini-4k-Instruct on a fixed ten-subject batch).

Q: What metrics are used to evaluate unlearning performance?

The paper uses Forget Quality (FQ, the relative reduction in worst-case forget-set judge score), Retain Degradation (RD, the relative loss in retain-set judge score), and Unlearning Quality (UQ = [FQ − RD]+, the net forget-retain trade-off), alongside auxiliary metrics including Extraction Strength (ES), judge-based robustness scores (JP, JICR, JW), MMLU accuracy, Repetitiveness, Win Rate, and ROUGE-L-based RWKU scores.

Q: What are the key results of ATWU?

ATWU achieves the best Unlearning Quality (UQ) on both the TOFU forget10 benchmark with Llama-3.1-8B-Instruct and the canonical RWKU ten-subject batch with Phi-3-Mini-4k-Instruct, while preserving utility close to the original checkpoint; competing methods such as JENSUN forget aggressively but collapse generation quality, and RMU does not match ATWU's consistent performance across all six metric panels.

ATWU learns to identify and suppress forget-specific tokens by treating unlearning as a joint optimization problem.

How can we automatically identify and weight the specific tokens in a "forget" sample that actually contain the targeted knowledge, rather than treating all tokens in the sample as equally important?

Large language models often memorize sensitive or harmful data, but standard unlearning methods struggle to remove this information without degrading the model's general linguistic capabilities. The authors introduce Alternating Token-Weighted Unlearning (ATWU), which treats token-level forget-specificity as a latent variable. It uses a lightweight linear scorer to identify which tokens to suppress by analyzing the conflict between the forget objective and the retain objective. ATWU achieves state-of-the-art forget-retain trade-offs on standard benchmarks, outperforming both sample-level methods and heuristic-based token weighting without requiring external supervision.

Paper Primer

ATWU identifies forget-specific tokens by measuring "retain conflict": a token is deemed forget-specific if suppressing it does not force the model away from its retain-optimal state. The method uses a linear projection over hidden states to score tokens, alternating between updating the model parameters and the scorer to ensure the selection process remains grounded in the model's own representations.

ATWU achieves superior unlearning quality (UQ) compared to existing token-weighted and sample-level baselines.

On the TOFU benchmark, ATWU achieves a UQ of 91.7, outperforming the runner-up (RMU) by 5.8 percentage points. On the RWKU benchmark, it achieves a UQ of 58.7, a 6.8 percentage point improvement over the next-best method (WGA).

The authors demonstrate that high-quality token scores are insufficient on their own; the unlearning objective must also be adaptive. By injecting the learned token scores into a saturated forget loss, ATWU prevents over-forgetting and stabilizes the optimization process, effectively bridging the gap to supervised oracle performance.

Why is token-level weighting necessary for unlearning?

Forget samples contain both targeted information and structural linguistic patterns (like function words). Applying a uniform forget loss to the entire sample degrades general language utility; selective weighting allows the model to target only the specific tokens that encode the unwanted knowledge.

How does ATWU avoid the need for external annotations?

ATWU treats token weights as latent variables in a joint optimization problem. Because the model's hidden states already encode linguistic structure, the scorer learns to identify forget-specific tokens directly from the interaction between the retain and forget objectives.

Introduction to Token-Weighted Unlearning

We show that uniform token forgetting harms model performance, which motivates token‑aware unlearning.

Machine unlearning seeks to erase specific knowledge from a trained model while keeping its general abilities intact. In autoregressive language models, a forget sample contains many tokens that are merely structural and do not encode the targeted information, so applying the forget loss uniformly harms the model’s overall performance. The core problem is that existing methods treat all tokens as equally important, leading to indiscriminate forgetting of both relevant and irrelevant tokens.

Given a trained model, we want to remove the influence of a specific set of training examples without degrading its ability to perform other tasks.

**Figure 1.** (a) Token-level scores on a TOFU forget sample. SEUL, SU-LLM, and SATIMP assign substantial weight to both forget-specific and structural tokens, whereas ATWU concentrates on the bold ground-truth forget-specific span. (b) ATWU can be combined with diverse forget losses. For each objective, DPO, NPO, SIMNPO, and SATGA, solid bars denote baseline Unlearning Quality (UQ), and added segments denote the gains from using ATWU token weights with the corresponding forget loss. ATWU consistently improves the forget-retain trade-off.

The core obstacle to effective unlearning is indiscriminate token forgetting.

Prior Approaches to Unlearning

We position our approach among sample‑level, token‑weighted, probing, and evaluation studies.

Related work clusters into four strands: sample‑level unlearning, token‑weighted variants, probing of hidden representations, and robust evaluation metrics.

Instead of treating every token in a forget sample as equally responsible, the method learns a per‑token weight that highlights the tokens actually carrying the target knowledge.

Sample‑level, token‑agnostic methods define the forget target at the granularity of whole examples, documents, or QA pairs. Within this family, some methods maximize cross‑entropy loss on forget samples to push the model away from memorized content; others combine ascent on forget samples with descent on retain samples or regularization to balance deletion and preservation; and others frame unlearning as an alignment problem, using refusal‑style or negative‑preference objectives to steer the model away from undesired outputs. Further variants replace separate forget and retain losses with a Jensen‑Shannon divergence objective, yielding stable unlearning dynamics; perturb internal representations associated with forget samples to diminish memorized content; or distill the model toward adjusted output distributions that lower the probability of memorized tokens.

Token‑weighted approaches augment a sample‑level objective with a mechanism that estimates which tokens actually encode the target knowledge. Some train a separate model on retain and forget splits and compare predictions to infer token‑specific forget relevance; some use model confidence, entropy, or loss magnitude as cheap proxies for token forget‑specificity; and some leverage linguistic parsers or LLM‑generated annotations to flag forget‑specific tokens.

Probing literature shows that hidden states already encode rich linguistic attributes, motivating a linear scorer over those representations. One line of work trains simple classifiers on frozen hidden activations to recover part‑of‑speech, morphology, syntax, and semantics; another analyzes hidden‑state geometry to reveal syntactic relations directly, beyond downstream decoding.

Evaluation of unlearning balances deletion success against retained utility, using benchmarks such as TOFU, MUSE, and RWKU. TOFU is a fictitious‑author QA benchmark measuring both forget quality and downstream utility. MUSE extends evaluation to broader corpora, adding memorization, privacy leakage, and scalability metrics. RWKU is a real‑world entity forgetting benchmark that also measures reasoning, factuality, and fluency. Robustness studies further show that models can appear to forget while still retrieving information under paraphrase or perturbation.

**Figure 3.** Token-level forget-specificity from SEUL, SU-LLM, SU-NGRAM, FUNDIAL, ETW, WGA, SATIMP, and ATWU on two TOFU forget samples. Shading reflects each method's raw token score; bold spans mark the ground-truth forget-specific tokens; ATWU concentrates most clearly on the answer-bearing spans.

Formalizing Token-Weighted Unlearning

We formalize unlearning as a joint optimization problem that identifies forget-specific tokens via retain conflict.

Machine unlearning often fails because it treats all tokens in a forget sample as equally important. We formalize the token-selection problem by distinguishing forget-specific tokens (those carrying targeted knowledge) from structural tokens that preserve linguistic patterns.

Retain conflict measures the cost of suppressing a specific token while trying to stay within the model's optimal performance range on retain data.

Computing the singleton conflict $\kappa_i$ for every token is computationally intractable. Instead, we show that we can recover the oracle token weights $z^*$ by solving a joint optimization problem over both model parameters $\theta$ and a binary weight vector $z$ under a fixed budget constraint.

Under Assumption A1 (Retain-conflict separation), if $(\hat{\theta}, \hat{z})$ is a global minimizer of the joint objective with budget $\rho \le \rho^*$, then the support of $\hat{z}$ is contained within the set of true forget-specific tokens $\mathcal{F}^*$. If $\rho = \rho^*$, the joint optimization perfectly recovers the oracle labels and the oracle model parameters.

The ATWU Algorithm

4 Alternating Token-Weighted Unlearning

We now turn the joint formulation into a practical learning algorithm: Alternating Token‑Weighted Unlearning (ATWU). We first relax the discrete optimization problem, then introduce the hidden‑state scorer, specify and justify the selection of forget loss used in our experiments, and finally describe the alternating optimization procedure.

Lagrangian relaxation. The constrained problem (4) is combinatorial in z. We therefore instead optimize a continuous penalized objective

$$ \tilde{L}(\theta, z)=L(\theta, z)+\lambda_H\sum_{x,t\in z} H(z_{x,t})+\lambda_\rho\frac{1}{N_F}\sum_{x,t\in z}\bigl(z_{x,t}-\rho\bigr)^2, $$

where $H(z)=-z\log z-(1-z)\log(1-z)$ is the binary entropy. The objective $\tilde{L}(\theta, z)$ relaxes the original combinatorial problem (4) on two fronts: the binary constraint $z\in\{0,1\}^{N_F}$ is relaxed to $z\in[0,1]^{N_F}$, and the budget constraint $\sum_{x,t} z_{x,t}= \rho N_F$ is replaced by the quadratic penalty. The entropy term counters the first relaxation by pushing each $z_{x,t}$ back towards $\{0,1\}$. This relaxation can be exact under sufficiently large regularization.

Lemma 4.1 (Exactness of the relaxation, informal). For sufficiently large $\lambda_H$ and $\lambda_\rho$, every global minimizer of the relaxed objective (5) over $[0,1]^{N_F}$ is binary, satisfies the budget constraint, and is a global minimizer of the constrained problem (4).
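The two regularizers in the relaxed objective can be sketched in a few lines of plain Python. This is only an illustration of the penalty terms as written in the formula above; the function names (`binary_entropy`, `relaxation_penalty`) and the list-of-floats representation of $z$ are assumptions made for the example, not part of the paper.

```python
import math


def binary_entropy(z: float) -> float:
    """H(z) = -z log z - (1 - z) log(1 - z), with H(0) = H(1) = 0 by continuity."""
    if z <= 0.0 or z >= 1.0:
        return 0.0
    return -z * math.log(z) - (1.0 - z) * math.log(1.0 - z)


def relaxation_penalty(z, rho, lambda_h, lambda_rho):
    """Entropy term plus quadratic budget term for relaxed weights z in [0, 1].

    Mirrors the penalized objective term by term:
    lambda_H * sum_t H(z_t) + lambda_rho * (1 / N_F) * sum_t (z_t - rho)^2.
    """
    n_f = len(z)
    entropy_term = sum(binary_entropy(v) for v in z)
    budget_term = sum((v - rho) ** 2 for v in z) / n_f
    return lambda_h * entropy_term + lambda_rho * budget_term
```

The entropy term vanishes exactly when every weight is 0 or 1 and is largest at 1/2, which is why increasing $\lambda_H$ pushes the relaxed weights back toward binary values.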

The formal statement and proof are given in Lemma B.1. While this free‑token‑weight formulation provides strong theoretical guarantees, learning an independent scalar for every token is unscalable for large corpora and cannot generalize to unseen sequences. To bridge this gap from theory to practice, ATWU parameterizes the selector using a shared scoring function $g_w$. Under this scalable parameterization, the entropy and budget terms naturally transition from enforcing exact combinatorial constraints to acting as principled regularizers, successfully guiding the network toward sparse, budget‑controlled token selection.

To instantiate this shared scorer, we parameterize token forget specificity using a simple linear projection over the language model’s hidden representations. This architectural choice is directly motivated by the structural probing literature: because intermediate hidden states natively encode rich linguistic and semantic properties (Tenney et al., 2019), they provide an ideal representation space for identifying forget‑specific patterns without requiring a complex auxiliary network. Let $h_\theta(x_t)\in\mathbb{R}^d$ denote the $t$-th hidden representation produced by $p_\theta$ for the sequence $x$. ATWU defines

$$ \hat{z}_{x,t}=g_w\bigl(h_\theta(x_t)\bigr)=\sigma\bigl(\langle w,\,h_\theta(x_t)\rangle\bigr)\in(0,1), $$

where $w\in\mathbb{R}^d$ is the learned scorer parameter.

This single linear projection keeps the mechanism lightweight while encouraging reusable token‑level patterns rather than independent per‑token decisions. We initialize $w=0_d$, meaning all tokens initially receive a uniform score of $1/2$. Consequently, the objective begins token‑agnostic and smoothly becomes selective during training. As a proof of concept, Appendix E.1 shows that this exact linear architecture can recover ground‑truth forget labels when trained with explicit supervision, i.e., the scorer model has enough capacity to model the true forget specificity; ATWU, however, learns the scores entirely implicitly from the unlearning objective.
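A minimal sketch of the scorer $g_w$, assuming hidden states are given as plain Python lists of floats (in practice they would be tensors produced by the language model). The names `sigmoid` and `token_scores` and the toy hidden states are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def token_scores(w, hidden_states):
    """Score each token as sigma(<w, h_t>), one score in (0, 1) per hidden state."""
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h))) for h in hidden_states]


# Toy hidden states for two tokens (d = 3); values are made up.
hidden = [[0.3, -1.2, 0.7], [2.0, 0.1, -0.4]]

# With the zero initialization w = 0_d, every token receives the score 1/2.
print(token_scores([0.0, 0.0, 0.0], hidden))  # prints [0.5, 0.5]
```

Because the same vector $w$ is shared across all tokens, the scorer has only $d$ parameters regardless of corpus size, which is the scalability argument made above.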

Choice of loss functions. We use standard cross‑entropy as the retain loss and a saturated negative cross‑entropy introduced by Wang et al. [2025b] as the base forget loss. For a hyperparameter $\beta>0$, the token‑wise forget loss is defined as

$$ \ell_f(x_t\mid x_{<t};\theta):=p_\theta(x_t\mid x_{<t})^{\beta}\,\log p_\theta(x_t\mid x_{<t}). $$

This loss formulation incorporates the saturation weighting mechanism utilized by Wang et al. [2025b] inside the loss, which we will refer to as SATGA. While saturation alone is a poor proxy for token‑level forget‑specificity (as evidenced by the poor AUROC of the saturation score in Fig. 2), it provides highly desirable gradient‑scaling properties for the loss function.

Crucially, unlike raw negative cross‑entropy $\ell_{\text{GA}}$, this saturated variant is bounded below. As a function of $p\in(0,1]$, the term $p^{\beta}\log p$ reaches a minimum of $-1/(e\beta)$ at $p=e^{-1/\beta}$. Consequently, the hyperparameter $\beta$ effectively controls the target probability level for the forget update, with smaller $\beta$ values enforcing stronger suppression.
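The closed-form minimum follows from setting the derivative $p^{\beta-1}(\beta\log p+1)$ to zero. The short script below checks it numerically against a brute-force grid search; the function names and the grid resolution are illustrative choices, not taken from the paper.

```python
import math


def satga(p: float, beta: float) -> float:
    """Token-wise saturated forget loss p^beta * log p for p in (0, 1]."""
    return (p ** beta) * math.log(p)


def analytic_minimum(beta: float):
    """Closed-form minimizer p* = exp(-1/beta) and minimum value -1/(e * beta)."""
    return math.exp(-1.0 / beta), -1.0 / (math.e * beta)


def grid_minimum(beta: float, n: int = 100_000):
    """Brute-force minimizer over the grid {1/n, 2/n, ..., 1}."""
    best_p = min((i / n for i in range(1, n + 1)), key=lambda p: satga(p, beta))
    return best_p, satga(best_p, beta)


for beta in (0.5, 1.0, 2.0):
    p_star, f_star = analytic_minimum(beta)
    p_grid, f_grid = grid_minimum(beta)
    print(f"beta={beta}: analytic p*={p_star:.4f}, grid p*={p_grid:.4f}")
```

The minimizer $e^{-1/\beta}$ decreases as $\beta$ decreases, which matches the statement that smaller $\beta$ targets a lower probability and hence stronger suppression.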

To instantiate our general joint‑learning framework using this bounded objective, we introduce a slight modification: we also inject the dynamically learned token score into the saturation exponent. We refer to this score‑modulated variant as SATGA+. Plugging this into the joint formulation yields our primary ATWU objective:

$$ L_{\text{ATWU}}(\theta,w)=\alpha R(\theta)+\gamma\sum_{x,t} g_w(x_t)\,p_\theta(x_t\mid x_{<t})^{\beta g_w(x_t)}\log p_\theta(x_t\mid x_{<t}) +\lambda_H\sum_{x,t} H\bigl(g_w(x_t)\bigr) +\lambda_\rho\frac{1}{N_F}\sum_{x,t}\bigl(g_w(x_t)-\rho\bigr)^2, $$

where $g_w(x_t):=g_w\bigl(h_\theta(x_t)\bigr)$. For strictly binary scores, the SATGA+ modification reduces to standard SATGA. However, because our scores $g_w$ are continuous during training, the modified exponent forces uncertain tokens to exert a much smoother, attenuated forget update. This stabilizes the early phases of optimization before the scorer has fully converged, an effect we ablate directly in Table 3. The coefficients $\alpha$ and $\gamma$ denote the retain and forget loss weights, respectively. While we absorbed these constants into the base loss definitions in earlier sections for notational simplicity, we make them explicit here to reflect our exact empirical objective.
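The per-token forget term of the objective can be written out directly. The sketch below uses scalar probabilities and scores rather than tensors, and the function names are hypothetical; it only illustrates the reduction to SATGA at binary scores, not the training dynamics.

```python
import math


def satga_token_loss(p: float, beta: float) -> float:
    """Base saturated forget loss: p^beta * log p."""
    return (p ** beta) * math.log(p)


def satga_plus_token_loss(p: float, g: float, beta: float) -> float:
    """Score-modulated term: g * p^(beta * g) * log p.

    The score g in [0, 1] appears both as the weighting coefficient
    and inside the saturation exponent.
    """
    return g * (p ** (beta * g)) * math.log(p)


p, beta = 0.8, 2.0
# g = 1 recovers the base SATGA term; g = 0 removes the token from the forget loss.
print(satga_plus_token_loss(p, 1.0, beta) == satga_token_loss(p, beta))
print(satga_plus_token_loss(p, 0.0, beta) == 0.0)
```

In the full objective these per-token terms are summed over the forget set and scaled by $\gamma$, alongside the retain loss and the two regularizers.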

Alternating optimization. We optimize $L_{\text{ATWU}}(\theta,w)$ by alternating between language‑model and scorer updates. With the scorer $w$ fixed, the language‑model parameters $\theta$ are updated using the current token‑weighted forget objective together with the retain objective. With $\theta$ fixed, the scorer $w$ is updated to improve token selection under the same regularized objective. During model updates, the scores are detached and treated as fixed coefficients; during scorer updates, $\theta$ is frozen and gradients flow only through $w$. This scheduled alternation is empirically more stable than updating the scorer and model in lockstep: the scorer changes the effective forget objective, while the model changes the hidden‑state geometry on which the scorer depends. Updating them on separate timescales reduces this feedback loop. Finally, in Appendix E.3 we present an ablation on the update frequency that shows that the alternating variant outperforms joint updates.
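The alternation can be pictured as a simple update schedule. The sketch below is an assumption about how such a schedule might be laid out: it supposes the model is updated at every step and the scorer once after every `n_s`-th model step, with `n_s` playing the role of the scorer update interval. The exact interleaving used in the paper is not specified here, and `alternating_schedule` is a hypothetical helper.

```python
def alternating_schedule(num_steps: int, n_s: int):
    """Yield (step, phase) pairs for an alternating update scheme.

    Assumption (not confirmed by the text): one model update per step,
    plus one scorer update after every n_s-th model step.
    """
    for step in range(1, num_steps + 1):
        # Model phase: token scores are detached and act as fixed coefficients.
        yield step, "model"
        if step % n_s == 0:
            # Scorer phase: theta is frozen, gradients flow only through w.
            yield step, "scorer"


for step, phase in alternating_schedule(num_steps=10, n_s=5):
    print(step, phase)
```

In a real training loop, each "model" entry would correspond to an optimizer step on $\theta$ and each "scorer" entry to an optimizer step on $w$, using the same regularized objective in both phases.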

Table 1: ATWU achieves the best UQ on both benchmarks while preserving utility close to the original checkpoint, the only method to do so consistently across the six metric panels. Left: TOFU forget10 with Llama‑3.1‑8B‑Instruct; right: canonical RWKU ten‑subject batch with Phi‑3‑Mini‑4k‑Instruct. Higher‑FQ competitors such as JENSUN forget aggressively but collapse generation quality; RMU matches ATWU on retain degradation but trails on UQ. Best (bold) and second‑best (underlined) per column; utility values that are materially degraded (MMLU drop > 2 pp, Rep. drop > 5 %, or WR < 45) are shown in red.

FQ: forget quality, the relative reduction in worst‑case forget‑set judge score. RD: retain degradation, the relative loss in retain‑set judge score. UQ = [FQ − RD]$^+$: net forget–retain trade‑off. Definitions in Appendix C.5.
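As a small worked example, UQ is the positive part of FQ minus RD. The helper name below is an assumption; the numeric check uses the frozen-scorer ablation values quoted in the ablation section (FQ = 73.7, RD = 2.1, UQ = 71.6).

```python
def unlearning_quality(fq: float, rd: float) -> float:
    """UQ = [FQ - RD]^+ : net forget-retain trade-off, clipped at zero."""
    return max(fq - rd, 0.0)


# Frozen-scorer ablation values quoted in the text: FQ = 73.7, RD = 2.1.
print(round(unlearning_quality(73.7, 2.1), 1))

# A method whose retain degradation exceeds its forget quality scores zero.
print(unlearning_quality(10.0, 30.0))
```

The clipping means a method cannot obtain a negative score: once retain degradation outweighs forgetting, UQ is simply zero.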

Experimental Evaluation

ATWU’s unlearning quality and token‑scorer performance across benchmarks.

The ATWU framework learns token‑level importance scores that focus forgetting on knowledge‑bearing tokens, preserving the model’s structural capabilities.

ATWU attains the highest unlearning quality on both benchmarks, with $UQ$ = 91.7 on TOFU forget10 and $UQ$ = 58.7 on RWKU, beating the next‑best methods by 5.8 and 6.8 percentage points respectively.

Table 1 reports these scores (left panel: TOFU forget10; right panel: RWKU ten‑subject batch); the runner‑up on TOFU is RMU and on RWKU is WGA.

**Figure 10.** ROC curves for token-level forget-relevance detection on TOFU forget10. ATWU obtains the highest AUROC among the compared scoring methods; the dashed diagonal denotes random scoring.
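For readers unfamiliar with the metric, AUROC for token-level detection is the probability that a randomly chosen forget-specific token receives a higher score than a randomly chosen structural token, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and toy data are illustrative and are not the paper's evaluation code.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: per-token forget-specificity scores.
    labels: 1 for ground-truth forget-specific tokens, 0 for structural tokens.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative token")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example with made-up scores for four tokens.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # prints 0.75
```

A scorer that assigns the same value to every token, such as the zero-initialized scorer, lands on the random-scoring diagonal with an AUROC of 0.5.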

**Table 2.** RWKU ten-subject batch results of ATWU with various forget losses. $\Delta$UQ reports the gain over each method's vanilla counterpart in Table 1.

**Table 4.** ATWU outperforms other token-weighted unlearning approaches on TOFU forget10 (Llama-3.2-1B-Instruct). Best and second-best methods are highlighted.

Evaluation Metrics

We detail the metric suite used to assess forgetting, robustness, and utility.

Evaluating an unlearned model requires assessing three orthogonal properties: (i) thorough removal of targeted knowledge, (ii) robustness of that removal to prompt perturbations, and (iii) preservation of retained knowledge and general utility.

The evaluation framework combines the TOFU (Task of Fictitious Unlearning) suite with the RWKU (Real‑World Knowledge Unlearning) benchmark to capture forgetting, robustness, and downstream utility in a single panel.

How does TOFU differ from the RWKU benchmark?

TOFU evaluates forgetting at the token level (extraction strength) and with LLM‑as‑judge robustness attacks, whereas RWKU uses ROUGE‑L on cloze, QA, and adversarial probes and serves mainly as a cheap tuning surrogate; the two complement each other rather than replace one another.

C.1 introduces Extraction Strength ($\text{ES}$), which measures how many answer tokens can be recovered by feeding progressively longer prefixes of the ground‑truth answer to the model.

C.2 defines four judge‑based robustness metrics: paraphrase score ($\text{JP}$), in‑context‑relearning score ($\text{JICR}$), their worst‑case combination ($\text{JW}$), and average retain quality ($\text{JAVG}$).

C.3 reports three utility probes: MMLU accuracy for world knowledge, Repetitiveness (Rep.) for n‑gram degeneration, and Win Rate (WR) from a side‑by‑side LLM‑as‑judge comparison.

C.4 describes the native RWKU panel, where ROUGE‑L recall is computed on fill‑in‑the‑blank (FB), open‑form QA, and adversarial (AA) probes; the aggregate scalar $N_{\Delta}$ is the difference between neighbor‑set and forget‑set “All” scores.

C.5 normalizes raw scores against the original checkpoint, yielding Forget Quality ($\text{FQ}$), Retain Degradation ($\text{RD}$), and the combined Unlearning Quality ($\text{UQ}$) used for final method ranking.

Main Results

Key hyperparameter settings and the resulting unlearning quality across methods.

We report the hyperparameter configurations selected by our two‑stage tuning and the resulting unlearning performance on the TOFU and RWKU benchmarks.

ATWU attains the highest unlearning quality on TOFU forget01.

ATWU UQ = 0.62 versus the strongest baseline UQ = 0.58.

**Table 6.** Selected hyperparameters on TOFU. 1B and 8B are the learning rates used on Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct, respectively. The method-specific columns are overloaded across methods: $\beta$ for DPO/NPO/SIMNPO/WGA, $\beta_1$ for SATIMP, $\tau$ (temperature) for ETW, $c$ (steering coefficient) for RMU; $\delta$ for SIMNPO, $\beta_2$ for SATIMP, slr for ATWU, $\ell$ (target layer) for RMU. ETW was evaluated only on 1B and RMU only on 8B; the non-evaluated column is marked —.

**Table 7.** Selected hyperparameters on the canonical ten-subject RWKU batch. The learning rate is the one used on Phi-3-Mini-4k-Instruct.

**Figure 7.** Agreement between the GPT-5.4 mini judge and human annotators on a 440-row sample of TOFU forget01. The judge matches the human label on ~96% of calls; errors split as 8 false positives and 7 false negatives.

Extended Results

Extended evaluation results across the TOFU and RWKU benchmarks.

This section reports the full set of quantitative results that complement the main findings, focusing on the performance of ATWU relative to prior unlearning methods.

Ablation Studies

We isolate which components of ATWU drive its performance by systematic ablations.

We now present a series of ablations that answer the core question “does removing this component hurt?” for each design choice in ATWU.

A single linear function of the final‑layer hidden states can already separate forget‑relevant tokens from structural ones when supervision is available.

Training a one‑layer linear scorer on Llama‑3.1‑8B‑Instruct for ~20 % of an epoch yields an AUROC distribution concentrated above 0.9 and perfectly recovers the ground‑truth forget span on the qualitative example.

Without any token‑level labels, the retain‑conflict objective can still learn a linear scorer that isolates answer‑bearing tokens, provided the scorer is trained against a retain model.

When the scorer is paired with the retain model (which never saw the forget set), the unsupervised objective recovers the same token subset (e.g., “community”, “LGBTQ”) that the supervised scorer finds.

ATWU’s scorer is not tied to a single forget loss; we can replace the forget‑side term in any autoregressive loss with its scorer‑weighted analogue while leaving the rest of the objective untouched.

**Figure 8.** **Supervised baseline.** A linear scorer trained with binary cross-entropy against the GT labels of Zhou et al. [2026] cleanly recovers the forget span after a fraction of an epoch. Bold tokens mark the ground-truth forget-relevant span.

**Figure 9.** Unsupervised scorer. Trained against the original target model (top), the ATWU objective fails to localize the GT forget span; trained against a retain model (bottom), the same objective recovers the answer-bearing tokens. Bold tokens mark the ground-truth forget-relevant span.

Replacing the forget‑side term in DPO, NPO, and SIMNPO with the scorer‑weighted version improves forget‑specific performance without altering the retain side.

Table 12 lists the exact replacements (e.g., r(s⁻) → $r^g$(s⁻) for DPO) and the experiments confirm that only the forget sequence receives scorer‑weighted token mass.

The population‑penalty regularizer ($\lambda_\rho$) is the primary driver of scorer quality.

Configurations with $\lambda_\rho = 10$ achieve FQ ≥ 65, whereas any configuration with $\lambda_\rho = 0$ remains in the 49–52 range, regardless of the entropy ($\lambda_H$) or $\ell_2$ ($\lambda_{\ell_2}$) settings.

Freezing a converged scorer (TF) yields the highest relative performance among the training‑procedure variants.

Table 14a shows TF reaches FQ = 73.7, UQ = 71.6, and RD = 2.1, surpassing the headline online run (FQ ≈ 70.5, UQ ≈ 66.6).

Updating the scorer every step ($n_s$ = 1) collapses performance, while updating every five steps ($n_s$ = 5) gives the best trade‑off.

Table 14b reports FQ = 31.3 for $n_s$ = 1, versus FQ = 70.5 for $n_s$ = 5 and FQ = 67.0 for $n_s$ = 10; the joint‑update variant only reaches FQ = 39.9.

Overall, these ablations demonstrate that ATWU’s gains stem from the learned scorer’s ability to focus the forget update, rather than from any particular loss formulation.

Experimental Setup

Details of benchmarks, models, and hyperparameter tuning for the experiments.

We organize the experimental setup into four parts: benchmarks and base models (Section D.1), the hyperparameter‑tuning protocol (Section D.2), the method‑specific search ranges with selected configurations (Section D.3), and the final unlearning runs and evaluation (Section D.4).

TOFU is a synthetic QA benchmark that fine‑tunes a base LLM on fictitious author biographies; we evaluate three forget splits (forget01, forget05, forget10) on two model sizes, Llama‑3.2‑1B‑Instruct and Llama‑3.1‑8B‑Instruct. RWKU targets real‑world knowledge about public figures; we use the 3.8 B‑parameter Phi‑3‑Mini‑4k‑Instruct and a fixed ten‑subject batch, supplementing the official code with retain‑set support.

Hyperparameter tuning proceeds in two stages. Stage 1 runs a coarse Bayesian search (via Optuna) on the largest forget split for each benchmark, fixing the learning‑rate schedule and only varying method‑specific coefficients. Stage 2 transfers the Stage 1 configuration and sweeps the learning rate on a log scale, selecting the best trial by $\text{ES}_{\Delta}$ for TOFU and by $N_{\Delta}$ for RWKU; SATIMP on RWKU uses the second‑best trial because the top trial under‑forgot.

**Table 5.** Top-3 Stage 2 learning-rate trials per method on RWKU, ranked by $N_{\Delta}$. Green shading marks the trial used for the final unlearning run: the top-$N_{\Delta}$ trial throughout, except for SATIMP, whose top trial was discarded for inadequate forgetting in favor of the second-best.

Questions & answers

How does ATWU identify forget-specific tokens without external supervision?

ATWU treats token weights as latent variables in a joint optimization problem and learns them directly from the interaction between the retain and forget objectives; because the model's hidden states already encode linguistic structure, the scorer identifies forget-specific tokens without any external annotations.

What is the 'retain conflict' concept used in ATWU?

Retain conflict is the key criterion for identifying forget-specific tokens: a token is deemed forget-specific if suppressing it does not force the model away from its retain-optimal state, meaning its removal targets only unwanted knowledge rather than general linguistic structure.

What forget loss does ATWU use and why?

ATWU uses a saturated negative cross-entropy loss (called SATGA+), where the learned token score is injected into both the weighting coefficient and the saturation exponent; this bounds the loss below, controls the target suppression probability via hyperparameter β, and smoothly attenuates uncertain tokens during early optimization before the scorer converges.

What benchmarks and models are used to evaluate ATWU?

ATWU is evaluated on TOFU (a synthetic QA benchmark with fictitious author biographies, using Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct across forget01, forget05, and forget10 splits) and RWKU (a real-world knowledge benchmark targeting public figures, using Phi-3-Mini-4k-Instruct on a fixed ten-subject batch).

What metrics are used to evaluate unlearning performance?

The paper uses Forget Quality (FQ, the relative reduction in worst-case forget-set judge score), Retain Degradation (RD, the relative loss in retain-set judge score), and Unlearning Quality (UQ = [FQ − RD]₊ = max(FQ − RD, 0), the net forget-retain trade-off clipped at zero). Auxiliary metrics include Extraction Strength (ES), judge-based robustness scores (JP, JICR, JW), MMLU accuracy, Repetitiveness, Win Rate, and ROUGE-L-based RWKU scores.
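A small worked example may make the three headline metrics concrete. It assumes that "relative reduction" means (before − after) / before and reads the bracket in the UQ formula as the positive part; the helper names and the judge scores are made up for illustration and are not results from the paper.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`; `before` must be positive."""
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return (before - after) / before

def unlearning_quality(fq, rd):
    """UQ as the positive part of FQ - RD, i.e. clipped at zero."""
    return max(fq - rd, 0.0)

# Hypothetical judge scores: original checkpoint vs. unlearned model.
fq = relative_reduction(before=0.90, after=0.18)  # worst-case forget-set score
rd = relative_reduction(before=0.80, after=0.72)  # retain-set score
uq = unlearning_quality(fq, rd)

print(f"FQ={fq:.2f} RD={rd:.2f} UQ={uq:.2f}")
```

The clipping means a method that damages the retain set more than it forgets scores zero rather than a negative value.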

What are the key results of ATWU?

ATWU achieves the best Unlearning Quality (UQ) on both the TOFU forget10 benchmark with Llama-3.1-8B-Instruct and the canonical RWKU ten-subject batch with Phi-3-Mini-4k-Instruct, while preserving utility close to the original checkpoint; competing methods such as JENSUN forget aggressively but collapse generation quality, and RMU does not match ATWU's consistent performance across all six metric panels.

How does ATWU compare to prior unlearning methods?

ATWU outperforms both sample-level token-agnostic methods and heuristic-based token weighting approaches (including SATGA alone) on the forget-retain trade-off; the paper notes that high-quality token scores alone are insufficient and that the adaptive unlearning objective is also necessary to bridge the gap to supervised oracle performance.

What are the limitations of ATWU?

The paper notes that computing the singleton conflict for every token is computationally intractable, motivating the relaxed joint optimization; it also acknowledges that the free-token-weight formulation is unscalable for large corpora and cannot generalize to unseen sequences, which is why a shared linear scorer is used instead. The paper does not extensively discuss failure modes or out-of-distribution generalization beyond the tested benchmarks.

How does TOFU differ from the RWKU benchmark?

TOFU evaluates forgetting at the token level, using extraction strength and LLM-as-judge robustness attacks on synthetic biographies of fictitious authors. RWKU instead applies ROUGE-L to cloze, QA, and adversarial probes targeting real-world knowledge about public figures, and serves mainly as a cheap tuning surrogate. The two benchmarks complement rather than replace each other.

What theoretical guarantee does ATWU provide for its relaxation?

Lemma 4.1 (Exactness of the relaxation) states that for sufficiently large penalty coefficients λ_H and λ_ρ, every global minimizer of the relaxed continuous objective over [0,1] is binary, satisfies the budget constraint, and is a global minimizer of the original combinatorial constrained problem.
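The two problems that the lemma relates can be written schematically as follows. This is a sketch under stated assumptions: F(z) is a placeholder for the token-weighted unlearning objective, T is the number of tokens, and the budget is written as an equality on the mean score; the paper's exact objective and normalization may differ.

```latex
% Schematic only: F, T, and the equality form of the budget are assumptions.
\begin{align}
\text{(combinatorial)} \quad
  & \min_{z \in \{0,1\}^{T}} F(z)
    \quad \text{s.t.} \quad \frac{1}{T}\sum_{t=1}^{T} z_t = \rho, \\
\text{(relaxed)} \quad
  & \min_{z \in [0,1]^{T}} F(z)
    + \lambda_H \sum_{t=1}^{T} H(z_t)
    + \lambda_\rho \Big( \frac{1}{T}\sum_{t=1}^{T} z_t - \rho \Big)^{2}, \\
  & H(z) = -z \log z - (1 - z) \log (1 - z).
\end{align}
```

In this form both penalty terms vanish exactly at binary, budget-feasible points, so the two objectives coincide there; the lemma's claim is that, for large enough λ_H and λ_ρ, no other point can be a global minimizer of the relaxed problem.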

How is the scorer in ATWU designed and initialized?

The scorer is a single linear projection over the LLM's hidden representations followed by a sigmoid, parameterized by a single vector w ∈ ℝ^d; it is initialized to the zero vector, so all tokens initially receive a uniform score of 1/2, making the objective begin token-agnostic and smoothly become selective during training.
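The effect of zero initialization can be checked with a few lines of plain Python. This is a minimal sketch assuming the score is sigmoid(w · h_t) with no bias term; the hidden states and the later weight vector are arbitrary toy values, not taken from the paper.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def token_scores(hidden_states, w):
    """Score each token as sigmoid(<w, h_t>) for its hidden state h_t."""
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h))) for h in hidden_states]

# Toy hidden states: 3 tokens, d = 4 (values are arbitrary).
hidden_states = [
    [0.2, -1.0, 0.5, 3.0],
    [1.5, 0.0, -0.7, 0.1],
    [-2.0, 0.4, 0.9, -0.3],
]

w_init = [0.0] * 4               # zero initialization, as described above
w_later = [0.0, 0.0, 0.0, 2.0]   # hypothetical weights after some training

print(token_scores(hidden_states, w_init))   # every token gets 0.5
print(token_scores(hidden_states, w_later))  # scores now differ per token
```

With w set to zero every logit is zero, so the forget loss starts out weighting all tokens equally; any non-zero w makes the scores depend on the hidden state.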

What hyperparameter tuning procedure does ATWU use?

Hyperparameter tuning proceeds in two stages: Stage 1 runs a coarse Bayesian search via Optuna on the largest forget split for each benchmark, varying method-specific coefficients; Stage 2 transfers the Stage 1 configuration and sweeps the learning rate on a log-scale, selecting the best trial by ES∆ for TOFU and by N∆ for RWKU.

Do the ablation studies confirm that the learned scorer is the key driver of ATWU's gains?

Yes; the ablation studies demonstrate that ATWU's gains stem from the learned scorer's ability to focus the forget update on knowledge-bearing tokens, rather than from any particular loss formulation, and that the scorer can be combined with different forget losses while still improving performance.

Who are the authors of this paper and where was it published?

The provided text does not name the authors or the publication venue; the paper is available on arXiv at https://arxiv.org/abs/2606.06320.

Key terms

Machine unlearning: The process of removing specific knowledge or data from a trained model without retraining it from scratch, while preserving its general capabilities.
ATWU (Alternating Token-Weighted Unlearning): The method introduced in this paper that alternates between updating LLM parameters and a token scorer to selectively suppress only knowledge-bearing tokens during unlearning.
Forget-specific tokens: Tokens in a forget sample that actually encode the targeted knowledge to be removed, as opposed to structural tokens like function words that are irrelevant to the unlearning target.
Retain conflict: A measure of how much suppressing a given token forces the model away from its retain-optimal state; tokens with low retain conflict are considered forget-specific.
Token-level forget-specificity: A per-token score indicating how much a token contributes to the targeted knowledge to be forgotten, as opposed to general linguistic structure.
SATGA (Saturated Gradient Ascent): A forget loss variant using saturated negative cross-entropy, where the loss is bounded below to prevent over-forgetting, introduced by Wang et al. (2025b).
SATGA+ (Score-Modulated SATGA): ATWU's modification of SATGA that injects the learned token score into both the weighting coefficient and the saturation exponent, smoothly attenuating uncertain tokens during training.
Forget Quality (FQ): A normalized metric measuring the relative reduction in the worst-case forget-set judge score compared to the original checkpoint.
Retain Degradation (RD): A normalized metric measuring the relative loss in retain-set judge score compared to the original checkpoint, indicating how much general capability is harmed by unlearning.
Unlearning Quality (UQ): The combined metric defined as [FQ − RD]₊ = max(FQ − RD, 0), representing the net forget-retain trade-off (clipped at zero) and used for final method ranking.
Extraction Strength (ES): A metric measuring how many answer tokens can be recovered from the unlearned model by feeding progressively longer prefixes of the ground-truth answer.
TOFU: A synthetic QA benchmark that fine-tunes a base LLM on fictitious author biographies and evaluates unlearning across multiple forget splits using token-level and judge-based metrics.
RWKU: A real-world knowledge unlearning benchmark that targets knowledge about public figures and evaluates forgetting using ROUGE-L on cloze, QA, and adversarial probes.
Lagrangian relaxation: A mathematical technique that moves constraints into the objective as weighted penalty terms; in ATWU it is paired with relaxing binary token weights to the interval [0,1], turning a constrained combinatorial problem into a continuous one.
Binary entropy (H): A regularization term H(z) = −z log z − (1−z) log(1−z) that pushes continuous token scores toward binary (0 or 1) values during optimization.
Structural probing: A research approach that tests whether simple linear classifiers applied to LLM hidden states can recover linguistic properties, motivating the use of hidden states as a representation space for the token scorer.
Alternating optimization: An iterative procedure that optimizes one set of parameters (here, the LLM weights) while holding the other (the scorer weights) fixed, then swaps the roles, and repeats.
Budget constraint (ρ): A hyperparameter controlling the target fraction of tokens to be labeled as forget-specific, enforced via a quadratic penalty in the ATWU objective.
LLM-as-judge: An evaluation approach that uses a large language model to assess the quality of another model's outputs, used here for robustness metrics like paraphrase score and in-context relearning score.
MMLU (Massive Multitask Language Understanding): A benchmark measuring a model's accuracy across a wide range of academic subjects, used here as a proxy for retained world knowledge after unlearning.
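
The Extraction Strength entry above can be made concrete with a toy example. The sketch follows one common formulation from the memorization literature (one minus the shortest answer-prefix fraction from which the model reproduces the rest exactly); the paper's precise definition may differ, and `generate`, `toy_generate`, and the token lists are hypothetical stand-ins for a real model and tokenizer.

```python
def extraction_strength(generate, question, answer):
    """1 - (shortest prefix fraction from which the rest is reproduced exactly)."""
    n = len(answer)
    if n == 0:
        raise ValueError("answer must contain at least one token")
    for k in range(n):  # try progressively longer ground-truth prefixes
        prefix, suffix = answer[:k], answer[k:]
        if generate(question, prefix, len(suffix)) == suffix:
            return 1.0 - k / n
    return 0.0  # nothing was recovered from any proper prefix

ANSWER = ["born", "in", "Paris", "in", "1970"]

def toy_generate(question, prefix, max_new_tokens):
    """Hypothetical model: recalls the answer only once given two of its tokens."""
    if len(prefix) >= 2 and prefix == ANSWER[:len(prefix)]:
        return ANSWER[len(prefix):][:max_new_tokens]
    return []

es = extraction_strength(toy_generate, ["Where", "was", "X", "born", "?"], ANSWER)
print(es)
```

A model that needs more of the answer before it can complete the rest gets a lower score, so successful unlearning should push this value toward zero on the forget set.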
