LLMs Can Leak Training Data but Do They Want to? A Propensity-Aware Evaluation of Memorization in LLMs

Distinguishing between an LLM's ability to leak training data under attack and its propensity to do so in ordinary use.

Do LLMs leak training data during ordinary use, or only when specifically prompted to do so?

Current memorization evaluations rely on adversarial prefix attacks, which measure a model's maximum capability to reproduce training data but fail to predict how often it leaks information during normal, non-adversarial use. The authors introduce PROPME, a framework that contrasts adversarial capability with non-adversarial propensity, and SIMPLETRACE, a tool that deterministically attributes model outputs to training corpora. Results show that while models can be forced to leak data, their propensity to do so under ordinary prompting is consistently low, suggesting that standard adversarial benchmarks significantly overstate practical leakage risks.

Paper Primer

The paper addresses the gap between "capability" (what a model can be forced to output) and "propensity" (what a model tends to output). By comparing generic, non-adversarial prompts against targeted prefix attacks, the authors quantify the likelihood of data leakage in realistic deployment scenarios.

The core mechanism is a propensity-aware transformation: it maps standard memorization metrics into a normalized score that penalizes high capability if the corresponding propensity is low. This is powered by SIMPLETRACE, a high-speed tracing pipeline that uses suffix-array indexing to perform deterministic attribution of generated text back to the original training corpus.

Adversarial prefix attacks consistently elicit significantly higher memorization signals than non-adversarial prompts.

Across all models, prefix attacks yielded average longest span lengths (ALS) of ~50 tokens, compared to ~28 tokens for generic prompts, with a 36x difference in near-verbatim recall (NVR) observed in some settings. The gap between capability and propensity is substantial, with propensity scores remaining well below the neutral 0.5 threshold in almost all non-adversarial settings.

Continual pre-training on new data reduces the accessibility of previously memorized content.

The DFM Decoder model, which was continually pre-trained from the Comma model, exhibited lower memorization of the original Common Pile dataset compared to its parent. The reduction in memorization capability was consistent across later training stages, confirming that data mixture shifts can attenuate earlier memorization.

Why is it insufficient to just measure "memorization" as a single value?

Memorization is not a binary state but a spectrum; measuring only the worst-case extractability (capability) overstates the risk of leakage in ordinary use, which is the primary concern for legal and safety compliance.

Does low propensity mean a model is "safe" from leaking training data?

No; the authors emphasize that low propensity does not imply the absence of memorization. Specific, non-adversarial prompts can still occasionally recover training data, meaning propensity evaluation should complement, not replace, adversarial testing.

Memorization audits must report both worst-case capability and ordinary-use propensity to provide a realistic assessment of leakage risk. Future evaluations should adopt this dual-setting approach to avoid overestimating the practical dangers of model memorization.

Introduction: The Memorization Gap

We define memorization capability and propensity, exposing the gap between forced and ordinary data leakage.

Current memorization evaluations ask whether models can be forced to leak training data, but they do not ask whether models actually do so during normal use. This leaves an unmeasured gap between what a model can reproduce (capability) and what it does reproduce (propensity). Bridging this gap is essential for realistic risk assessment.

Capability measures the maximum amount of training data a model can be made to output when an adversary designs prompts to extract it.

Propensity quantifies how likely a model is to reproduce training data under ordinary, non‑adversarial prompts.

Prior work has largely treated memorization as a capability problem, employing adversarial attacks such as membership inference, prefix attacks, and resource‑referencing prompts. While these studies demonstrate that models can reproduce copyrighted text and personal identifiers when provoked, they leave open the question of how often such leakage occurs without adversarial prompting.

To fill this gap we introduce PROPME, a propensity‑aware evaluation framework, and SIMPLETRACE, a lightweight tracing pipeline that deterministically links model generations to their source documents. Together they enable a principled comparison of capability‑focused and propensity‑focused settings across models, datasets, and languages.

The key distinction is between what a model can memorize when forced and what it does memorize during ordinary use.

The PROPME Framework

We detail the PROPME evaluation framework and the SIMPLETRACE pipeline that measures propensity and capability.

To compare a model’s natural tendency to leak data (propensity) with its maximal leakage ability (capability), we introduce a two‑setting evaluation and a fast tracing pipeline.

PROPME quantifies the gap between a model’s ordinary‑use leakage (propensity) and its worst‑case leakage (capability) by normalising the two measurements into a single score.

How does the PROPME score differ from simply reporting the capability metric alone?

PROPME explicitly accounts for the model’s behaviour under ordinary prompts; a high capability score is down‑weighted if the same model shows little leakage in realistic use, whereas a raw capability number would ignore that contrast.
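The summary does not give the exact formula for the propensity-aware transformation. It only states that the score is normalized, down-weights high capability when propensity is low, and has a neutral point at 0.5. As a hypothetical illustration (not the paper's definition), one function with those properties is the share of the combined base-metric signal that comes from the non-adversarial setting:

```python
def propensity_score(f_propensity: float, f_capability: float) -> float:
    """Hypothetical propensity-aware score in [0, 1]; NOT the paper's formula.

    f_propensity: base metric f_b measured under non-adversarial prompts.
    f_capability: the same base metric measured under prefix attacks.
    Returns the share of the combined signal that comes from ordinary prompting:
    0.5 when both settings leak equally, near 0 when leakage appears only under attack.
    """
    if f_propensity < 0 or f_capability < 0:
        raise ValueError("base metrics must be non-negative")
    total = f_propensity + f_capability
    if total == 0:
        # No leakage in either setting: the ratio is undefined, so report the neutral point.
        return 0.5
    return f_propensity / total


# Illustrative, made-up values: ordinary prompts recover a tenth of what prefix attacks do.
print(propensity_score(0.05, 0.50))  # about 0.091
```

Under this assumed form, a model with high capability but low propensity lands well below 0.5, whereas reporting the capability value alone would hide that contrast.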

SIMPLETRACE extracts and aggregates exact‑match n‑gram spans from model generations, then computes a suite of memorization statistics that serve as the base metric $f_b$ for PROPME.

Why does SIMPLETRACE filter spans by unigram rarity instead of just keeping the longest matches?

Rare unigrams are less likely to appear by chance; keeping only the $K$ rarest spans removes spurious matches from high‑frequency boilerplate (e.g., common code headers), yielding a cleaner signal of genuine memorization.

Step 1 – Maximal span extraction: iterate over all $L\!-\!1$ suffixes of a generated token sequence of length $L$, query the suffix array, and keep the longest verbatim prefix that respects word boundaries.

Step 2 – Unigram rarity filtering: compute the joint unigram probability of each span, retain the $K = \lceil 0.05 \cdot L \rceil$ rarest spans, and discard the rest.

Step 3 – Document retrieval: for each retained span, look up matching training documents, classifying matches as full raw, full normalized, or partial span‑level.

Step 4 – Span merging and aggregation: greedily merge overlapping spans into non‑redundant segments, then aggregate statistics across the batch.
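Steps 1, 2, and 4 can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not SIMPLETRACE's code: tokens are whitespace-split words (so word boundaries are automatic), a brute-force scan stands in for the suffix-array query, `min_len=2` is a hypothetical cutoff, and Step 3 (document retrieval) is omitted.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) token positions in the generation


def occurs_in(corpus: List[str], query: List[str]) -> bool:
    """Brute-force contiguous match; a stand-in for a suffix-array lookup."""
    n = len(query)
    return any(corpus[i:i + n] == query for i in range(len(corpus) - n + 1))


def maximal_spans(gen: List[str], corpus: List[str], min_len: int = 2) -> List[Span]:
    """Step 1: from each start position, the longest verbatim match in the corpus."""
    spans = []
    for start in range(len(gen)):
        end = start
        # Stop at the first extension that is absent: any longer one is absent too.
        while end < len(gen) and occurs_in(corpus, gen[start:end + 1]):
            end += 1
        if end - start >= min_len:
            spans.append((start, end))
    return spans


def rarest_spans(gen: List[str], spans: List[Span], unigram_prob: Dict[str, float]) -> List[Span]:
    """Step 2: keep the K = ceil(0.05 * L) spans with the lowest joint unigram probability."""
    k = math.ceil(0.05 * len(gen))

    def log_joint(span: Span) -> float:
        # Summing log-probabilities avoids underflow for long spans.
        return sum(math.log(unigram_prob[t]) for t in gen[span[0]:span[1]])

    return sorted(spans, key=log_joint)[:k]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Step 4: greedily merge overlapping spans into non-redundant segments."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


corpus = "all rights reserved . the heron glided over the marsh at dawn".split()
gen = "i read that the heron glided over the marsh today".split()
counts = Counter(corpus)
unigram_prob = {w: c / len(corpus) for w, c in counts.items()}

spans = maximal_spans(gen, corpus)
kept = rarest_spans(gen, spans, unigram_prob)
for start, end in merge_spans(kept):
    print(end - start, " ".join(gen[start:end]))
```

Here Step 1 yields five overlapping candidates, the rarity filter keeps K = 1 of them, and the run prints a single 6-token span, "the heron glided over the marsh".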

As a toy example, Step 1 finds a maximal verbatim span “the quick brown fox jumps” (5 tokens) that appears in the training corpus.

Step 2 computes its unigram probability $p = 1.2 \times 10^{-6}$, which is the lowest among all candidate spans, so it is retained.

Step 3 retrieves the training document containing that span and classifies it as a full raw match.

Step 4 has nothing to merge, since no other spans were retained, and records a memorization length of 5 tokens for this generation.

This toy run shows how the rarity filter discards many common phrases, leaving only a highly informative match that directly contributes to the final memorization statistics.

**Figure 1.** **Left:** PROPME framework overview with propensity and capability prompts, back-tracing to the full training set, and memorization/propensity measurements. **Right:** propensity metric results for different combinations of models and datasets, indicating the propensity of a given model to leak data from a given dataset. The metrics used are defined and detailed in Sections 2, 3.2, and 4.3.

Experimental Setup

We describe datasets, models, prompting settings, and evaluation metrics used in our experiments.

Infini‑gram is an unbounded $n$‑gram index that can retrieve sequences of any length, avoiding the fixed‑window limitation of traditional $n$‑gram tables.

How does Infini‑gram differ from a conventional fixed‑order $n$‑gram index?

A fixed $n$‑gram index only stores substrings of a predetermined length $n$, so queries longer than $n$ must be approximated. Infini‑gram instead builds a suffix array over the corpus, which implicitly indexes every substring, so a query of any length can be answered exactly without approximation.
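The suffix-array idea behind exact queries of any length can be sketched in a few lines. This is a toy, in-memory illustration over word tokens, not Infini-gram's implementation or API: suffix start positions are sorted lexicographically, and two binary searches find the block of suffixes that begin with the query.

```python
from typing import List


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Sort every suffix start position by the lexicographic order of its suffix.

    Naive construction for illustration only; it copies each suffix and would not
    scale to corpora with hundreds of billions of tokens.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_occurrences(tokens: List[str], sa: List[int], query: List[str]) -> int:
    """Exact count of `query` (any length) via two binary searches over the suffix array."""
    m = len(query)

    def prefix(rank: int) -> List[str]:
        # First m tokens of the suffix at this rank; truncation preserves sorted order.
        return tokens[sa[rank]:sa[rank] + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose prefix is >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(sa)
    while lo < hi:  # first rank whose prefix is > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


corpus = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, ["the", "cat"]))  # 2
```

Because each query costs only $O(m \log N)$ comparisons regardless of $m$, no window size has to be fixed in advance, which is the property the answer above relies on.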

Common Pile is a large English corpus of openly licensed text (≈521 GB, 463.6 B tokens) and the pre‑training corpus of the Comma model.

Why is the Common Pile split into three balanced shards?

Sharding balances I/O load across the 128 CPU cores and fits the 350 GB memory budget, allowing each shard to be processed in parallel without exceeding RAM limits.
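The summary does not say how the three shards are balanced. One common heuristic, shown here as a hypothetical sketch with made-up file names and sizes, is greedy longest-processing-time assignment: place files largest-first, each into the shard with the smallest running total.

```python
import heapq
from typing import Dict, List


def balance_shards(file_sizes: Dict[str, int], n_shards: int = 3) -> List[List[str]]:
    """Greedy longest-processing-time heuristic: place files largest-first,
    each into the shard with the smallest running total size."""
    heap = [(0, shard_id) for shard_id in range(n_shards)]  # (bytes assigned, shard id)
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(n_shards)]
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, shard_id = heapq.heappop(heap)
        shards[shard_id].append(name)
        heapq.heappush(heap, (total + size, shard_id))
    return shards


# Hypothetical file sizes in GB.
sizes = {"part-a": 50, "part-b": 40, "part-c": 30, "part-d": 20, "part-e": 10}
for shard in balance_shards(sizes):
    print(shard, sum(sizes[f] for f in shard))
```

With these sizes each shard ends up holding 50 GB, so no single worker pool has to hold a disproportionate share of the index in memory.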

Dynaword is a Danish‑language dataset (≈5.66 M samples, 6.83 B tokens, 10.5 GB) used to evaluate cross‑lingual memorization behavior.

Does Dynaword’s smaller size affect the indexing strategy?

Not substantially. Dynaword fits comfortably within an 84 GB memory budget, so the same Infini‑gram pipeline applies without altering shard counts; its shorter processing time (≈3 h) simply reflects its smaller scale.

Index the Common Pile and Dynaword corpora with Infini‑gram, using the resource allocations listed above.

Use Comma v0.1, a model trained on the Common Pile corpus, as the base model.

Continually pre‑train DFM Decoder Open v0.3 over 30 B tokens in three stages, mixing two‑thirds Dynaword and one‑third Common Pile.

Generate three prompt sets (Generic, Specific, Prefix) of 100 samples each; Generic and Specific via GPT‑5.5, Prefix by truncating random training examples.

Validate prompt overlap with SIMPLETRACE to ensure the non‑adversarial sets have lower training‑data overlap than the Prefix set.

Run each model on the three prompt sets using temperature 0 and greedy decoding; collect generations.

Compute ALS, FMR, and NVR with SIMPLETRACE, then apply the propensity‑aware transformation to FMR and NVR.
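The Prefix set in the steps above is built by truncating random training examples. A minimal sketch, assuming documents are already tokenized; `prefix_len=32` and the fixed seed are hypothetical choices, since the summary gives only the set size of 100:

```python
import random
from typing import List, Tuple


def make_prefix_prompts(
    docs: List[List[str]],
    n_prompts: int = 100,
    prefix_len: int = 32,  # hypothetical; the summary does not give the truncation length
    seed: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Sample training documents and split each into (prompt prefix, reference continuation)."""
    rng = random.Random(seed)
    # Only documents longer than the prefix leave a continuation to trace against.
    eligible = [d for d in docs if len(d) > prefix_len]
    if len(eligible) < n_prompts:
        raise ValueError("not enough documents longer than prefix_len")
    return [(d[:prefix_len], d[prefix_len:]) for d in rng.sample(eligible, n_prompts)]
```

Keeping the held-out continuation alongside each prefix makes it possible to check a greedy generation against the exact text the model saw in training.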

**Table.** Performance metrics for different query types.

Empirical Results

We report how memorization varies across prompts, models, and training stages, highlighting the dominance of prefix attacks.

Prefix attacks elicit the strongest memorization, achieving an average verbatim span of 50.35 tokens.

Table 1 reports an ALS of 50.35 tokens for the prefix setting in the Comma model, versus 27.95 (generic) and 29.47 (specific).
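To put the Table 1 values in proportion, the ratios of prefix to non-adversarial ALS can be computed directly from the reported numbers:

```python
# ALS (tokens) for the Comma model, as reported in Table 1.
als = {"generic": 27.95, "specific": 29.47, "prefix": 50.35}

for setting in ("generic", "specific"):
    print(f"prefix vs {setting}: {als['prefix'] / als[setting]:.2f}x longer spans")
```

This prints 1.80x for generic and 1.71x for specific prompts. These ratios concern span length; the much larger gaps reported for near-verbatim recall are a separate metric.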

Both non‑adversarial settings produce much shorter spans, indicating that ordinary prompting elicits considerably less verbatim reproduction than prefix attacks.

**Figure 7.** Propensity scores for Common Pile across training stages of the DFM Decoder model.

**Figure 10.** Span length distributions for Common Pile (Comma model) across generic, specific, and prefix prompt settings.

Related Work

We survey prior work on memorization evaluation, distinguishing capability from propensity.

We introduce PROPME, the first framework for propensity‑aware evaluation of memorization in large language models. It spans settings from pure propensity measurement to capability‑focused evaluation, enabling a direct comparison between what a model can be made to reproduce and what it tends to reproduce.

Prior work on memorization can be organized along two axes: the type of target model (closed/commercial versus open) and the measurement method (internal signals such as activations, weights, or output probabilities versus external text comparisons). These axes give a coarse taxonomy of existing studies.

Within this taxonomy, research either detects whether a sequence appeared in training or extracts it via adversarial prompting. Detection treats memorization as a binary property, while extraction measures the ability to reproduce training data under targeted conditions.

Recent literature distinguishes capability—behaviours a model can exhibit when successfully elicited—from propensity—behaviours a model tends to exhibit under typical contexts. Most memorization evaluations focus on capability, ignoring how often models spontaneously reproduce training data.

A variety of token‑level metrics have been proposed, including verbatim memorization length, fraction of extractable sequences, longest common subsequence, and near‑verbatim recall (nv‑recall). These metrics differ in how they aggregate matches and the thresholds they impose.

Tracing tools such as Infini‑gram and OLMoTrace enable large‑scale identification of training data that appears in model generations. Building on this infrastructure, SIMPLETRACE adds an indexing step, unigram precomputation, multi‑worker parallelization, and a dedicated metrics aggregation pipeline, and is released as open‑source.

Limitations and Caveats

We discuss the scope limits, ethical implications, and future directions of our memorization study.

Our focus on direct comparison against full training corpora yields high measurement accuracy but cannot be applied to models whose training data is not publicly available. The propensity transformation and the PROPME framework are nevertheless architecture‑agnostic and can be combined with logit‑, weight‑, or probability‑based memorization methods when training data access is unavailable. Our experiments cover a single model family (four checkpoints from two models, three of which are continual pre‑training stages of the fourth) and two languages.

Extending the analysis to broader model architectures and additional languages would help clarify how architectural choices and multilingual training interact with memorization propensity. Finally, our results leave open the question of how data mixture composition affects memorization: it remains unclear whether mixing same‑language data produces effects comparable to those observed here under cross‑lingual mixtures of Dynaword and Common Pile.

All experiments use models trained exclusively on open, permissively licensed data and intended for research use. Our findings confirm that adversarial elicitation can surface memorized content even when propensity under ordinary prompting is low, underscoring the importance of capability‑level evaluation alongside propensity‑level assessment. We release SIMPLETRACE as open‑source to support transparent and reproducible research.

While any tool enabling output‑to‑training‑data tracing could in principle be misused, we believe the accountability benefits outweigh this risk, particularly given that the tool requires full access to the training corpus. Lower memorization propensities should not be used to “green‑wash” potential copyright infringement problems. Yet, we envision that understanding memorization propensities could be one of several factors for informing copyright law in the future.

The research was supported in part by the Danish Foundation Models project, funded by the Danish government. This research was further supported in part by the MIST project, funded by the Novo Nordisk Foundation under grant reference number NNF25OC0103204. Part of the computation done for this project was performed on the UCloud interactive HPC system managed by the eScience Center at the University of Southern Denmark.

Metric Validation

We validate SIMPLETRACE on Common Pile and Dynaword, showing near‑perfect retrieval.

We validate SIMPLETRACE in two stages: controlled unit tests on a dummy index and end‑to‑end checks on real corpora. For each corpus we sample 25 documents, issue one full‑document query and three 128‑token partial queries (start, middle, end), and count a query as a pass if the original source document is retrieved or an exact span covering the query text is returned. The unit tests confirm deterministic span recovery and correct summary statistics, while the end‑to‑end runs assess retrieval and attribution on Common Pile and Dynaword.
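The end-to-end protocol can be sketched as follows. This is a hypothetical harness, not SIMPLETRACE's test suite: `Lookup` and `brute_force_lookup` are illustrative stand-ins for the real index, which returns retrieved document identifiers and whether an exact covering span was found.

```python
from typing import Callable, Dict, List, Set, Tuple

QUERY_LEN = 128  # tokens per partial query, as in the validation protocol above

Lookup = Callable[[List[str]], Tuple[Set[str], bool]]  # -> (retrieved doc ids, exact span found)


def validation_queries(doc: List[str]) -> List[Tuple[str, List[str]]]:
    """One full-document query plus partial queries from the start, middle, and end."""
    mid = max(0, (len(doc) - QUERY_LEN) // 2)
    return [
        ("full", doc),
        ("partial-start", doc[:QUERY_LEN]),
        ("partial-middle", doc[mid:mid + QUERY_LEN]),
        ("partial-end", doc[-QUERY_LEN:]),
    ]


def pass_rate(docs: Dict[str, List[str]], lookup: Lookup) -> float:
    """A query passes if its source document is retrieved or an exact covering span is returned."""
    passes = total = 0
    for doc_id, tokens in docs.items():
        for _kind, query in validation_queries(tokens):
            retrieved_ids, exact_span_found = lookup(query)
            passes += int(doc_id in retrieved_ids or exact_span_found)
            total += 1
    return passes / total


def brute_force_lookup(docs: Dict[str, List[str]]) -> Lookup:
    """Toy stand-in for the real index: scan every document for the query."""
    def lookup(query: List[str]) -> Tuple[Set[str], bool]:
        n = len(query)
        hits = {
            doc_id
            for doc_id, toks in docs.items()
            if any(toks[i:i + n] == query for i in range(len(toks) - n + 1))
        }
        return hits, bool(hits)
    return lookup
```

Counting a query as passed when either condition holds is what lets a retrieval cap cause a missed document identifier without lowering the pass rate, since the exact span can still be found.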

**Table 5.** End-to-end validation of SIMPLETRACE on Common Pile. Results are computed over 25 sampled documents, with one full-document query and three partial queries (start, middle, end) per document.

On Dynaword the pipeline achieves perfect scores: both source‑document retrieval and exact‑match rates are 1.00 for all query types. For Common Pile the only deviation is the partial‑start retrieval (0.96), yet the pass rate remains 1.00 because the exact span is still found elsewhere. The single missed document identifier arises from duplicate texts filling the current limit of ten retrieved documents, suggesting that raising this cap would resolve the issue.

Prompt Validation Analysis

Detailed ablation analyses of memorization stability across training stages and prompt settings.

Prompt validation is reported in Figure 2. The overlap metrics rise monotonically from generic to specific to prefix prompts, cleanly separating propensity from capability scenarios.

**Figure 2.** Overlap between prompts and datasets across all prompt settings (Dynaword top, Common Pile bottom). (a) Average near-verbatim recall between prompts and Dynaword. (b) Fraction of prompts verbatim matched in Dynaword. (c) Average near-verbatim recall between prompts and Common Pile. (d) Fraction of prompts verbatim matched in Common Pile.

Memorization on Dynaword is examined across three training stages. Figures 3, 4, and 5 present SIMPLETRACE metrics, propensity scores, and span‑length histograms respectively.

Overall, Dynaword memorization is established early and does not intensify with continued pre‑training, matching prior observations that later‑stage data dominate memorization.

Analogous analyses for Common Pile are shown in Figures 6, 7, and 8. The three training stages again exhibit near‑identical SIMPLETRACE and propensity values.

**Figure a.** Average near-verbatim recall per prompt setting and training stage.

**Figure 4.** Propensity scores for Dynaword across training stages of the DFM Decoder model.

**Figure 5.** Span length distributions for Dynaword across training stages and prompt settings.

Thus, for both corpora, memorization profiles are stable across the three training checkpoints, indicating that further continual pre‑training on the same Dynaword–Common Pile mixture does not alter the underlying memorization behavior.

Figure 11 contrasts the two corpora: Common Pile spans shift toward longer fragments, while Dynaword concentrates on shorter spans, especially under non‑adversarial prompts.

**Figure 8.** Span length distributions for Common Pile across training stages and prompt settings.

Figure 9 revisits Dynaword span‑length histograms for the main experiments (Stages 1, 2, Final). The pattern matches the appendix: generic and specific prompts stay short, prefix attacks broaden the distribution.

**Figure.** Comparison of metrics across generic, specific, and prefix categories for Stage 1, Stage 2, and Final series.

Discussion and Conclusion

We compare forced memorization capability with ordinary prompting propensity and note how training data shifts affect leakage.

The paper argues that existing memorization tests measure only forced leakage (capability) and miss how often models leak data under normal use (propensity).

Capability evaluations produce roughly twice the recall of propensity evaluations.

Prefix attacks consistently elicit higher near‑verbatim recall, more full‑generation matches, and longer verbatim spans than generic prompts.

Memorization levels remain stable across later DFM training stages, showing that continued training on the same mixture does not monotonically increase accessible memorization.

Training on a partially different corpus (the DFM Decoder) reduces accessible memorization of earlier data, confirming that data‑mixture changes can lower leakage.

Prompt Settings Details

Appendix C lists prompt generation instructions and detailed memorization metrics.

Sections E.1 and E.2 specify how to generate start‑of‑sentence prompts for each domain, first by creating ten generic prompts per domain in JSONL format, then by crafting ten domain‑specific prompts without copying from the dataset. Section F lists extra memorization measures such as k‑eidetic memorization, near‑duplication count, ROUGE‑L, and token accuracy, each quantifying a different aspect of data leakage. Section H defines the SIMPLETRACE per‑document and corpus‑level metrics—including $nv\_recall$, $nv\_matched\_words$, $nv\_reference\_words$, $min\_span\_length$, $max\_span\_length$, $nv\_candidate\_words$, $nv\_missing\_words$, $nv\_additional\_words$, $total\_generations$, $generations\_with\_spans$, $n\_token\_span\_ratio$, $generations\_with\_n\_token\_span\_ratio$, $generations\_full\_matches\_ratio$, $generations\_full\_normalized\_matches\_ratio$, $total\_spans$, $total\_docs$, $average\_longest\_span\_length$, $unique\_total\_docs$, $generations\_above\_nv\_recall\_threshold$, $generations\_above\_nv\_recall\_threshold\_ratio$, $docs\_above\_nv\_recall\_threshold$, $spans\_length\_counts\_distribution$, $spans\_length\_distribution$, $full\_exact\_matches$, $unique\_full\_matches$, $unique\_full\_matches\_ratio$, $full\_normalized\_matches$, $unique\_full\_normalized\_matches$, $unique\_full\_normalized\_matches\_ratio$, $partial\_matches$, $unique\_partial\_matches$, $avg\_nv\_recall$, $max\_nv\_recall$, $docs\_with\_nv\_recall$, $total\_nv\_matched\_words$, $generations\_with\_nv\_recall$, $generations\_with\_nv\_recall\_ratio$, and $nv\_recall\_threshold$, providing a comprehensive view of memorization behavior.
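The summary lists metric names but not their formulas. The names suggest that $nv\_recall$ relates $nv\_matched\_words$ to $nv\_reference\_words$ and that $average\_longest\_span\_length$ averages per-generation longest spans. A hypothetical sketch under those assumptions, using Python's `difflib` matching blocks as a stand-in for whatever word alignment SIMPLETRACE actually uses:

```python
from difflib import SequenceMatcher
from typing import List


def nv_recall(candidate: str, reference: str) -> float:
    """Hypothetical near-verbatim recall: matched reference words / total reference words."""
    cand, ref = candidate.split(), reference.split()
    if not ref:
        return 0.0
    # autojunk=False keeps frequent words from being silently ignored on long inputs.
    matcher = SequenceMatcher(None, cand, ref, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


def average_longest_span_length(longest_spans: List[int]) -> float:
    """Mean, over generations, of each generation's longest verbatim span (tokens)."""
    return sum(longest_spans) / len(longest_spans) if longest_spans else 0.0


print(nv_recall("the quick brown cat jumps", "the quick brown fox jumps"))  # 0.8
```

Under this reading, a single substituted word costs recall only for that word, which is what distinguishes a near-verbatim measure from an exact-match ratio.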

Questions & answers

What is the main contribution of this paper?

The paper introduces PROPME, a framework that contrasts adversarial memorization capability with non-adversarial memorization propensity, and SIMPLETRACE, a deterministic tracing pipeline that attributes model outputs to training corpora, enabling a more realistic assessment of data leakage risk in large language models.

What problem does PROPME address?

PROPME addresses the gap between what a model can be forced to reproduce (capability) and what it actually tends to reproduce during normal use (propensity), since existing adversarial-prefix-based evaluations measure only worst-case extractability and overstate practical leakage risk for legal and safety compliance purposes.

Why is measuring only adversarial memorization capability insufficient?

Memorization is a spectrum rather than a binary state, and measuring only worst-case extractability ignores how rarely leakage occurs under ordinary, non-adversarial prompting, which is the scenario most relevant to real-world deployment and compliance.

How does the PROPME framework work?

PROPME applies a propensity-aware transformation that maps standard memorization metrics into a normalized score, down-weighting high capability scores when the corresponding propensity under ordinary prompts is low, enabling direct comparison across models, datasets, and languages.

What is SIMPLETRACE and how does it work?

SIMPLETRACE is a high-speed tracing pipeline that uses suffix-array indexing and Infini-gram to deterministically attribute model-generated text back to source documents in the training corpus; it adds unigram precomputation, multi-worker parallelization, and a metrics aggregation pipeline on top of existing infrastructure like Infini-gram and OLMoTrace, and is released as open-source.

Why does SIMPLETRACE filter spans by unigram rarity?

Rare unigrams are less likely to appear by chance, so keeping only the K rarest spans removes spurious matches from high-frequency boilerplate such as common code headers, yielding a cleaner signal of genuine memorization.

What datasets and models were used in the experiments?

The experiments use two corpora, Common Pile (requiring a 350 GB memory budget, split into three balanced shards across 128 CPU cores) and Dynaword (fitting within an 84 GB memory budget, indexed in approximately 3 hours), and two models trained on open, permissively licensed data: Comma v0.1, trained on Common Pile, and DFM Decoder Open v0.3, continually pre-trained from Comma on a mixture of two-thirds Dynaword and one-third Common Pile and evaluated at three checkpoints (Stages 1, 2, and Final).

What are the key empirical results?

Non-adversarial prompting produces substantially lower memorization spans than adversarial prefix attacks across all settings, confirming that propensity to leak data under ordinary prompting is consistently low; memorization profiles are also stable across the three training checkpoints for both corpora.

How does continued pre-training affect memorization propensity?

Memorization levels remain stable across later training stages for both Dynaword and Common Pile, showing that continued training on the same data mixture does not monotonically increase accessible memorization; additionally, training on a partially different corpus (the DFM Decoder) reduces accessible memorization of earlier data.

Does low propensity mean a model is safe from leaking training data?

No; the authors emphasize that low propensity does not imply the absence of memorization, since specific non-adversarial prompts can still occasionally recover training data, meaning propensity evaluation should complement rather than replace adversarial capability testing.

How does SIMPLETRACE compare to prior tracing tools like Infini-gram and OLMoTrace?

SIMPLETRACE builds on Infini-gram and OLMoTrace infrastructure but adds an indexing step, unigram precomputation, multi-worker parallelization, and a dedicated metrics aggregation pipeline, and is released as open-source. Infini-gram itself differs from fixed-order n-gram indexes in that its suffix array supports exact queries for substrings of any length.

How was SIMPLETRACE validated?

Validation used controlled unit tests on a dummy index and end-to-end checks on real corpora, sampling 25 documents per corpus with one full-document query and three 128-token partial queries each; on Dynaword, source-document retrieval and exact-match rates were both 1.00, and on Common Pile the only deviation was a partial-start retrieval rate of 0.96, with a pass rate still at 1.00.

What are the limitations of the PROPME framework?

The study's reliance on direct comparison against full training corpora limits it to models whose training data is publicly available. The experiments also cover only a single model family and two languages, so the paper leaves open how broader architectures, additional languages, and data-mixture composition affect memorization propensity.

How does PROPME differ from prior memorization evaluation approaches?

Prior work treats memorization as a capability problem using adversarial attacks such as membership inference and prefix attacks, measuring only whether models can reproduce data when provoked; PROPME is described as the first framework to explicitly measure propensity—how often models reproduce training data under typical, non-adversarial prompting—and to contrast it with capability.

What practical recommendation does the paper make for memorization audits?

The paper recommends that memorization audits report both worst-case capability and ordinary-use propensity to provide a realistic assessment of leakage risk, and that future evaluations adopt this dual-setting approach to avoid overestimating practical dangers.

Does the paper address potential misuse of SIMPLETRACE?

Yes; the authors acknowledge that a tool enabling output-to-training-data tracing could in principle be misused, but argue that accountability benefits outweigh this risk, particularly because the tool requires full access to the training corpus; they also caution that lower memorization propensities should not be used to 'green-wash' potential copyright infringement problems.

Who funded this research and where was it conducted?

The research was supported in part by the Danish Foundation Models project funded by the Danish government, and by the MIST project funded by the Novo Nordisk Foundation under grant reference number NNF25OC0103204; computation was performed on the UCloud interactive HPC system managed by the eScience Center at the University of Southern Denmark.

How can PROPME be applied when training data is not publicly available?

The paper states that the propensity transformation and PROPME framework are architecture-agnostic and can be combined with logit-, weight-, or probability-based memorization methods when direct access to training data is unavailable.

Key terms

PROPME: A propensity-aware evaluation framework introduced in this paper that contrasts a model's maximum capability to reproduce training data under adversarial prompting with its natural tendency (propensity) to do so under ordinary prompting.
SIMPLETRACE: A high-speed, open-source tracing pipeline that uses suffix-array indexing and Infini-gram to deterministically attribute text generated by a language model back to specific documents in its training corpus.
memorization capability: The maximum ability of a language model to reproduce training data when subjected to targeted adversarial prompting, representing a worst-case leakage scenario.
memorization propensity: The natural tendency of a language model to reproduce training data during ordinary, non-adversarial use, representing how often leakage actually occurs in realistic deployment.
adversarial prefix attack: A targeted prompting technique that provides a model with the beginning of a training document to elicit verbatim reproduction of the rest, used to measure worst-case memorization capability.
Infini-gram: An indexing system built on suffix arrays, which implicitly index every substring of a text corpus so that substring queries of any length can be answered exactly, unlike fixed-order n-gram indexes that only handle substrings up to a predetermined length.
suffix-array indexing: A data structure that sorts all suffixes of a text corpus lexicographically, enabling fast exact-match search for any substring within the corpus via binary search.
unigram rarity filtering: A SIMPLETRACE step that retains only the K matched spans whose constituent words are least frequent in the corpus, removing spurious matches caused by common boilerplate text.
propensity-aware transformation: A mathematical normalization in PROPME that adjusts a raw capability score downward when the model's propensity to leak data under ordinary prompts is low, producing a more realistic leakage estimate.
Common Pile: One of the two training corpora used in the experiments, requiring a 350 GB memory budget and split into three balanced shards for parallel processing across 128 CPU cores.
Dynaword: One of the two training corpora used in the experiments, smaller than Common Pile and fitting within an 84 GB memory budget, indexed in approximately 3 hours using the same Infini-gram pipeline.
membership inference: A technique for determining whether a specific data sample was part of a model's training set, typically using the model's output probabilities or other internal signals.
verbatim memorization: The exact reproduction of a sequence of tokens from training data in a model's generated output, as opposed to paraphrased or approximate recall.
k-eidetic memorization: A memorization notion under which a sequence counts as memorized if the model can extract it and it appears in at most k training examples; lower k values indicate sequences memorized from fewer repetitions.
ROUGE-L: A text similarity metric based on the longest common subsequence between a generated output and a reference text, used here as one of several supplementary memorization measures.
DFM Decoder: A model from the Danish Foundation Models project, continually pre-trained from Comma on a mixture of two-thirds Dynaword and one-third Common Pile; this shift in training data was found to reduce accessible memorization of the original Common Pile data.
Danish Foundation Models project: A Danish government-funded research initiative that produced some of the open, permissively licensed models evaluated in this paper.
