Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov
A family of open-weights LLMs optimized for dialogue, featuring improved pretraining and iterative RLHF alignment.
How can we scale and align large language models to be both highly helpful and robustly safe for public release?
Open-source models have historically lagged behind closed-source "product" LLMs because they lack the extensive, transparent fine-tuning required to align with human preferences. The authors release Llama 2, a collection of models up to 70B parameters, using a refined pretraining recipe and a multi-stage alignment pipeline that combines supervised fine-tuning with iterative Reinforcement Learning with Human Feedback (RLHF). On human evaluations, the 70B chat variant is competitive with ChatGPT and significantly outperforms existing open-source alternatives in helpfulness and safety.
Paper Primer
The authors improve the base model by increasing the pretraining corpus by 40%, doubling the context length, and adopting Grouped-Query Attention (GQA) to improve inference scalability for larger models. They intentionally avoid aggressive pretraining data filtering to maintain model utility for diverse downstream tasks, such as hate-speech detection.
To align the models for dialogue, the authors use a two-pronged approach: Supervised Fine-Tuning (SFT) on high-quality human-annotated data, followed by iterative RLHF. They introduce Ghost Attention (GAtt), a method that synthetically augments training data to help the model maintain system-level constraints (like "act as X") over long, multi-turn conversations.
Llama 2-Chat 70B is competitive with closed-source models like ChatGPT.
Human evaluators preferred Llama 2-Chat 70B over ChatGPT in 36% of cases, with a 31.5% tie rate, across a set of over 4,000 prompts. Llama 2-Chat 34B achieves a win rate of over 75% against equivalently sized open-source models like Vicuna-33B and Falcon-40B.
Iterative RLHF and GAtt improve dialogue consistency and safety.
GAtt maintains instruction consistency for 20+ turns, and the iterative RLHF process (V1–V5) consistently increased win rates against previous model versions. The latest Llama 2-Chat models exceed a 60% win rate against ChatGPT when judged by GPT-4.
Why use two separate reward models for RLHF instead of one?
Helpfulness and safety objectives often trade off, which makes it hard for a single reward model to perform well on both. Training one reward model per objective lets the authors optimize helpfulness and safety separately and decide explicitly how to combine the two signals.
What is the primary scope of this release?
The authors release 7B, 13B, and 70B parameter models for research and commercial use. They explicitly note that safety testing was conducted primarily in English and that developers must perform application-specific safety tuning before deployment.
Llama 2 provides a transparent, documented blueprint for aligning open-weights models, narrowing the performance gap between community-accessible models and closed-source commercial products.
Introduction
We introduce Llama 2‑Chat, a dialogue‑optimized LLM built by fine‑tuning Llama 2 with RLHF and safety data.
Large language models have become powerful assistants, but training and aligning them at the 70 B scale demands massive compute and extensive human annotation, limiting progress to a few organizations.
Compute the raw attention scores: 8 × 8 = 64 dot‑product operations.
Store the scores as a 64‑element float matrix → 64 × 4 B = 256 B.
Scale to a realistic 4k-token context and 70B model: the matrix grows to 4,000 × 4,000 = 16 M entries, requiring ≈ 64 MB per attention head in each layer, which multiplies across dozens of heads and layers.
Held in memory naively for every head and layer at once, these score matrices alone reach hundreds of gigabytes, illustrating why naïve scaling is infeasible without algorithmic tricks.
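The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch, assuming 4-byte floats and fully materialised score matrices; the head and layer counts (64 and 80) are illustrative assumptions, not figures stated in the text above.

```python
def attention_score_bytes(seq_len, bytes_per_value=4, n_heads=1, n_layers=1):
    """Bytes needed to store full seq_len x seq_len attention score matrices."""
    return seq_len * seq_len * bytes_per_value * n_heads * n_layers


toy = attention_score_bytes(8)             # 8 x 8 scores, one head, one layer
per_matrix = attention_score_bytes(4_000)  # one 4k x 4k score matrix

# Assumed (hypothetical) 64 heads and 80 layers, all matrices held at once.
whole_model = attention_score_bytes(4_000, n_heads=64, n_layers=80)

print(toy)                # 256 bytes
print(per_matrix / 1e6)   # 64.0 MB
print(whole_model / 1e9)  # 327.68 GB
```

Under these assumptions a single score matrix stays at 64 MB; the total only reaches hundreds of gigabytes once every head and layer is counted.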
A dialogue‑tuned variant of Llama 2 that combines reinforcement learning from human feedback with safety‑focused data to produce helpful, low‑risk responses.
**Figure 2.** Win-rate % for helpfulness and safety between commercial-licensed baselines and Llama 2-Chat, according to GPT-4. To complement the human evaluation, we used a more capable model, not subject to our own guidance. Green area indicates our model is better according to GPT-4. To remove ties, we used $win/(win + loss)$. The orders in which the model responses are presented to GPT-4 are randomly swapped to alleviate bias.
**Table 49.** Distribution of mean sentiment scores across groups under the political ideology domain from the BOLD prompts.
**Table 50.** Distribution of mean sentiment scores across groups under the profession domain from the BOLD prompts.
The key shift is moving from raw, pretrained LLMs to aligned chat models that balance helpfulness with safety.
Pretraining Methodology
We scale Llama by cleaning data, feeding more tokens, and redesigning attention for longer contexts.
Scaling Llama to larger sizes exposed two bottlenecks: the original 2k context limited long-range reasoning, and the key-value cache of standard multi-head attention grew with context length and batch size, hurting inference scalability.
Our pretraining approach therefore adds three systematic changes: cleaner data, 40 % more tokens, and architectural tweaks that double context length while introducing grouped‑query attention.
We keep the same auto-regressive transformer backbone but increase the amount of text it sees and restructure attention so that groups of query heads share a single set of key and value projections, shrinking the key-value cache that must be held at inference time.
With $L\!=\!4{,}000$, each sequence contains $4{,}000$ tokens; the number of sequences is $2\!\times\!10^{12} / 4{,}000 = 5\!\times\!10^{8}$.
Attention-score cost scales as $O(L)$ per token and $O(L^2)$ per sequence for both standard multi-head attention and GQA, so doubling the context from $2{,}000$ to $4{,}000$ tokens doubles the per-token attention cost.
GQA leaves this term unchanged; with $h$ query heads sharing $g$ key-value heads, it instead shrinks the key and value projections and the key-value cache by a factor of $h/g$.
The saving therefore appears as lower inference memory and larger feasible batch sizes, not as a shorter training run.
Doubling the context length raises attention cost and cache size; GQA offsets the cache growth so that the larger models remain practical to serve.
How does Grouped‑Query Attention differ from the standard multi‑head attention used in Llama 1?
Standard multi-head attention gives each of the $h$ heads its own query, key, and value projections. GQA keeps $h$ separate query heads but divides them into groups, and the heads in each group share a single key projection and a single value projection. The number of query-key dot-products is unchanged; what shrinks is the number of key and value heads, and with it the key-value cache that must be stored during generation, which is why GQA improves inference scalability for the larger models.
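To make the memory effect concrete, the sketch below estimates the key-value cache size when every query head has its own key and value head (multi-head), when groups of query heads share one (grouped-query), and when all query heads share one (multi-query). It is a rough illustration rather than the authors' code: the layer count, head counts, head dimension, and 2-byte values are assumptions chosen for the example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Bytes held by cached keys and values for one batch of sequences."""
    keys_and_values = 2  # one tensor for keys, one for values
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration: 80 layers, 64 query heads of dimension 128.
common = dict(n_layers=80, head_dim=128, seq_len=4096, batch_size=1)
mha = kv_cache_bytes(n_kv_heads=64, **common)  # each query head has its own K/V
gqa = kv_cache_bytes(n_kv_heads=8, **common)   # 8 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **common)   # all query heads share one K/V head

print(mha // gqa, mha // mqa)    # 8 64
print(mha / 2**30, gqa / 2**30)  # 10.0 1.25 (GiB per sequence)
```

Because the cache grows linearly with batch size and context length, the same reduction factor decides how large a batch fits in memory before an out-of-memory error.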
We retain the Llama 1 byte‑pair encoding tokenizer, which splits every number into individual digits and falls back to raw bytes for unknown UTF‑8 characters, keeping the vocabulary at 32 k tokens.
The model uses RMSNorm for stable pre-normalization, SwiGLU as the activation to increase representational capacity, and rotary positional embeddings (RoPE), which encode each token's position as a rotation of its query and key vectors, all of which were already present in Llama 1.
**Table 2.** CO$_2$ emissions during pretraining.
**Table 3.** Overall performance on grouped academic benchmarks compared to open-source base models.
Alignment and Fine-tuning
We detail the fine‑tuning pipeline, from supervised data to RLHF and the new Ghost Attention trick.
Fine‑tuning Llama 2‑Chat proceeds in three stages: a high‑quality supervised fine‑tuning (SFT) pass, a reward‑model‑driven RLHF loop, and the Ghost Attention (GAtt) augmentation that stabilises multi‑turn instructions.
RLHF treats a fine‑tuned language model as a policy and uses a learned reward model—trained on human preference comparisons—to steer the policy toward outputs humans prefer.
How does RLHF differ from ordinary supervised fine‑tuning?
Supervised fine‑tuning learns from a fixed (prompt, answer) dataset, whereas RLHF introduces a learned reward signal that ranks many possible answers and updates the model to maximise that signal, effectively turning the model into a policy that optimises for human preference.
Collect a high‑quality instruction‑tuning corpus (≈ 27 k examples) and split each entry into a prompt and an answer.
Concatenate all prompts and answers, inserting a special separator token between them.
Apply an autoregressive loss but mask out the prompt tokens so gradients flow only on answer tokens.
Use a cosine learning‑rate schedule (initial $2\times10^{-5}$), weight decay $0.1$, batch size $64$, and sequence length $4096$.
Train for two epochs on the assembled dataset.
Both prompts and answers are tokenised; a separator token
The concatenated sequence becomes
During loss computation, tokens belonging to the two prompts are masked (loss = 0), while answer tokens receive the cross‑entropy loss.
After two epochs the model learns to map the first prompt to its haiku and the second prompt to the explanation, despite the tiny dataset.
Masking prompts forces the model to treat the answer as the only learning signal, which dramatically reduces over‑fitting to prompt phrasing.
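The prompt-masking step can be sketched in plain Python. This is a simplified illustration, not the training code: the token ids, the separator id, and the per-token probabilities are hypothetical, and the usual one-position shift between inputs and targets is omitted for brevity.

```python
import math


def build_example(prompt_ids, answer_ids, sep_id):
    """Concatenate prompt, separator, and answer; mark which tokens are trained on."""
    tokens = prompt_ids + [sep_id] + answer_ids
    mask = [False] * (len(prompt_ids) + 1) + [True] * len(answer_ids)
    return tokens, mask


def masked_sft_loss(token_log_probs, is_answer_token):
    """Mean negative log-likelihood over answer tokens only."""
    answer_losses = [-lp for lp, keep in zip(token_log_probs, is_answer_token) if keep]
    if not answer_losses:
        return 0.0
    return sum(answer_losses) / len(answer_losses)


# Hypothetical token ids and model probabilities, for illustration only.
tokens, mask = build_example([11, 12, 13], [21, 22], sep_id=0)
log_probs = [math.log(p) for p in (0.9, 0.8, 0.7, 0.6, 0.5, 0.25)]
loss = masked_sft_loss(log_probs, mask)

print(tokens)          # [11, 12, 13, 0, 21, 22]
print(mask)            # [False, False, False, False, True, True]
print(round(loss, 4))  # 1.0397
```

Only the two answer tokens contribute to the loss here; changing the probabilities assigned to prompt tokens leaves the result untouched.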
With a solid SFT foundation we train reward models, then close the loop via RLHF to align the policy with human preferences.
Gather a new batch of human preference comparisons after each policy update.
Update the reward model on the expanded dataset (one epoch, cosine LR, max LR $5\times10^{-6}$ for 70 B).
Choose a fine‑tuning algorithm: either Rejection Sampling (RS) or Proximal Policy Optimization (PPO).
For RS: sample $K$ candidates per prompt, score them with the current reward model, and keep the highest‑scoring candidate for a supervised update.
For PPO: treat the reward model score as an estimate of the true reward, add a penalty for diverging from the original policy, and perform clipped policy-gradient updates.
Repeat the cycle, optionally mixing top samples from all previous iterations to avoid forgetting.
Rejection Sampling fine‑tuning loop
Draft 1 receives reward 0.62, draft 2 receives 0.71, draft 3 receives 0.55.
The algorithm selects draft 2 (the highest score) as the training target.
The supervised step updates the policy to increase the probability of producing draft 2‑style summaries.
Rejection Sampling amplifies the tail of the reward distribution: the maximum reward rises with $K$, while the median stays flat, yielding a larger gap for improvement.
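The selection step, and the widening gap between maximum and median reward, can be sketched as follows. This is a minimal illustration: the draft scores are those of the example above, while the uniform random numbers in the second part are only a stand-in for reward-model scores, not real data.

```python
import random
import statistics


def select_best(candidates, rewards):
    """Return the candidate with the highest reward-model score."""
    best_index = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best_index], rewards[best_index]


drafts = ["draft 1", "draft 2", "draft 3"]
scores = [0.62, 0.71, 0.55]
best, best_score = select_best(drafts, scores)
print(best, best_score)  # draft 2 0.71


def max_and_median(k, rng):
    """Max and median of k simulated scores (uniform noise as a stand-in)."""
    sample = [rng.random() for _ in range(k)]
    return max(sample), statistics.median(sample)


rng = random.Random(0)
for k in (1, 10, 100):
    trials = [max_and_median(k, rng) for _ in range(2000)]
    mean_max = sum(t[0] for t in trials) / len(trials)
    mean_median = sum(t[1] for t in trials) / len(trials)
    # The mean maximum climbs with k while the mean median stays near 0.5.
    print(k, round(mean_max, 2), round(mean_median, 2))
```

The gap between the two printed columns corresponds to the potential gain from fine-tuning on the best of $K$ samples.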
Beyond reward‑driven updates we needed a mechanism to keep system‑level instructions alive across many dialogue turns.
GAtt concatenates a persistent “system” instruction to every user turn when sampling training dialogues, then keeps it only in the first user turn and sets the loss to zero on all tokens from earlier turns, so that the model learns to honour the instruction throughout the conversation.
We prepend “Answer only with emojis” to the first user message, yielding tokens
During fine-tuning, the loss is set to zero for every token from turn 1, including the first assistant reply, and is computed only on the assistant reply in turn 2.
The model learns to keep attending to the instruction in the first message, so the second assistant output still follows the emoji constraint.
Zero-loss masking avoids training on earlier assistant messages that were sampled with the instruction repeated in every turn, while the instruction kept in the first turn still conditions the final reply.
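The construction of a single GAtt training sample can be sketched as below. This is an illustrative reading of the procedure, not the authors' implementation: the function name, the `(role, text, train_on)` layout, and the placeholder replies are assumptions made for the example.

```python
def build_gatt_sample(instruction, turns):
    """Build one GAtt-style training sample from a multi-turn dialogue.

    turns: list of (user_message, assistant_message) pairs whose assistant
    messages were sampled with the instruction attached to every user turn.
    Returns a list of (role, text, train_on) triples.
    """
    sample = []
    last = len(turns) - 1
    for i, (user, assistant) in enumerate(turns):
        # Keep the instruction only in the first user message.
        user_text = f"{instruction} {user}" if i == 0 else user
        sample.append(("user", user_text, False))
        # Only the final assistant message contributes to the loss.
        sample.append(("assistant", assistant, i == last))
    return sample


# Placeholder replies stand in for emoji-only answers.
dialogue = [
    ("How are you?", "<emoji reply 1>"),
    ("What is the weather like?", "<emoji reply 2>"),
]
for role, text, train_on in build_gatt_sample("Answer only with emojis.", dialogue):
    print(role, repr(text), "loss" if train_on else "masked")
```

The instruction appears once, in the first user message, and only the last assistant reply is marked for training.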
How does Ghost Attention differ from simply prepending the instruction to every turn?
Prepending the instruction to every turn is how the training dialogues are sampled, but at inference time the instruction appears only once, at the start of the conversation. GAtt therefore drops the instruction from all but the first turn of the training sample and masks the loss on earlier turns, so the model learns to follow an instruction that was stated once rather than one that is restated before every reply.
**Table 5.** SFT annotation — example of a helpfulness (top) and safety (bottom) annotation for SFT, where the annotator has written both the prompt and its answer.
**Table 6.** Statistics of human preference data for reward modeling. We list both the open-source and internally collected human preference data used for reward modeling. Note that a binary human preference comparison contains 2 responses (chosen and rejected) sharing the same prompt (and previous dialogue). Each example consists of a prompt (including previous dialogue if available) and a response, which is the input of the reward model. We report the number of comparisons, the average number of turns per dialogue, the average number of tokens per example, per prompt and per response. More details on Meta helpfulness and safety data per batch can be found in Appendix A.3.1.
**Table 7.** Reward model results. Performance of our final helpfulness and safety reward models on a diverse set of human preference benchmarks. Note that our model is fine-tuned on our collected data, as opposed to the other baselines that we report.
**Table 8.** Granular reward model accuracy per preference rating. We report per-preference rating accuracy for both Helpfulness and Safety reward models on the Meta Helpfulness and Safety test sets. The reward models show superior accuracy on more distinct responses (e.g., significantly better) and lower accuracy on similar responses (e.g., negligibly better).
**Figure 6.** Scaling trends for the reward model. More data and a larger-size model generally improve accuracy, and it appears that our models have not yet saturated from learning on the training data.
**Figure 7.** Max and median reward among N samples, $N \in [1, \dots, 100]$ averaged over our training set of prompts. The delta between max and median can be interpreted as potential gain with Rejection Sampling.
**Figure 8.** RLHF impact of the temperature when sampling N outputs and scoring them with a reward model
**Figure 9.** Issues with multi-turn memory (left) can be improved with GAtt (right).
**Figure 10.** Attention visualization for a dialogue with and without GAtt. We considered the maximum activations across the network and we bin neighboring tokens together.
**Figure 11.** Evolution of Llama 2-Chat. We show the evolution after multiple iterations fine-tuning for the win-rate % of Llama 2-Chat compared to ChatGPT. Left: the judge is our reward model, which may favor our model, and right, the judge is GPT-4, which should be more neutral.
**Table 32.** Number of prompts for human evaluations.
Safety Measurements and Mitigations
We quantify how each safety component impacts violation rates and robustness.
Safety is measured by how much each mitigation reduces harmful outputs while preserving helpfulness.
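Fine-tuning cuts the ToxiGen toxicity rate of the 70B model from 24.60% for pretrained Llama 2 to 0.01% for Llama 2-Chat.
Table 14 reports the toxicity percentages of the fine-tuned models, including the 70B model; the pretrained figure comes from the corresponding evaluation of the base models.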
Red-team robustness $\gamma$ improves from 1.8 to 0.45 on the 7B model over several red-teaming iterations and model refinements (lower is better).
Section 4.3 reports the average number of violating prompts per person‑hour before and after safety refinements.
Llama 2-Chat's overall safety violation rate is roughly 30 percentage points lower than that of the worst open-source competitor.
Figure 3 shows Vicuna 33 B‑v1.3 at ~40 % violations while all Llama 2‑Chat variants stay below 10 %.
**Figure 3.** Safety human evaluation results for Llama 2-Chat compared to other open-source and closed-source models. Human raters judged model generations for safety violations across ~2,000 adversarial prompts consisting of both single and multi-turn prompts. More details can be found in Section 4.4. It is important to caveat these safety results with the inherent bias of LLM evaluations due to limitations of the prompt set, subjectivity of the review guidelines, and subjectivity of individual raters. Additionally, these safety evaluations are performed using content standards that are likely to be biased towards the Llama 2-Chat models.
**Figure 15.** Safety data scaling trends. Left: as we increase the amount of safety data in model training, the mean safety RM score improves significantly while the helpfulness counterpart remains relatively stable. Right: the left tail of safety RM scores (i.e., most unsafe responses) gradually disappears with the addition of more safety training data.
**Figure 16.** Context distillation analysis. Left: Distribution of safety RM scores from the base model, when adding a generic preprompt, and when adding a preprompt based on the risk category with tailored answer template. While a generic preprompt increases safety RM scores, a preprompt with tailored answer template helps even more. Right: Context distillation increases the RM score significantly for samples that initially have a low score, but can also have a detrimental effect on samples that initially have a high score. We therefore only apply context distillation on targeted samples when it increases RM score.
Related Work
We survey recent LLM scaling, open‑source efforts, and alignment techniques shaping the field.
The field has progressed from early scaling-law models such as GPT-3, Gopher, and Galactica to token-efficient systems like Chinchilla and the computationally lean Llama 1 (Touvron et al., 2023). Open-source releases, including BLOOM, OPT, and Falcon, challenge closed-source counterparts (GPT-3, ChatGPT, Bard, Claude), while instruction-tuned models such as Vicuna and Alpaca, together with techniques such as RLHF and chain-of-thought prompting, aim to improve usability and safety. Safety literature highlights risks (bias, toxicity, data leakage) and mitigation efforts ranging from red-teaming to policy-level discussions.
A family of 7 B–70 B parameter models designed for inference efficiency and ease of deployment.
Pretraining Evaluation
Llama 2 70B achieves the best scores among the compared open-source models across the evaluated benchmarks.
Llama 2 70B outperforms all open‑source baselines on MMLU, standard benchmarks, code generation, and QA tasks.
Tables 19‑24 show Llama 2 70B achieving the highest scores in every reported category.
The tables collectively demonstrate that scaling to 70 B parameters yields consistent improvements, and that the Llama 2 architecture benefits uniformly across diverse task families.
**Table 19.** Five-shot performance on the Massive Multitask Language Understanding (MMLU) benchmark
**Table 20.** Performance on standard benchmarks.
**Table 21.** Code generation results on Human-Eval and MBPP. We report 0-shot and 3-shot results for Human-Eval and MBPP respectively. For pass@100 and pass@80 scores, we use a temperature of 0.8 and top-p=0.95. For pass@1 scores, we use a temperature of 0.1 and top-p=0.95.
**Table 22.** (Left) NaturalQuestions. Exact match performance. (Right) TriviaQA. Zero-shot and few-shot exact match performance on the filtered dev set. For TriviaQA, we evaluate on Wiki validation subset.
This table presents the performance of various language models (MPT, Falcon, LLaMA 1, and LLaMA 2) across different parameter sizes on the GSM8k and MATH benchmarks.
**Table.** Comparison of performance between Baseline and + GAtt across different dialogue turns.
Pretraining Benchmarks
Key statistics and ablations of the preference data and reward‑model margins used in fine‑tuning.
We present the scale and composition of the Meta human preference data collected for RLHF, and evaluate how loss‑function variants affect reward‑model performance.
Adding a small preference‑rating‑based margin raises average helpfulness accuracy.
Table 28 reports 63.0 % average accuracy with the small margin versus 62.5 % without any margin.
Incorporating the safety auxiliary loss markedly improves unsafe‑response recall.
Table 29 shows recall rising from 73.0 % (baseline) to 90.4 % with the auxiliary loss.
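A common way to add a preference-rating-based margin to a pairwise ranking loss is to subtract it from the score difference before the sigmoid. The sketch below assumes that form; the scores and the margin values per rating are hypothetical and chosen only to show the effect.

```python
import math


def ranking_loss(score_chosen, score_rejected, margin=0.0):
    """Pairwise ranking loss: -log(sigmoid(chosen - rejected - margin))."""
    diff = score_chosen - score_rejected - margin
    return math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))


# Hypothetical margins per preference rating, for illustration only.
margins = {
    "significantly better": 1.0,
    "slightly better": 1 / 3,
    "negligibly better": 0.0,
}

for rating, margin in margins.items():
    # Same score gap, but a larger margin demands a wider separation.
    print(rating, round(ranking_loss(1.2, 0.4, margin), 4))
```

For a fixed score gap, a larger margin yields a larger loss, which pushes the reward model to separate clearly distinct responses more widely than near-ties.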
**Figure 25.** Distribution of human preference data rating over batches. Over time, the share of samples with an unsure or negligibly better rating become larger with better performing Llama 2-Chat trained and available for preference data annotation.
**Figure 26.** Annotation curriculum. Evolution for each new batch of the maximum and median score given a reward model for prompts samples with a models trained on each of the batches. We can see that the score progressively decrease, suggesting that the prompts are on average harder in the most recent batches.
**Figure 27.** Reward model score distribution shift caused by incorporating preference rating based margin in ranking loss. With the margin term, we observe a binary split pattern in reward distribution, especially with a larger margin.
Safety Evaluation Results
Key safety and alignment findings for Llama 2‑Chat with GAtt.
GAtt enables Llama 2‑Chat to retain attribute recall at 100 % for up to 20 dialogue turns, whereas the baseline drops to 10 % by turn t + 3 and to 0 % thereafter.
Table 30 reports 100 % accuracy through turn 20 with GAtt versus rapid decay without it.
Beyond the 20‑turn memory test, GAtt also generalises zero‑shot: when instructed “answer in one sentence only,” the model obeys without degradation. Moreover, applying GAtt to Llama 1 (pre‑trained with a 2048‑token context) allowed the model to respect attributes beyond that window, hinting at scalability to longer contexts.
To probe reward‑model robustness we built a test set of safety and helpfulness prompts, collected 7‑point Likert ratings from annotators, and found the models’ scores well‑calibrated with human preference despite being trained with pairwise ranking loss.
Human evaluation covered >4,000 single- and multi-turn prompts, with single-turn prompts spanning five categories (factual questions, writing and content creation, language assistance, recommendations, and dialogue). System prompts (Table 31) were prepended to each query; for open-source models we limited context/generation to 1,000 tokens, while closed-source models used 2,000 tokens.
Figure 32 visualises the safety‑helpfulness tension: safe responses can have low helpfulness (bottom‑right corner) and unsafe responses can be highly helpful (top‑left corner), illustrating the trade‑off addressed by separate reward models.
Scaling safety data improves refusal behaviour. Table 36 shows the model refusing offensive jokes at 50 % safety data; Table 38 demonstrates a shift from culinary to sexual interpretation of “sex in a pan” as safety data rises, with safety RM scores climbing from 0.53 to 0.92.
Pronoun categories follow the PaLM 2 taxonomy: 1st‑person (I, we), 2nd‑person (you), 3rd‑person (he, she, they, it), plus gender‑specific and unknown sets.
Context-distillation preprompts (Table 39) are safety instructions prefixed to adversarial prompts to elicit safer responses; the model is then fine-tuned on those responses without the preprompt, so that the safer behaviour is distilled into the model itself.
Safety‑error analysis (Table 40) reveals two failure modes: false refusals (e.g., refusing a factual query about the Republican Party’s elephant) and vague responses (e.g., non‑committal answers to racial‑stereotype prompts). Figure 33 quantifies the false‑refusal rate rise with higher safety‑data percentages.
**Figure 33.** The false model refusal rate increases with the percentage of safety data. Left: false refusal rate on the helpfulness dataset ranges from 0.006% (i.e., 1 occurrence) to 0.05% (i.e., 8 occurrences); Right: false refusal rate on the borderline dataset ranges from 15% to 27%.
**Figure 24.** Multi-query variants enable higher throughput with larger batch sizes, and show similar latency on smaller batches. Output length is fixed at 128 tokens. The first data point corresponds to batch size 1, and then we double it until the model runs out of memory. The MHA variant triggers an out-of-memory error at a batch size of 1024 for a context of 256 tokens and at a batch size of 128 for 2k context, whereas MQA and GQA have successful runs in those settings.
**Figure 32:** Safety and Helpfulness reward model scores on a set of safe (left) and unsafe (right) responses from the safety test set. The safe or unsafe labels are provided by annotators during preference annotation. Conflicts can be observed between the two aspects at the bottom right corner (i.e., high safety score but low helpfulness score) of the safe response plot and the top left corner (i.e., low safety score but high helpfulness score) of the unsafe response plot.
Annotation and Toxicity
Key toxicity and sentiment findings from the annotation study.
For the fine-tuned Llama 2-Chat models, toxicity rates across all demographic groups are at most 0.17%.
Table 45 reports percentages ranging from 0% to 0.17% for each group.
Fine-tuned chat models consistently achieve higher sentiment scores than their pretrained counterparts across race, gender, and religious domains. The gap is most pronounced for Llama 2-Chat, which leads in several categories. These trends indicate that fine-tuning shifts generations toward more positive sentiment while keeping toxicity negligible.
**Table 47.** Distribution of mean sentiment scores across groups under the gender domain among the BOLD prompts.
**Table 48.** Distribution of mean sentiment scores across groups under the religious ideology domain from the BOLD prompts.
Dataset Contamination Analysis
We quantify token‑level contamination using long n‑gram matches and assess its impact on benchmark scores.
Earlier works detect contamination by comparing raw text strings, ignoring how prompts are tokenized for evaluation. In contrast, we operate on tokenized inputs, feeding fully verbalized samples to the tokenizer before matching.
We deem a token contaminated when it belongs to any token $n$‑gram longer than $10$ tokens that appears in both the evaluation sample and the training corpus. The contamination percentage of a sample is the proportion of its tokens that meet this criterion, enabling us to isolate high‑precision clean (< 20 % contamination) and contaminated (> 80 % contamination) subsets. A “skip‑gram budget” of four tokens allows matched spans to differ in up to four positions, but we forbid mismatches in the first $10$ tokens or at the trailing end.
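The per-sample contamination percentage can be sketched as follows. This is a simplified illustration that assumes exact n-gram matching and omits the skip-gram budget; the function names and the integer token ids are hypothetical. Checking every window of 11 tokens is sufficient here, because any matching n-gram longer than 10 tokens is fully covered by matching 11-token windows.

```python
def ngrams(tokens, length):
    """All contiguous n-grams of the given length, as a set of tuples."""
    return {tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)}


def contamination_fraction(sample_tokens, training_ngrams, min_len=11):
    """Fraction of sample tokens covered by an n-gram of at least min_len
    tokens that also occurs in the training corpus (exact match only)."""
    n = len(sample_tokens)
    contaminated = [False] * n
    for start in range(n - min_len + 1):
        window = tuple(sample_tokens[start:start + min_len])
        if window in training_ngrams:
            for i in range(start, start + min_len):
                contaminated[i] = True
    return sum(contaminated) / n if n else 0.0


# Hypothetical token ids: 15 tokens copied from training, then 5 unseen ones.
training = list(range(100))
training_index = ngrams(training, 11)
sample = list(range(20, 35)) + [900, 901, 902, 903, 904]

fraction = contamination_fraction(sample, training_index)
print(fraction)  # 0.75
```

A sample at 0.75 falls between the two thresholds, so it would belong to neither the clean (< 20 %) nor the contaminated (> 80 %) subset.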