Tülu 3: A Fully-Open State-of-the-Art Post-Trained Model Family

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi

Tülu 3 provides a fully open-source post-training recipe that matches the performance of closed-source frontier models.

How can we create a fully open, state-of-the-art post-training recipe for language models that matches proprietary performance?

Post-training recipes for frontier language models are rarely transparent, leaving the open-source community to rely on outdated or undocumented pipelines. Tülu 3 bridges this gap by releasing a complete, multi-stage post-training framework—including data, code, and recipes—that uses supervised finetuning, preference tuning, and a novel reinforcement learning method with verifiable rewards. The resulting models outperform existing open-weight instruct models and match the performance of closed-source systems like GPT-4o-mini and Claude 3.5-Haiku.

Paper Primer

The Tülu 3 pipeline follows a four-stage process: data curation, supervised finetuning (SFT), preference tuning via length-normalized Direct Preference Optimization (DPO), and a final stage of Reinforcement Learning with Verifiable Rewards (RLVR). The core innovation in the final stage is the use of ground-truth verification for tasks like math and coding, which grants rewards only when the model's output is objectively correct.

Tülu 3 models achieve state-of-the-art performance among open-weight models and compete with closed-source frontier models.

Benchmarks cover knowledge recall, reasoning, math, and coding, with comparisons against Llama 3.1 Instruct, Qwen 2.5, and Claude 3.5-Haiku. On the Tülu 3 evaluation suite, the 70B model matches the performance of GPT-4o-mini and Claude 3.5-Haiku.

The authors emphasize rigorous decontamination of training data against their evaluation suite to ensure reported gains are genuine. They also report negative results, such as the finding that training for longer than two epochs in SFT did not yield further performance improvements.

Why does this paper focus on "verifiable rewards" instead of a traditional reward model?

Traditional reinforcement learning from human feedback (RLHF) relies on a learned reward model, which can be noisy or misaligned. RLVR uses objective ground-truth verification (e.g., code execution or math solution checking) to provide a constant reward, ensuring the model is reinforced only for correct reasoning.

What is the scope of the Tülu 3 release?

The release is comprehensive: it includes the final model weights, intermediate checkpoints, the full SFT and preference datasets, the training code, and the evaluation toolkit used for decontamination and benchmarking.

Researchers can now use the Tülu 3 recipe as a baseline for multi-stage post-training, shifting the focus from "black-box" proprietary methods to a transparent, reproducible pipeline for aligning base models.

Abstract

We present Tülu 3, a fully open, state‑of‑the‑art post‑training recipe for language models.

Post‑training refines language‑model behavior and adds new capabilities, but open, reproducible recipes are scarce. We introduce Tülu 3, a fully open, four‑stage pipeline (data curation, Supervised Finetuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR)) that matches or exceeds state‑of‑the‑art instruction‑tuned models, including closed systems such as GPT‑4o‑mini and Claude 3.5‑Haiku. The release supplies model weights, data, code, a multi‑task evaluation suite, and a detailed reproducibility report.

After a base language model is trained, additional fine‑tuning steps adjust its behavior for specific tasks without altering the underlying architecture.

Tülu 3 is a fully open, state‑of‑the‑art post‑training recipe.

Introduction

Open‑source post‑training pipelines lag behind proprietary ones, prompting a new open solution.

Recent advances in language‑model post‑training (instruction tuning, RLHF, etc.) are dominated by closed, proprietary pipelines. Open‑source alternatives such as Tülu 2 or Zephyr‑$\beta$ remain based on simpler, cheaper recipes and fall behind on most benchmarks.

For a sequence of length $N = 128$, the attention matrix has $N^2 = 128^2 = 16{,}384$ entries.

Storing each entry as a 4‑byte float requires $16{,}384 \times 4 \approx 64\text{ KB}$ of memory.

If we double the length to $N=256$, memory grows to $256^2 \times 4 \approx 256\text{ KB}$ – a quadratic increase.

Quadratic memory growth makes naïve attention prohibitive for long contexts, motivating more efficient pipelines.

To close this gap we introduce Tülu 3, an open‑source family that combines new data, a rigorous evaluation suite, and a four‑stage recipe (data curation, SFT, preference tuning, and RLVR) to match or exceed closed‑source performance. The recipe iteratively refines data mixes and hyper‑parameters, yielding checkpoints that excel across reasoning, math, coding, safety, and instruction following.

**Table 1.** Models, datasets, and code released with TÜLU 3.

Data Overview

Describes Tülu 3’s data curation, evaluation suite, and released artifacts.

Early post‑training followed the InstructGPT recipe of instruction tuning followed by preference finetuning; recent work has added multiple rounds of training, mixtures of human and synthetic data, and varied objectives. Most successful models keep their data, code, and recipes hidden.

Open efforts such as Tülu 2 and Zephyr‑$\beta$ improve some benchmarks but still fall short on core capabilities like MATH, IFEval, and GSM8K. Moreover, no model in the top‑50 on ChatBotArena has released its post‑training data.

Tülu 3 pushes the boundary by integrating partial proprietary details, novel techniques, and academic research, while openly sharing data, recipes, and findings. Its success rests on careful data curation, rigorous experimentation, innovative methodologies, and improved training infrastructure.

Section 2.1 (“Tülu 3 Data”) describes how we curate data to target core skills—knowledge recall, reasoning, mathematics, coding, instruction following, general chat, and safety—using public sources and synthetic curation. Table 3 lists these capabilities and the corresponding evaluation benchmarks, while Table 7 enumerates the datasets collected for training.

**Table 3.** TÜLU 3 EVAL consists of development and unseen splits to evaluate core skills. With TÜLU 3 EVAL, we release a unified standardized evaluation suite and a toolkit to decontaminate training data against benchmarks. The subscript shows the metric we use for evaluation. TÜLU 3 Safety is a collection of safety evaluations taking the average score across them (avg*), see Sec. 7.2.1 for details.

Section 2.2 (“Tülu 3 Evaluation”) introduces Tülu 3 Eval, a unified, standardized evaluation suite and toolkit that decontaminates training data against benchmarks. The framework provides an open toolkit (Sec 7.1) and separate development and held‑out splits (Sec 7.2‑7.3), covering all identified skills except unseen safety.

**Table 24.** The TÜLU 3 Evaluation Regime: settings for development (top) and unseen (bottom) portions of the evaluation suite. CoT are evaluations run with chain of thought prompting (Wei et al., 2022b). #Shots is the number of in-context examples in the evaluation template. Chat refers to whether we use a chat template while prompting the model. Multiturn ICL refers to a setting where we present each in-context example as a separate turn in a conversation (applicable only when a chat template is used and # Shots is not 0). * Average over multiple sub-evaluations – full details of the safety evaluation are included in the Appendix.

The evaluation suite is released publicly at https://github.com/allenai/olmes; together with the released data and recipe, it lets the community reproduce and extend the open post‑training pipeline.

The Tülu 3 Recipe

Four sequential stages turn a base model into a state‑of‑the‑art post‑trained system.

Existing post‑training pipelines are often closed‑source and undocumented, leaving practitioners guessing which data mixes or algorithmic tweaks actually matter.

The recipe stitches together four focused training stages—curated data, supervised finetuning, preference optimisation, and verifiable‑reward RL—so that each stage can specialise on a subset of capabilities while preserving what previous stages have already learned.

As a toy illustration, take four prompts: two generic and two math.

Stage 1: Assign the two generic prompts to the “general” bucket and the two math prompts to the “skill‑specific” bucket.

Stage 2: Run one epoch of supervised finetuning on all four prompts; the model learns to answer both types, achieving 80 % accuracy on the generic set and 60 % on the math set.

Stage 3: Generate two completions per prompt, collect pairwise preferences, and apply length‑normalized DPO for one update; the math prompts improve to 75 % while generic accuracy stays at 80 %.

Stage 4: Define a verifiable reward that returns 1 only if the math answer equals 4; run a short RLVR loop (5 gradient steps) that pushes the math accuracy to 90 % without harming the generic prompts.

The staged design lets a cheap supervised pass establish a solid baseline, after which preference and RL steps can focus on the remaining weak spots without undoing earlier gains.

Each stage is backed by concrete engineering choices: data provenance checks in Stage 1, mixed‑skill data balancing in Stage 2, length‑normalisation to avoid reward bias in Stage 3, and an asynchronous vLLM‑based RL loop in Stage 4.

**Figure 1.** An overview of the Tülu 3 recipe. This includes: data curation targeting general and target capabilities, training strategies and a standardized evaluation suite for development and final evaluation stage.

Empirically, the pipeline yields consistent gains across all benchmarks; Table 5 shows the 70 B variant surpassing prior state‑of‑the‑art models, while Table 6 demonstrates similar improvements for the 8 B variant.

How does Tülu 3’s RLVR differ from the conventional RLHF approach that relies on a learned reward model?

RLHF first trains a separate reward model on human preferences, then uses that model as a proxy during RL. RLVR skips the proxy entirely: it only issues a binary reward when a generated answer can be verified against a ground‑truth solution (e.g., a correct math result). This eliminates reward‑model bias and reduces training complexity.

The four‑stage pipeline (data curation, SFT, preference tuning with DPO, and RLVR) provides a transparent, reproducible path from a base LM to state‑of‑the‑art post‑trained performance.

Supervised Finetuning

Balancing diverse instruction data is the core obstacle that our SFT recipe solves.

When we fine‑tune a large language model on instruction data, the biggest practical pain point is how to combine many heterogeneous datasets without letting the most abundant source drown out the rarer, high‑value skills.

We take a pretrained model and continue training it on a curated set of instruction → response pairs, deliberately mixing skill‑specific subsets so that each capability receives enough exposure.

How is SFT different from a naïve “instruction‑tuning” that simply adds one large dataset?

Naïve instruction tuning treats the new data as a single monolith, so the model’s gradients are dominated by whatever source is biggest. Our SFT recipe explicitly splits the data by skill, filters and re‑weights each slice, and even merges skill‑focused checkpoints, guaranteeing that each capability receives a comparable training signal.

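As a toy illustration, suppose the raw pool holds 2 000 math, 3 000 code, and 5 000 chat examples.

Compute raw proportions: Math 20 %, Code 30 %, Chat 50 %.

Down‑sample Chat to 2 000 examples (its share of the mix drops to about 29 %).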

Keep Math and Code unchanged; the mix now contains 2 000 + 3 000 + 2 000 = 7 000 examples.

Duplicate the Math bucket once more (adding another 2 000) to reach 9 000 examples.

Sample 1 000 random examples from the combined pool to reach the target size of 10 000.

Shuffle the final 10 000 examples and feed them to the trainer.

Balancing the proportions forces the model to see each skill roughly equally often, which mitigates the risk that abundant chat data would otherwise suppress learning on the scarcer math and code skills.
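The balancing steps above can be sketched in a few lines of Python. This is only an illustration of the toy arithmetic, not the released mixing code: the bucket sizes (2 000 math, 3 000 code, 5 000 chat) are inferred from the stated proportions and totals, and `build_mix`, `chat_cap`, and the placeholder examples are hypothetical names.

```python
import random

def build_mix(buckets, chat_cap=2000, target_size=10000, seed=0):
    """Toy re-balancing: cap chat, duplicate math, top up, shuffle."""
    rng = random.Random(seed)
    # Down-sample the over-represented chat bucket.
    chat = rng.sample(buckets["chat"], chat_cap)
    # Up-sample the scarce math bucket by duplicating it once.
    math_examples = buckets["math"] * 2
    pool = math_examples + buckets["code"] + chat
    # Top up to the target size by re-sampling from the combined pool.
    shortfall = max(target_size - len(pool), 0)
    pool = pool + rng.choices(pool, k=shortfall)
    rng.shuffle(pool)
    return pool

# Placeholder examples: (skill, index) tuples standing in for real prompts.
buckets = {
    "math": [("math", i) for i in range(2000)],
    "code": [("code", i) for i in range(3000)],
    "chat": [("chat", i) for i in range(5000)],
}
mix = build_mix(buckets)
counts = {skill: sum(1 for s, _ in mix if s == skill) for skill in buckets}
print(len(mix), counts)
```

The exact per-skill counts depend on the random top-up, but math ends with at least 4 000 examples, code with at least 3 000, and chat with at least 2 000.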

Armed with this balanced mix, we ran a series of ablations—removing WildChat, safety data, persona data, or math data—to quantify each component’s contribution.

**Figure 2.** The TÜLU 3 final SFT mix by source and length of the prompt plus completion in tokens (using the Llama 3 tokenizer). Compare this distribution to previous open SFT training datasets in Fig. 26. Datasets with the most instances are on the bottom of the histogram.

**Figure 3.** Average and selected skill-specific performance from training Llama 3.1 8B on our initial TÜLU 2 SFT mix, and our intermediate and final TÜLU 3 SFT mixes. Intermediate mixes 1, 2, and 3 were the result of adding new datasets to improve performance. Intermediate mixes 4 and 5 were the result of running multiple rounds of decontamination, causing small drops in performance.

Training the final 8B model required 6 hours on 32 × 8 × H100 GPUs; the 70B model took 50 hours on 64 GPUs, both using an effective batch size of 128 and a maximum sequence length of 4 096 tokens.

**Figure 4.** Average and skill-specific performance on stratified subsamples of our final SFT mix. We find that our full mix performs best overall.

Batch Aggregation

Fixing token‑averaged loss by switching to a sum loss restores proper sample weighting.

The Open‑Instruct Framework builds instruction‑response datasets for supervised finetuning. Its SFT models suffered a systematic gap because the default Transformer loss averages over all tokens, including padding.

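A mean loss divides by the number of tokens in whatever batch it is computed over, so the weight each token receives depends on how examples are grouped (for instance, under gradient accumulation or across devices). A sum loss removes the token denominator, so every token contributes the same amount however the batch is split.

Take Example A with two tokens and a total loss of 0.6, and Example B with one token and a loss of 0.4.

Mean loss over the whole batch: $L_{\text{mean}} = \frac{0.6+0.4}{2+1}= \frac{1.0}{3}\approx0.33$.

Mean loss computed per example and then averaged: $\frac{1}{2}\left(\frac{0.6}{2} + \frac{0.4}{1}\right)=\frac{0.30+0.40}{2}=0.35$.

Sum loss: $L_{\text{sum}} = 0.6 + 0.4 = 1.0$.

The whole‑batch mean weights every token equally, so Example A (twice as long) carries twice the weight of Example B. The per‑example mean instead contributes $0.30$ for A and $0.40$ for B, so B's single token counts as much as both of A's tokens together. The same data thus produces different effective weightings depending on how it is batched.

The sum loss has no denominator, so its token weighting stays the same regardless of batch composition.

Why isn’t simply scaling the learning rate enough to fix the token‑averaging bias?

Scaling the learning rate changes the step size uniformly but does not alter how the loss aggregates across examples. The bias originates from the denominator, which changes with how tokens are grouped, not from gradient magnitude; only removing that denominator (i.e., using a sum loss) makes the weighting independent of batching. Because the sum loss changes the overall loss scale, the learning rate is then re‑tuned.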

With the sum‑loss and the tuned $5.0\times10^{-6}$ learning rate, the Open‑Instruct SFT models match or exceed the performance of competing pipelines while training for only two epochs.
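To make the aggregation choices concrete, the sketch below computes three quantities for a two-example batch (Example A: two tokens, total loss 0.6; Example B: one token, loss 0.4). It compares a token-averaged loss over the whole batch, a per-example mean that is then averaged (as would happen if each example were its own micro-batch), and a plain sum. The function names and the even split of A's loss across its two tokens are assumptions made for illustration; this is not the Open-Instruct implementation.

```python
def batch_mean_loss(examples):
    """Token-averaged loss over the whole batch."""
    total_loss = sum(sum(tokens) for tokens in examples)
    total_tokens = sum(len(tokens) for tokens in examples)
    return total_loss / total_tokens

def accumulated_mean_loss(examples):
    """Mean loss per example, then averaged across examples
    (each example treated as its own micro-batch)."""
    per_example = [sum(tokens) / len(tokens) for tokens in examples]
    return sum(per_example) / len(per_example)

def sum_loss(examples):
    """Sum of all token losses: no denominator at all."""
    return sum(sum(tokens) for tokens in examples)

# Example A: two tokens, total loss 0.6 (even split is an assumption).
# Example B: one token, loss 0.4.
example_a = [0.3, 0.3]
example_b = [0.4]
batch = [example_a, example_b]

print(round(batch_mean_loss(batch), 4))
print(round(accumulated_mean_loss(batch), 4))
print(round(sum_loss(batch), 4))
```

The two mean variants disagree (about 0.333 versus 0.35) even though the data is identical, which is the batching sensitivity described above. The sum does not depend on how the examples are grouped.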

Data Ablations

Ablation results reveal which data choices most boost DPO performance.

We evaluate a suite of data‑centric ablations to isolate which design choices most affect downstream DPO performance.

Increasing the number of unique prompts yields a +4 % average gain across benchmarks.

Figure 8 shows a monotonic rise in performance as the unique‑prompt pool grows from 5 % to 100 % of the dataset.

Duplicating prompts (64 k → 383 k instances) provides essentially no gain, with a –0.2 % drop on average.

Figure 9 reports comparable scores for the 64 k and 383 k sets and slight degradations on DROP, GSM8k, and AlpacaEval.

Training with unused prompts (new data) outperforms reusing SFT prompts by +1.2 points on average.

Figure 10 compares “DPO w. Reused Prompts” (59.8) to “DPO w. New Prompts” (61.0) on the aggregated metric.

Adding on‑policy data raises the aggregated downstream score by +0.7 % relative to a purely off‑policy mix.

Figure 11 reports 58.2 (initial), 60.0 (off‑policy), 60.7 (on‑policy), and 61.2 (combined); the on‑policy increment is +0.7 % over off‑policy.

Among LLM judges, GPT‑4o achieves the highest average (57.3), edging out the next best by +0.1 %.

Table 17 lists GPT‑4o at 57.3 versus Llama 3.1 405 B at 57.2 and GPT‑4 Turbo at 57.1.

The final preference mix improves the 70 B model by +3.3 % over UltraFeedback alone.

The paper states that “the improvement is greater for the 70 B model (+3.3 vs. +1.8).”

Incorporating Persona IF raises average performance by +0.5 % and IFEval by +2 %.

Figure 13 shows the “+ Persona IF” bar modestly above the baseline across all metrics.

IF‑augmented‑verified boosts IFEval by +0.9 % but reduces average score by –0.3 %.

Figure 14 reports a slight IFEval increase for the “+ IF‑augmented‑verified” condition while the average line dips.

Regenerating original preference datasets via the synthetic pipeline adds +1.2 % to the aggregated score.

Figure 15 shows regenerated bars (teal) consistently above the original (pink) across MultiPref, Helpsteer2, and UltraFeedback.

**Figure 8.** Effect of scaling the size of the preference dataset, specifically the number of unique prompts, on downstream DPO model performance (AE: AlpacaEval).

**Figure 9.** Effect of scaling a preference dataset by duplicating prompts on downstream DPO performance using the Ultrafeedback dataset. All sizes have the same number of unique prompts (64k).

**Figure 10.** Effect of reusing prompts from SFT mix and new prompts from the same datasets subsampled for the SFT dataset mix.

**Figure 11.** Effect of including on-policy data during the Response Generation stage of the synthetic preference data pipeline on downstream DPO model performance.

**Table 17.** Performance of DPO models trained on preference annotations by different LLM judges. Due to the proximity of the numbers, we have not bolded the max per evaluation.

**Figure 13.** Adding persona preference data to the SFT Reused mix for DPO.

**Figure 14.** Performance of different IF-targeted preference mixes, average and IFEval. Best here consists of our final best mix for the 8B model (minus Persona-IF).

**Figure 15.** Comparing the use of the original completions to regenerating completions using our synthetic preference pipeline.

**Table 18.** Hyperparameters and algorithms examined for DPO tuning. We use UltraFeedback as the training dataset in all cases, and train on top of an early TÜLU 3 version. DPO-norm refers to the length-normalized DPO variant proposed in Meng et al. (2024). We explore hyperparameters suggested by prior work (Meng et al., 2024; Ivison et al., 2023). For PPO, we train reward models on UltraFeedback and reuse prompts during online training, following the hyperparameters in Ivison et al. (2024). We find that length-normalized DPO performs best overall.

**Table 19.** Learning rate ablations for the 70B DPO model, for two different preference mixes: Mix 1: Tülu-3-Persona-IF, Tulu-3-Helpsteer2, Ultrafeedback, Tulu-3-SFT-reused (On-policy), Mix 2: Best 70B Mix (both trained on an older SFT base).

**Table 20.** Final DPO Training Hyperparameters. We use the length-normalized variant of DPO proposed in Meng et al. (2024).

Collectively, these ablations guide the construction of the Tülu 3 preference mix: prioritize unique prompts, incorporate on‑policy data, select high‑quality judges like GPT‑4o, and apply length‑normalized DPO with the identified learning‑rate schedule.
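Length-normalized DPO, which the ablations above select, can be sketched for a single preference pair as follows. This is a minimal illustration that assumes the formulation in which each policy-to-reference log-ratio is divided by its response length before the usual DPO sigmoid loss. The function name, argument names, and the default `beta` are placeholders rather than values taken from the released code.

```python
import math

def length_normalized_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                               ref_chosen_logp, ref_rejected_logp,
                               chosen_len, rejected_len, beta=5.0):
    """Loss for one preference pair.

    The *_logp arguments are summed token log-probabilities of a full
    response; the *_len arguments are response lengths in tokens.
    """
    # Log-ratio of policy to reference, normalized by response length.
    chosen_term = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected_term = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    margin = beta * (chosen_term - rejected_term)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is zero and the loss is log(2).
print(length_normalized_dpo_loss(-10.0, -12.0, -10.0, -12.0, 5, 6))
```

Dividing by length keeps a long response from accumulating a large log-ratio simply because it has more tokens, which is the length bias the normalization is meant to counter.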

Reinforcement Learning with Verifiable Rewards

RLVR replaces a learned reward model with a deterministic verifier to train language models.

Standard RLHF relies on a learned reward model, which is opaque and costly to maintain. When the task admits a deterministic check—e.g., a math answer or a constraint‑satisfied instruction—this extra model is unnecessary.

RLVR treats a language model as a policy that receives a binary reward only when its generated answer can be verified as correct, eliminating the need for a learned reward model.

Suppose the ground truth for a math prompt is 5 and the policy samples the completion “5”.

The verifier evaluates $v(\text{prompt},\text{"5"})$ and returns $\alpha=10$ because $5$ matches the ground truth.

PPO computes the advantage $A = 10 - \beta\,\text{KL}$ and updates $\theta$ toward actions that produce “5”.

If the policy had output “4”, $v$ would return $0$, yielding a negative advantage after the KL term, pushing the policy away from “4”.

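Because the verifier is deterministic, a given prompt and completion always receive the same reward, either $\alpha$ or $0$. The learning signal therefore carries no reward‑model noise; the only remaining variance comes from sampling completions, which makes PPO updates easier to stabilize than with noisy reward‑model scores.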

How does RLVR differ from standard Direct Preference Optimization (DPO)?

DPO optimizes the policy directly on pairwise preference data, using an implicit reward and no online sampling. RLVR instead samples completions from the current policy, scores them with a deterministic verifier, and plugs that score into the PPO objective, so the policy is trained directly on a binary correctness signal.

**Figure 18.** An overview of how Reinforcement Learning with Verifiable Rewards (RLVR) works. We sample completions from a policy model given a set of prompts, and verify their correctness using a deterministic function. If the answer is verifiably correct, we provide reward of $\alpha$, otherwise 0. We then train against this reward using PPO.

RLVR data consist of prompts paired with domain‑specific verifiers. For GSM8K we extract the final numeric answer; for MATH we follow the “flex” evaluation; for IFEval we enforce constraint templates.
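A verifier of the GSM8K kind can be sketched as a function that pulls the final number out of a completion and compares it to the ground truth. The regular expression, the function names, and the default `alpha=10.0` are illustrative assumptions; the released verifiers may parse answers differently.

```python
import re

def extract_final_number(text):
    """Return the last number appearing in the text, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    # Drop thousands separators before converting.
    return float(matches[-1].replace(",", ""))

def verify_numeric(completion, ground_truth, alpha=10.0):
    """Return alpha if the final number matches the ground truth, else 0."""
    predicted = extract_final_number(completion)
    if predicted is None:
        return 0.0
    return alpha if abs(predicted - float(ground_truth)) < 1e-9 else 0.0

print(verify_numeric("2 + 3 = 5", "5"))
print(verify_numeric("The answer is 4", "5"))
```

Because the check is a pure function of the completion and the reference answer, the same pair always receives the same reward.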

**Table 22.** Summary of our verifiable prompt dataset. New datasets released with TÜLU 3 are color-coded for emphasis.

Sample a prompt $x$ from the mixed dataset.

Generate a completion $y\sim\pi_\theta(x)$.

Apply the verifier $v(x,y)$ to obtain reward $r\in\{0,\alpha\}$.

Compute the advantage $A = r - \beta\,\text{KL}[\pi_\theta\|\pi_{\text{ref}}]$.

Perform a PPO policy‑gradient update using $A$ and update the value head.

Periodically shuffle the prompt pool and repeat for many epochs.
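The reward and KL steps of the loop above can be sketched as follows. The PPO update and the value head are omitted, the exact-match verifier is a toy stand-in, and the KL term is a simple sample-based estimate (the sum of per-token log-ratios); all function names are hypothetical. The defaults `alpha=10.0` and `beta=0.05` follow the values quoted in the text.

```python
def exact_match_verifier(completion, ground_truth):
    """Toy verifier: correct only if the stripped completion equals the answer."""
    return completion.strip() == ground_truth.strip()

def kl_estimate(policy_logprobs, ref_logprobs):
    """Sample-based KL estimate: sum of per-token log-ratios."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def rlvr_signal(completion, ground_truth, policy_logprobs, ref_logprobs,
                alpha=10.0, beta=0.05):
    """Verifiable reward minus the KL penalty, i.e. r - beta * KL."""
    reward = alpha if exact_match_verifier(completion, ground_truth) else 0.0
    return reward - beta * kl_estimate(policy_logprobs, ref_logprobs)

# Illustrative per-token log-probabilities for a two-token completion.
policy_lp = [-0.1, -0.2]
ref_lp = [-0.3, -0.4]
print(rlvr_signal("5", "5", policy_lp, ref_lp))
print(rlvr_signal("4", "5", policy_lp, ref_lp))
```

With these inputs the KL estimate is about 0.4, so the correct completion scores roughly 9.98 and the incorrect one roughly -0.02: only the KL penalty remains when the verifier returns 0.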

Implementation details that proved crucial: (1) initialize the value model from a general reward model; (2) disable dropout for deterministic log‑probabilities; (3) penalize non‑EOS completions with –10; (4) whiten advantages; (5) train on the SFT dataset while shuffling epochs.
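Two of these details, the non-EOS penalty and advantage whitening, are easy to illustrate in isolation. The sketch below assumes that the penalty replaces the reward for completions that never emit EOS (whether it replaces or is added to the reward is an assumption) and that whitening means shifting to zero mean and scaling to unit variance over a batch. Function names are hypothetical.

```python
import statistics

def apply_eos_penalty(rewards, ended_with_eos, penalty=-10.0):
    """Use a fixed penalty when a completion never emitted EOS."""
    return [r if done else penalty for r, done in zip(rewards, ended_with_eos)]

def whiten(values, eps=1e-8):
    """Shift to zero mean and scale to unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

rewards = apply_eos_penalty([10.0, 0.0, 10.0], [True, True, False])
print(rewards)
print(whiten([1.0, 2.0, 3.0]))
```

Whitening keeps the scale of the update comparable across batches even when the fraction of verified completions changes during training.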

**Table 21.** The hyperparameters of PPO used for 1) optimizing against a general RM and 2) optimizing against the verifiable reward function. The differences between the hyperparameters are highlighted. The final 8B RLVR model used $\beta = 0.05$ and $\omega = 0.0$; the final 70B RLVR model used $\beta = 0.07$ and $\omega = 0.07$.

We evaluate four experimental axes: (i) per‑task sweeps of $\beta$; (ii) value‑model initialization ablation; (iii) adding verifiable rewards on top of reward‑model scores; (iv) starting from a weaker SFT checkpoint.

**Figure 16.** The average scores of PPO runs with different learning rate warm-up ratios $\omega$, KL penalty coefficient $\beta$. PPO can get similar (though slightly lower) average scores as DPO.

**Figure 19.** The top three rows show RLVR's verifiable rewards, KL divergence, and response lengths on the train dataset of GSM8K, MATH, and prompts with constraints, when starting from a DPO checkpoint (i.e. an experimental, not final DPO checkpoint). The bottom row shows the corresponding downstream test performance. RLVR can lead to higher verifiable rewards in the train datasets. Importantly, RLVR can also lead to higher scores in the corresponding test dataset, however, an increase in the average score across all evaluations is not guaranteed.

Key findings: (1) RLVR raises verifiable rewards on all three training sets and improves downstream test scores; (2) initializing the value head from a general reward model yields the best GSM8K results; (3) pure verifiable rewards outperform hybrid scores; (4) starting from DPO incurs less KL drift than starting from SFT; (5) overly small $\beta$ leads to over‑optimization and degraded overall scores.

**Figure 20.** The comparison of RLVR's performance on GSM8K between starting from a DPO checkpoint and starting from a weaker SFT checkpoint. We see that starting from both SFT and DPO can lead to the same level of verifiable rewards, but starting from SFT would incur a larger KL compared to starting from DPO when using the same $\beta$.

For infrastructure, we use ZeRO Stage 3 to fit policy, reference, and value models on a single node; inference GPUs host the reference policy, and Ray + vLLM’s PagedAttention reduces fragmentation. Training proceeds asynchronously: inference workers generate trajectories while a separate trainer consumes them, mitigating idle GPU time.
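The asynchronous pattern can be illustrated with a small producer and consumer built on Python's standard library. This is a conceptual sketch only: it uses threads and a bounded queue in place of Ray and vLLM, and the `generator` and `trainer` functions are placeholders that do no real generation or training.

```python
import queue
import threading

def generator(prompts, out_queue):
    """Stand-in for inference workers: produce one rollout per prompt."""
    for prompt in prompts:
        rollout = {"prompt": prompt, "completion": prompt.upper()}  # placeholder generation
        out_queue.put(rollout)
    out_queue.put(None)  # sentinel: no more rollouts

def trainer(in_queue, consumed):
    """Stand-in for the learner: consume rollouts as they arrive."""
    while True:
        rollout = in_queue.get()
        if rollout is None:
            break
        consumed.append(rollout["completion"])  # placeholder for a gradient step

# A small buffer bounds how far generation can run ahead of training.
rollouts = queue.Queue(maxsize=2)
consumed = []
producer = threading.Thread(target=generator, args=(["a", "b", "c"], rollouts))
consumer = threading.Thread(target=trainer, args=(rollouts, consumed))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(consumed)
```

The bounded queue is the design choice worth noting: it lets generation overlap with training while limiting how stale the rollouts can become relative to the current policy.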

**Figure 17.** The peak GPU memory allocated can be reduced by caching the reference policy's logprobs on the preference dataset and doing forward passes separately for the chosen and rejected pairs.

Final runs use the hyperparameters of Table 21 (e.g., $\beta=0.07$, $\omega=0.07$ for the 70 B model) and evaluate every 100 steps (40 steps for 70 B). The best 8 B checkpoints reach 89.4 % on GSM8K and 84.8 % on IFEval, while 70 B models show modest gains on MATH and IFEval but no GSM8K improvement, likely due to saturation.

Evaluation Framework

Open, reproducible suite measuring Tülu 3’s generalization and safety across seen and unseen tasks.

RLVR‑trained Tülu 3 models match or exceed state‑of‑the‑art averages across all benchmark families.

Table 23 shows the 8B RLVR checkpoint achieves an average score of 78.2, surpassing the DPO baseline (75.0) and Llama 3.1 (74.5).
