BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding

BERT enables deep bidirectional language representation by replacing unidirectional training with masked language modeling.

How can we pre-train a deep bidirectional Transformer to capture context from both directions simultaneously, rather than relying on unidirectional or shallowly concatenated representations?

Standard language models are unidirectional, forcing them to process text either left-to-right or right-to-left, which limits their ability to capture full context for complex tasks. BERT solves this by using a "masked language model" objective: it randomly hides tokens in a sequence and forces the model to predict them using both left and right context simultaneously. This deep bidirectionality allows a single pre-trained model to achieve state-of-the-art performance across eleven NLP tasks with minimal task-specific architecture changes.

Paper Primer

The core innovation is the shift from unidirectional generation to bidirectional encoding. By masking 15% of input tokens and training the model to recover them, BERT forces the Transformer to fuse context from both directions at every layer.

BERT significantly outperforms prior state-of-the-art models on the GLUE benchmark.

On the MNLI task, BERT-Large achieved 86.7% accuracy, a 4.6% absolute improvement over the previous best, and it raised the overall GLUE score by 7.7 points absolute.

Bidirectional pre-training is superior to shallow concatenation of unidirectional models.

Ablation studies show that a left-to-right model, even when augmented with a BiLSTM, performs worse than the masked bidirectional model on all tested tasks. Performance drops significantly when the masked bidirectional objective is replaced with a left-to-right one or when the next-sentence prediction task is removed.

Why does this approach matter for token-level tasks like question answering?

Token-level tasks require fine-grained output that depends on the entire context; unidirectional models are sub-optimal because they cannot incorporate information from both sides of a target token.

How does BERT handle tasks that involve pairs of sentences, such as entailment or question answering?

BERT packs sentence pairs into a single sequence separated by a special token, allowing the self-attention mechanism to model the relationship between the two sentences bidirectionally.

Researchers can now replace heavily-engineered, task-specific architectures with a single, pre-trained bidirectional model that requires only a simple output layer for fine-tuning.

Introduction and Motivation

We explain why unidirectional language models limit pre-trained representations and how BERT's deep bidirectional design removes that bottleneck.

Bidirectional Encoder Representations from Transformers (BERT) is introduced as a new language‑representation model that conditions on both left and right context in every layer, unlike recent models that process text strictly left‑to‑right or right‑to‑left.

Two prevailing ways to exploit pre‑trained representations are (i) feature‑based approaches that plug the embeddings into task‑specific architectures, and (ii) fine‑tuning approaches that update all pretrained parameters with a minimal output head. Both share a unidirectional language‑model objective during pre‑training.

The core limitation is the unidirectional bottleneck: a left‑to‑right model can only attend to preceding tokens, which hampers tasks that need full sentence context, such as question answering where the answer token may depend on tokens to its right.

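Assuming a sequence of $n = 128$ tokens, a single attention map holds $n^2 = 16{,}384$ scores. At 4 bytes per float32 score, memory per layer = $16{,}384 \times 4\text{ bytes} = 64\text{ KB}$.

With $L=12$ Transformer layers, total attention memory ≈ $12 \times 64\text{ KB} = 768\text{ KB}$.

When batch size $B=32$, the memory demand reaches $32 \times 768\text{ KB} = 24\text{ MB}$ per attention head, and it grows quickly with more heads and longer sequences.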

Even modest sequence lengths generate quadratic memory growth, motivating a representation that avoids materialising the full attention map.
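The arithmetic above can be reproduced with a short sketch. It assumes a sequence length of 128 tokens (so that one attention map holds 128² = 16,384 scores), 4-byte float32 scores, and one attention head per layer. The function name `attention_map_bytes` is hypothetical and is used only for illustration.

```python
def attention_map_bytes(seq_len, num_layers=1, batch_size=1, bytes_per_score=4):
    """Bytes needed to store one seq_len x seq_len attention map
    per layer and per example (single head, float32: both assumptions)."""
    return seq_len * seq_len * bytes_per_score * num_layers * batch_size


KB = 1024
MB = 1024 * KB

per_layer = attention_map_bytes(128)
all_layers = attention_map_bytes(128, num_layers=12)
full_batch = attention_map_bytes(128, num_layers=12, batch_size=32)

print(per_layer // KB, "KB per layer")
print(all_layers // KB, "KB for 12 layers")
print(full_batch // MB, "MB for a batch of 32")
```

Because the cost is quadratic in sequence length, calling `attention_map_bytes(512)` gives 16 times the value for length 128.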

Our contributions are threefold: (1) we demonstrate that bidirectional pre‑training yields substantially richer representations than shallow concatenations of left‑to‑right and right‑to‑left models; (2) we show that a single fine‑tuned BERT model eliminates the need for many heavily engineered task‑specific architectures; (3) we achieve new state‑of‑the‑art results on eleven NLP tasks, including a $7.7$‑point absolute gain on GLUE.

The shift from unidirectional to deep bidirectional representation unlocks both richer pre‑training and simpler fine‑tuning.

Prior Approaches to Pre-training

We overview the main pre‑training paradigms and contrast BERT, GPT, and ELMo.

Pre‑training has become a cornerstone of modern NLP. Early work focused on static word embeddings, while later efforts moved to contextual, sentence‑level models.

GPT is a left‑to‑right Transformer that learns to predict each token given all preceding tokens.

ELMo builds separate left‑to‑right and right‑to‑left LSTM language models and concatenates their hidden states to obtain token representations.

**Figure 3.** Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.
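One way to picture the contrast in Figure 3 is through attention visibility patterns. The sketch below is illustrative only: it builds the pattern for a left-to-right Transformer, where each position sees itself and earlier positions, and for a bidirectional one, where each position sees every position. The function names are hypothetical and do not come from the paper.

```python
def left_to_right_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


ltr = left_to_right_mask(4)
bi = bidirectional_mask(4)
print(visible_positions(ltr, 1))  # [0, 1]
print(visible_positions(bi, 1))   # [0, 1, 2, 3]
```

With the left-to-right pattern, position 1 never sees positions 2 and 3, which is the limitation the masked language model objective is designed to remove.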

These prior approaches illustrate the spectrum from static embeddings to fully bidirectional fine‑tuning, setting the stage for BERT’s unified architecture.

The BERT Architecture and Training

We detail BERT’s pre‑training tasks, encoder, and fine‑tuning pipeline.

We now describe BERT’s core mechanisms: a two‑stage pre‑training phase that learns bidirectional representations, and a fine‑tuning phase that adapts the same model to any downstream task.

The model predicts randomly masked tokens by attending to both left and right context, forcing it to encode rich bidirectional information.

Input after masking: $[t_1,\ \text{[MASK]},\ t_3,\ t_4]$.

The encoder produces hidden vectors $h_1,\ h_2,\ h_3,\ h_4$.

Apply a linear projection and softmax to $h_2$ to obtain a probability distribution over the vocabulary.

Compute cross‑entropy loss between the distribution and the true token $t_2$.

Back‑propagate the loss to update all parameters.

Masking forces the model to rely on surrounding context, so the learned representations become truly bidirectional.
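The prediction and loss steps above can be sketched for a single masked position. This is a minimal illustration, not the paper's implementation: the four-word vocabulary, the three-dimensional hidden vector, and the projection weights are made-up toy values, and `mlm_loss` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_loss(hidden, weights, bias, target_id):
    """Linear projection + softmax + cross-entropy for one masked position."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return -math.log(probs[target_id]), probs


# Toy values (assumptions): hidden vector h_2 and a 4-word vocabulary.
h2 = [0.5, -1.0, 2.0]
W = [[0.1, 0.0, 0.2],
     [0.0, 0.3, -0.1],
     [0.4, -0.2, 0.5],
     [-0.3, 0.1, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]

loss, probs = mlm_loss(h2, W, b, target_id=2)
print("predicted id:", probs.index(max(probs)), "loss:", round(loss, 4))
```

In a real model the gradient of this loss flows back through the projection and every encoder layer; here only the forward computation is shown.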

Why does MLM replace a masked token with a random word 10 % of the time?

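The [MASK] token never appears during fine-tuning, so always substituting it would create a mismatch between pre-training and fine-tuning. Occasional random replacement, together with leaving some selected tokens unchanged, prevents the model from relying on [MASK] itself and forces it to keep a contextual representation of every input token.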

During pre‑training the model learns to judge whether two sentences follow each other, which equips it with a sense of discourse coherence useful for downstream tasks.

Positive pair input: “[CLS] The dog barked . [SEP] It woke the neighbor . [SEP]”.

Negative pair input: “[CLS] The dog barked . [SEP] The sky is blue . [SEP]”.

Both inputs are processed; the [CLS] hidden vectors $C_{pos}$ and $C_{neg}$ are obtained.

A linear layer computes logits $z = W C + b$ for each pair.

Apply sigmoid to $z$ and compute binary cross‑entropy against the true label (1 for positive, 0 for negative).

NSP forces the encoder to model cross‑sentence dependencies, which later helps tasks like question answering that require reasoning over multiple sentences.
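The classification steps above can be sketched as follows. The [CLS] vectors, weights, and bias are toy values chosen for illustration, and `nsp_loss` is a hypothetical helper name; a real model would obtain $C$ from the encoder.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nsp_loss(cls_vector, weights, bias, is_next):
    """Binary cross-entropy on the logit z = W C + b."""
    z = sum(w * c for w, c in zip(weights, cls_vector)) + bias
    p = sigmoid(z)
    label = 1.0 if is_next else 0.0
    loss = -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
    return loss, p


# Toy [CLS] vectors for the positive and negative pair (assumptions).
C_pos = [1.0, 0.5]
C_neg = [-1.0, -0.5]
W = [2.0, 1.0]
b = 0.0

loss_pos, p_pos = nsp_loss(C_pos, W, b, is_next=True)
loss_neg, p_neg = nsp_loss(C_neg, W, b, is_next=False)
print(round(p_pos, 3), round(p_neg, 3))
```

A two-way softmax over IsNext and NotNext is an equivalent formulation of the same binary decision.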

Why not use a more fine‑grained sentence‑order task (e.g., predicting the exact position of $B$)?

Binary NSP is cheap to compute and already provides a strong signal about sentence continuity; finer‑grained ordering would increase training cost without clear downstream benefit.

Stacks of self‑attention layers let every token attend to all others, building contextualized vectors that capture both local and global information.

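Consider a toy encoder with three input tokens, hidden size $4$, two attention heads, and two layers. Project each token into query, key, and value vectors for each head (dimensions $4$ → $2$).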

Compute attention scores for head 1: $Q^{(1)}K^{(1)\top}$ yields a $3\times3$ matrix.

Apply softmax row‑wise, multiply by $V^{(1)}$ to obtain head 1 outputs.

Repeat for head 2, then concatenate the two head outputs (size $3\times4$) and apply the linear output projection.

Pass through the feed-forward network (linear → GELU → linear, matching the GELU activation BERT uses) and add residual connections.

Repeat the whole sub‑layer sequence for the second encoder layer.

Even with a tiny model, multi‑head attention distributes information across heads, showing how the full‑size BERT can capture diverse linguistic patterns.
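The attention steps of this toy walkthrough can be sketched in plain Python. The sketch assumes three tokens, hidden size 4, and two heads of size 2; the input values and the projection matrices (which simply select two of the four dimensions) are made up for illustration. It uses the standard scaling by $\sqrt{d_k}$ and covers only the attention sub-layer, not the feed-forward network or residual connections.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)       # 3 x 3, each row sums to 1
    return matmul(weights, v), weights   # 3 x 2 head output


def multi_head(x, heads, w_o):
    """Run every head, concatenate per token, apply the output projection."""
    outputs = [attention_head(x, *h)[0] for h in heads]
    concat = [sum((o[i] for o in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)


# Toy inputs and weights (assumptions).
x = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5],
     [1.0, 1.0, 0.0, 0.0]]
pick_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
pick_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
heads = [(pick_first,) * 3, (pick_last,) * 3]

out = multi_head(x, heads, identity)
_, head1_weights = attention_head(x, *heads[0])
print(len(out), "x", len(out[0]))
```

Each head here looks at a different two-dimensional slice of the input, which is a simplified picture of how heads can specialize.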

How does multi‑head attention differ from simply increasing the dimensionality of a single head?

Multiple heads keep each sub‑space low‑dimensional, which preserves computational efficiency and lets each head specialize; a single high‑dimensional head would mix all cues together and be more expensive.

After pre‑training, the same Transformer encoder is reused for any downstream task by attaching a task‑specific output layer and training all parameters on labeled data.

Pack the question and the passage into one sequence ([CLS] question [SEP] passage [SEP]), feed it through the encoder, and obtain a hidden vector for each token.

Extract the hidden state $C$ of the leading [CLS] token (used for classification) and the hidden states $T_i$ for each token.

Apply a token‑level output layer to $T_i$ to predict start and end positions of the answer span.

Compute cross‑entropy loss against the true start (token “brown”) and end positions.

Back‑propagate to update all parameters, including the pre‑trained encoder.

Fine‑tuning leverages the same bidirectional encoder for both the question and the passage, eliminating the need for a separate cross‑attention module.
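The span-prediction steps can be sketched with a start vector and an end vector that are each dotted with the token states $T_i$. The passage, the two-dimensional token states, and the start and end vectors below are toy assumptions, and the answer span "brown fox" is hypothetical; only the scoring and loss logic is the point.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(scores):
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def span_loss(token_states, start_vec, end_vec, true_start, true_end):
    """Sum of the cross-entropy losses for the start and end positions."""
    start_logp = log_softmax([dot(start_vec, t) for t in token_states])
    end_logp = log_softmax([dot(end_vec, t) for t in token_states])
    return -(start_logp[true_start] + end_logp[true_end])


def best_span(token_states, start_vec, end_vec):
    """Highest-scoring (start, end) pair with end >= start."""
    n = len(token_states)
    candidates = [
        (dot(start_vec, token_states[i]) + dot(end_vec, token_states[j]), i, j)
        for i in range(n) for j in range(i, n)
    ]
    _, i, j = max(candidates)
    return i, j


# Toy passage and token states (assumptions).
passage = ["the", "quick", "brown", "fox", "jumps"]
T = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.1]]
S, E = [1.0, 0.0], [0.0, 1.0]

start, end = best_span(T, S, E)
print(passage[start:end + 1])
print(round(span_loss(T, S, E, true_start=2, true_end=3), 4))
```

During fine-tuning, the gradient of `span_loss` would update the start and end vectors as well as the encoder that produces the token states.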

Why not freeze the pre‑trained encoder and only train the new output head?

Freezing discards the opportunity to adapt the rich contextual representations to the specifics of the downstream task; end‑to‑end fine‑tuning consistently yields better performance.

**Figure 1.** Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

**Figure 2.** BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
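The input construction in Figure 2 can be sketched as follows. The helper names `pack_pair` and `sum_embeddings` are hypothetical, the tokens are plain lowercase words rather than real WordPiece output, and the embedding vectors in the usage line are toy numbers.

```python
def pack_pair(tokens_a, tokens_b):
    """Build '[CLS] A [SEP] B [SEP]' with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def sum_embeddings(token_vec, segment_vec, position_vec):
    """Input embedding = token + segment + position, element-wise."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]


tokens, segment_ids, position_ids = pack_pair(
    ["the", "dog", "barked", "."],
    ["it", "woke", "the", "neighbor", "."],
)
print(tokens)
print(segment_ids)
print(sum_embeddings([1.0, 2.0], [10.0, 20.0], [100.0, 200.0]))
```

Because both sentences share one sequence, self-attention can relate any token in A to any token in B without a separate cross-attention module.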

With these components—MLM, NSP, a bidirectional encoder, and a simple fine‑tuning recipe—BERT provides a unified, high‑capacity language model that can be adapted to any downstream NLP task.

Empirical Results

BERT achieves state‑of‑the‑art results across GLUE, SQuAD, and SWAG benchmarks.

BERTLARGE attains a GLUE score of 80.5, a 7.7-point absolute gain over the previous state-of-the-art.

The official GLUE leaderboard lists BERTLARGE at 80.5 versus OpenAI GPT's 72.8; Table 1 gives the per-task breakdown.

GLUE aggregates a suite of diverse language‑understanding tasks so a single model’s performance can be compared across classification, similarity, and inference problems.

**Figure 4.** Illustrations of Fine-tuning BERT on Different Tasks.

BERTLARGE (ensemble) reaches 93.2 F1 on SQuAD 1.1 test, surpassing the previous best by 1.5 F1.

Table 2 reports 93.2 F1 for the BERTLARGE ensemble.

BERTLARGE attains 86.3% test accuracy on SWAG, a 27.1-point improvement over the ESIM+ELMo baseline.

Table 4 shows BERTLARGE at 86.3% versus ESIM+ELMo's 59.2%.

Ablation Studies

BERT’s masked language model trains a deep bidirectional Transformer that sees both left and right context.

We now isolate the contribution of each design choice by removing it and measuring the drop in downstream performance.

Removing next-sentence prediction (NSP) hurts QNLI and SQuAD (QNLI accuracy falls from 88.4 to 84.9 and SQuAD F1 from 88.5 to 87.9), while the left-to-right objective degrades most tasks further, with the largest drops on MRPC and SQuAD; adding a BiLSTM on top of the left-to-right model improves SQuAD substantially but still trails the full bidirectional model and lowers scores on the GLUE tasks.

**Figure 5.** Ablation over number of training steps. This shows the MNLI accuracy after fine-tuning, starting from model parameters that have been pre-trained for $k$ steps. The x-axis is the value of $k$.

Scaling the model (more layers, larger hidden size, more attention heads) consistently improves accuracy across all GLUE tasks, mirroring the reduction in language‑model perplexity shown in Table 6.

On CoNLL-2003 named entity recognition, feature-based BERT with the top four hidden layers concatenated reaches 96.1 Dev F1, only 0.3 F1 behind full fine-tuning, demonstrating that BERT works well for both paradigms.
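The feature-based variant can be sketched as a per-token concatenation of the top four layers' hidden vectors. The layer count, token count, and hidden size below are toy assumptions, and `concat_top_layers` is a hypothetical helper; in this setting the encoder weights stay frozen and only a downstream model is trained on the features.

```python
def concat_top_layers(layer_states, k=4):
    """layer_states[l][i] is the hidden vector of token i at layer l
    (index 0 is the lowest layer). Returns one feature vector per token."""
    top = layer_states[-k:]
    num_tokens = len(top[0])
    return [[value for layer in top for value in layer[i]]
            for i in range(num_tokens)]


# Toy encoder output: 6 layers, 2 tokens, hidden size 3.
# Each value encodes its layer index so the ordering is easy to inspect.
layer_states = [[[float(layer)] * 3 for _ in range(2)] for layer in range(6)]

features = concat_top_layers(layer_states)
print(len(features), "tokens x", len(features[0]), "features")
print(features[0])
```

With a hidden size of $H$, each token's feature vector has $4H$ entries.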

Implementation Details and Supplemental Data

The appendix provides implementation details, experimental setups, and extra ablations for BERT.

The appendix is divided into three parts: additional implementation details (Appendix A), extra experimental information (Appendix B), and further ablation studies (Appendix C).

Appendix A details the pre‑training tasks, the pre‑training pipeline, fine‑tuning settings, and compares BERT with ELMo and OpenAI GPT.

The masked language model uses a mixed masking strategy: 80 % replace with [MASK], 10 % replace with a random token, and 10 % keep unchanged, encouraging the encoder to maintain contextual representations for all tokens.

Random replacement therefore affects only 1.5% of all tokens (10% of the 15% selected), which does not appear to degrade language understanding. Because only 15% of tokens are predicted in each batch, more pre-training steps are needed for convergence than with left-to-right models.
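The mixed masking strategy can be sketched as below. This is an illustration under stated assumptions: each token is selected independently with probability 0.15 (a simplification, since an implementation could instead pick a fixed number of positions per sequence), the vocabulary is a toy list, and `mask_tokens` is a hypothetical helper name.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, {position: original token to predict})."""
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            continue
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets


rng = random.Random(0)
vocab = ["my", "dog", "is", "hairy", "apple"]
tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
corrupted, targets = mask_tokens(tokens, vocab, rng)
print(corrupted, targets)

# Expected share of all tokens that receive a random replacement.
random_share = 0.15 * 0.10
```

The loss is computed only at the positions stored in `targets`, regardless of which of the three corruption branches was taken.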

The next sentence prediction task creates sentence pairs, labeling them as IsNext or NotNext, using the [CLS] token for classification.

During pre-training, two text spans are sampled and assigned the A and B segment embeddings, with B being the true next span half the time. The combined sequence is at most 512 tokens, and masking is applied at a uniform 15% rate after WordPiece tokenization.
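The pair-sampling step can be sketched as follows. The toy documents are invented, `sample_nsp_pair` is a hypothetical helper, and the sketch works on whole sentences rather than the longer text spans used in practice.

```python
import random


def sample_nsp_pair(documents, rng):
    """Return (sentence_a, sentence_b, is_next) with a 50/50 label split."""
    doc = rng.choice([d for d in documents if len(d) >= 2])
    idx = rng.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if rng.random() < 0.5:
        return sentence_a, doc[idx + 1], True       # IsNext
    other = rng.choice([d for d in documents if d is not doc])
    return sentence_a, rng.choice(other), False     # NotNext


# Toy corpus (assumption): each document is a list of sentences.
documents = [
    ["the dog barked .", "it woke the neighbor .", "she was not happy ."],
    ["the sky is blue .", "clouds drift by ."],
    ["bert is a model .", "it uses transformers ."],
]

rng = random.Random(0)
pairs = [sample_nsp_pair(documents, rng) for _ in range(4000)]
is_next_share = sum(1 for _, _, label in pairs if label) / len(pairs)
print(round(is_next_share, 3))
```

Negative examples are drawn from a different document so that sentence B is unlikely to be a plausible continuation of A.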

Training runs with batch size 256 sequences (128k tokens), for 1M steps (~40 epochs) on a 3.3B-word corpus, using Adam (learning rate $1\times10^{-4}$), dropout 0.1, and GELU activation, on Cloud TPUs (4 for BERTBASE, 16 for BERTLARGE).

To reduce cost, 90 % of steps use sequence length 128 and the remaining 10 % use length 512, allowing the model to learn positional embeddings efficiently.
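The schedule and batch figures above can be checked with a few lines of arithmetic. The function name `sequence_length_schedule` is hypothetical, and the epoch estimate is deliberately rough: it treats every step as a full 512-token batch and equates WordPiece tokens with corpus words.

```python
def sequence_length_schedule(total_steps, short_len=128, long_len=512,
                             short_fraction=0.9):
    """Split training into a short-sequence phase and a long-sequence phase."""
    short_steps = round(total_steps * short_fraction)
    return [(short_steps, short_len), (total_steps - short_steps, long_len)]


schedule = sequence_length_schedule(1_000_000)

# 256 sequences of up to 512 tokens each, roughly the "128k tokens" per batch.
tokens_per_batch = 256 * 512

# Attention cost grows quadratically with length, so a 512-token
# sequence costs about 16x the attention compute of a 128-token one.
attention_cost_ratio = (512 / 128) ** 2

# Rough epoch count over a 3.3B-word corpus (see the caveats above).
approx_epochs = 1_000_000 * tokens_per_batch / 3.3e9

print(schedule)
print(tokens_per_batch, attention_cost_ratio, round(approx_epochs, 1))
```

The quadratic ratio is why spending most steps at length 128 saves so much compute while still leaving steps to train the longer position embeddings.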

Fine‑tuning reuses most hyperparameters but varies batch size (16 or 32), learning rate (2e‑5 to 5e‑5), and epochs (2–4), with larger datasets being less sensitive to these choices.
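A small search over these fine-tuning settings can be enumerated as below. This is only a sketch of the grid: the value 3e-5 is an assumed intermediate point inside the stated 2e-5 to 5e-5 range, and the dictionary keys are illustrative names, not a real training API.

```python
from itertools import product

BATCH_SIZES = [16, 32]
LEARNING_RATES = [2e-5, 3e-5, 5e-5]  # 3e-5 is an assumed midpoint value
EPOCHS = [2, 3, 4]

grid = [
    {"batch_size": batch_size, "learning_rate": lr, "epochs": epochs}
    for batch_size, lr, epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCHS)
]

print(len(grid), "configurations")
print(grid[0])
```

Because fine-tuning is fast, running every configuration and keeping the one with the best development-set score is a practical strategy, especially for small datasets that are more sensitive to these choices.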

A comparison of BERT, ELMo, and OpenAI GPT highlights architectural differences, pre‑training corpora, token handling, batch sizes, and learning‑rate strategies, showing that BERT’s bidirectionality and dual tasks drive most performance gains.

Figure 4 illustrates fine‑tuning on four tasks, showing that BERT adds only a small task‑specific output layer while reusing its deep encoder.

Appendix B describes the GLUE benchmark datasets (MNLI, QQP, QNLI, SST‑2, CoLA, STS‑B, MRPC, RTE) and notes that WNLI is excluded for fairness.

Table 8 reports an ablation over masking strategies, with rows for different MASK/SAME/RND percentages and columns for dev results on MNLI fine‑tune and NER (fine‑tune and feature‑based).

**Table 8.** Ablation over different masking strategies.

Appendix C presents two additional ablations: the impact of pre‑training step count and the effect of different masking procedures.

Increasing pre‑training steps from 500 k to 1 M yields roughly a 1 % gain on MNLI, confirming that large step counts benefit BERTBASE.

MLM pre‑training converges slightly slower than left‑to‑right training, yet it surpasses the latter in accuracy early in training.

Ablation of masking procedures confirms that the mixed strategy (80 % MASK, 10 % SAME, 10 % RND) balances performance across fine‑tuning and feature‑based NER tasks.
