An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

Vision Transformers (ViT) match state-of-the-art image recognition by treating image patches as sequence tokens.

Can we apply a standard Transformer architecture directly to image patches for classification, and does it scale effectively compared to CNNs?

Computer vision relies on convolutional architectures that bake in spatial assumptions, but these models struggle to scale as efficiently as Transformers do in language tasks. The authors treat an image as a sequence of flattened patches, feeding them into a standard Transformer encoder with minimal modifications. This approach discards most image-specific inductive biases, relying instead on large-scale pre-training to learn spatial relationships from scratch. When pre-trained on massive datasets like JFT-300M, this Vision Transformer (ViT) matches or beats state-of-the-art convolutional networks while requiring significantly less compute to train.

Paper Primer

The core move is to reshape an image into a sequence of patches, projecting them into a latent space just like words in a sentence. The model then uses standard multi-head self-attention to integrate information globally across the entire image, even in the lowest layers.

Large-scale pre-training overcomes the lack of inherent spatial inductive bias in Transformers.

ViT-H/14 achieves 88.55% accuracy on ImageNet, outperforming the previous state-of-the-art ResNet-based Big Transfer (BiT) models. ViT models require 2–4× less compute to reach the same performance levels as comparable convolutional networks.

While ViT underperforms on smaller datasets like ImageNet without heavy regularization, it scales effectively as data volume increases. The model's ability to learn spatial topology from scratch is confirmed by its learned position embeddings, which naturally recover 2D image structure.

How does this approach keep the quadratic cost of self-attention manageable when applying it to images?

Self-attention remains quadratic in sequence length, but splitting the image into fixed-size patches (e.g., 16x16) rather than individual pixels shrinks that length dramatically: a 224x224 image becomes 196 patch tokens instead of 50,176 pixel tokens, making global self-attention computationally feasible.

What is the role of the "hybrid" architecture mentioned in the paper?

Hybrids use a convolutional backbone to extract feature maps before feeding them into the Transformer. They slightly outperform pure ViT at smaller computational budgets, though this advantage disappears as the model scales.

Researchers can now treat image recognition as a sequence modeling problem, shifting the focus from designing complex convolutional architectures to scaling Transformer-based models on large datasets.

Introduction

We expose why CNN priors hinder scaling and how pure Vision Transformers overcome this.

Transformers have reshaped natural‑language processing, yet vision still leans on convolutional networks whose built‑in inductive biases—translation equivariance and locality—constrain how far models can scale.

Convolutional layers assume that nearby pixels are related (locality) and that the same features matter everywhere in the image (translation equivariance). These assumptions help when data are scarce, but the paper argues that with massive datasets, sheer data volume outweighs such handcrafted priors; a pure Transformer discards most of them and lets scale do the heavy lifting.

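Worked example: a $224\times224$ image split into $16\times16$ patches.

Number of patches per side: $224/16=14$.

Total tokens: $14\times14=196$ (ignoring the extra classification token).

Attention matrix entries: $196^2=38{,}416$.

Memory: $38{,}416\times4\text{ bytes}\approx0.15\,$MB per attention matrix, assuming 32‑bit floats.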

This tiny example shows that even modest‑resolution images keep the attention map manageable, but memory grows quadratically—doubling the token count quadruples the matrix size, quickly becoming a bottleneck for higher resolutions.
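The arithmetic above can be reproduced with a short sketch. The function name `attention_matrix_stats` is hypothetical, and the 4 bytes per entry (32-bit floats), a single attention matrix, and the omission of the classification token are assumptions carried over from the worked example.

```python
def attention_matrix_stats(image_size: int, patch_size: int, bytes_per_entry: int = 4):
    """Return (tokens, entries, megabytes) for one attention matrix.

    Assumes a square image, non-overlapping square patches, and no class token.
    """
    per_side = image_size // patch_size
    tokens = per_side * per_side
    entries = tokens * tokens                 # every token attends to every token
    megabytes = entries * bytes_per_entry / 1e6
    return tokens, entries, megabytes


# Doubling the image side quadruples the token count and
# multiplies the number of attention entries by sixteen.
for size in (224, 448):
    tokens, entries, mb = attention_matrix_stats(size, patch_size=16)
    print(f"{size}px: {tokens} tokens, {entries:,} entries, {mb:.2f} MB")
```

For the 224-pixel case this gives 196 tokens and 38,416 entries, matching the figures above.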

The key shift is moving from CNN‑specific priors to generic Transformer scaling, which unlocks higher performance with far less hand‑crafted design.

Related Work and Context

Contextualizes prior vision Transformers and positions our ResNet baseline.

Vision research has rapidly adopted the Transformer architecture, yet most prior work either modifies attention to reduce its quadratic cost or couples it with convolutional backbones.

ResNet is a deep convolutional network that inserts identity shortcuts so each layer learns a residual correction rather than a full transformation, which keeps gradients flowing and enables very deep models.

Transformer (Vaswani et al., 2017): introduced the original multi‑headed self‑attention architecture for machine translation, establishing the foundation for all subsequent Transformer models.

BERT (Devlin et al., 2019): uses a masked language modeling objective to learn bidirectional representations that can be fine‑tuned for downstream tasks.

GPT: trains a Transformer decoder to predict the next token, scaling up to billions of parameters and demonstrating emergent capabilities.

Applying naïve self‑attention directly to image pixels requires each of the $H\!\times\!W$ pixels to attend to every other pixel, yielding $O((HW)^2)$ memory and compute.

Local self‑attention: restricts each query pixel to attend only within a fixed spatial neighborhood, reducing the attention matrix size.

Attention in place of convolutions: uses multi‑headed dot‑product self‑attention blocks in place of traditional convolutional layers.

Sparse attention: introduces sparsity patterns (e.g., fixed‑pattern or learned) to approximate global self‑attention with sub‑quadratic cost.

Block‑wise and axial attention: applies attention within non‑overlapping blocks or along a single spatial axis to cut compute.

Patch‑based full attention: extracts $2\times2$ patches from an image and feeds the full set of patches into a standard Transformer.

CNN and attention combinations: integrate attention modules into convolutional pipelines for classification, detection, video, and multimodal tasks.

Image GPT: trains a Transformer on sequences of reduced‑resolution pixel values, then uses the learned representations for downstream classification.

Larger datasets: use data sources beyond standard ImageNet (e.g., ImageNet‑21k with about 14 million images, JFT‑300M with about 300 million) to push the limits of model performance.

Dataset‑size scaling studies: empirically investigate how convolutional network performance scales with dataset size.

Large‑scale CNN transfer: explores fine‑tuning CNNs pretrained on ImageNet‑21k or JFT‑300M for downstream tasks.

The Vision Transformer Architecture

We detail how ViT turns image patches into a Transformer sequence and processes them.

Standard vision models bake locality and translation equivariance into every layer, which limits their ability to scale to massive data. ViT removes most of that hand‑crafted bias, but the resulting model must still learn spatial structure from scratch.

Each image patch is flattened and linearly projected into the Transformer latent space, then a learned position vector is added.

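Consider a toy $4\times4$ RGB image split into four $2\times2$ patches, with latent dimension $D=8$.

Flatten patch 1 (top‑left) → vector of length $2^{2}\!\times\!3=12$.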

Multiply by $E$ (size $12\times8$) → token $t_{1}\in\mathbb{R}^{8}$.

Add position embedding $e_{\text{pos},1}\in\mathbb{R}^{8}$ → final token $z_{1}=t_{1}+e_{\text{pos},1}$.

Repeat for patches 2‑4, obtaining $z_{2},z_{3},z_{4}$.

Prepend classification token $x_{\text{class}}\in\mathbb{R}^{8}$ → sequence $[x_{\text{class}},z_{1},z_{2},z_{3},z_{4}]$.

The linear projection treats every spatial location identically; all spatial reasoning must therefore emerge from the subsequent self‑attention layers.
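The walkthrough above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`extract_patches`, `project`) are hypothetical, the weights are random stand-ins for learned parameters, and the toy sizes (a $4\times4$ RGB image, $2\times2$ patches, $D=8$) are assumed from the example. It follows the walkthrough in adding position embeddings to the patch tokens only.

```python
import random


def extract_patches(image, patch):
    """Split an H x W x C nested-list image into flattened patches (row-major order)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])   # append the C channel values of one pixel
            patches.append(flat)
    return patches


def project(vector, weights):
    """Multiply a length-n vector by an n x d matrix."""
    return [sum(v * weights[i][j] for i, v in enumerate(vector))
            for j in range(len(weights[0]))]


rng = random.Random(0)
H = W = 4   # toy image size (assumption)
P = 2       # patch size
C = 3       # RGB channels
D = 8       # latent dimension

image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
E = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]        # 12 x 8
pos = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range((H // P) * (W // P))]
x_class = [0.0] * D                                                           # learned in practice

patches = extract_patches(image, P)                 # 4 vectors of length 12
tokens = [project(p, E) for p in patches]           # 4 vectors of length 8
z = [[t + e for t, e in zip(tok, pe)] for tok, pe in zip(tokens, pos)]
sequence = [x_class] + z                            # 5 vectors of length 8

print(len(patches), len(patches[0]), len(sequence), len(sequence[0]))
```

The same matrix `E` is applied to every patch, which is why the projection itself carries no information about where a patch came from.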

How does Patch Embedding differ from a standard convolutional layer?

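A standard convolution slides a small kernel over the image with overlapping windows, mixing neighboring pixels locally and growing the receptive field layer by layer. Patch Embedding is equivalent to a single convolution whose kernel size and stride both equal the patch size: each patch is flattened independently and passed through the same linear map. The weights are therefore shared across patches, but patches do not overlap and the receptive field of the embedding never extends beyond one patch.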

Each encoder layer first lets every token attend to all others (global mixing), then refines each token with a small feed‑forward network; layer‑norm and residual shortcuts keep training stable.

Consider two tokens, CLS and one patch, each of dimension $D=4$.

LN normalizes each token separately, giving it zero mean and unit variance across its features.

MSA computes attention scores: each token attends to both, producing weighted sums $a_{\text{CLS}}$ and $a_{1}$.

Add residual: $z_{\text{MSA}} = [\text{CLS}+a_{\text{CLS}},\; \text{patch}+a_{1}]$.

Second LN normalizes $z_{\text{MSA}}$.

MLP expands each token to $4D=16$, applies GELU, projects back to $4$.

Add second residual to obtain final layer output $z^{(1)}$.

Even with only two tokens, the attention step lets the classification token instantly gather information from the patch, illustrating the global mixing property.
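The steps above can be sketched as a single pre-norm encoder layer in plain Python. This is a simplified illustration under stated assumptions: one attention head instead of several, no biases, no learned LayerNorm scale or shift, no output projection after attention, and random weights in place of trained ones. All function and parameter names are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize one token to zero mean and unit variance across its features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(x, w):
    """Multiply a length-n vector by an n x d matrix."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product attention over all tokens."""
    q = [matvec(t, wq) for t in tokens]
    k = [matvec(t, wk) for t in tokens]
    v = [matvec(t, wv) for t in tokens]
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out


def encoder_layer(tokens, params):
    # LN -> attention -> residual
    attn = self_attention([layer_norm(t) for t in tokens],
                          params["wq"], params["wk"], params["wv"])
    z_msa = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn)]
    # LN -> MLP (expand to 4D, GELU, project back) -> residual
    out = []
    for t in z_msa:
        hidden = [gelu(h) for h in matvec(layer_norm(t), params["w1"])]
        out.append([a + b for a, b in zip(t, matvec(hidden, params["w2"]))])
    return out


rng = random.Random(0)
D = 4


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]


params = {"wq": rand_matrix(D, D), "wk": rand_matrix(D, D), "wv": rand_matrix(D, D),
          "w1": rand_matrix(D, 4 * D), "w2": rand_matrix(4 * D, D)}
tokens = [[0.1, -0.2, 0.3, 0.0],    # stand-in for the CLS token
          [1.0, 0.5, -0.5, 2.0]]    # stand-in for the patch token
z1 = encoder_layer(tokens, params)
print(len(z1), len(z1[0]))
```

Because both sub-blocks are wrapped in residual connections, setting every weight matrix to zero makes the layer return its input unchanged.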

Why does ViT use pre‑norm (LN before each block) instead of the original post‑norm?

Pre‑norm keeps the signal entering each sub‑module well‑scaled, which prevents the exploding/vanishing gradients that can arise when many layers are stacked. Empirically it yields more stable training for deep Transformers.
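As a rough sketch of the difference, the two orderings can be written side by side. The pre‑norm form follows the encoder description above; the post‑norm form is the ordering used in the original Transformer. The layer index $\ell$ and the intermediate $z'_\ell$ are notation introduced here for illustration.

$$
\text{pre-norm:}\qquad z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell
$$

$$
\text{post-norm:}\qquad z'_\ell = \mathrm{LN}(z_{\ell-1} + \mathrm{MSA}(z_{\ell-1})), \qquad z_\ell = \mathrm{LN}(z'_\ell + \mathrm{MLP}(z'_\ell))
$$

In the pre‑norm form the residual path from $z_{\ell-1}$ to $z_\ell$ is a plain sum, with no normalization applied along it.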

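ViT stacks $L$ identical encoder layers on the patch token sequence, then reads out the classification token with a small classification head (an MLP with one hidden layer during pre‑training, a single linear layer during fine‑tuning) for image‑level predictions.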

Patch Embedding produces one token $t_{1}$ (plus $x_{\text{class}}$) → sequence $[x_{\text{class}},t_{1}]$.

Add position embeddings (two rows) → $z_{0}$.

Run one encoder layer: MSA mixes $x_{\text{class}}$ and $t_{1}$, residual adds back.

MLP refines each token, residual adds again.

Extract the updated $x_{\text{class}}$ and apply the linear head to obtain logits for $K$ classes.

Even with a single patch the classification token can attend to it, showing that the model does not rely on a spatial grid beyond the initial embedding.

Is the classification token just another token, or does it have a special role?

It is structurally identical to other tokens (same dimension, participates in attention), but its final representation is the only one read by the downstream linear head. During training it learns to gather global information because only its vector is used for classification.

**Figure 1.** Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence. The illustration of the Transformer encoder was inspired by Vaswani et al. (2017).

Experimental Setup

Experimental setup compares ViT, ResNet, and hybrids across datasets and compute budgets.

We evaluate three families on a suite of image classification benchmarks: standard ResNet (BiT), pure Vision Transformer (ViT), and a hybrid that feeds ResNet feature maps into ViT.

Pre‑training datasets span three scales: ImageNet‑1k (1.3 M images, 1 k classes), its superset ImageNet‑21k (14 M images, 21 k classes), and JFT (303 M high‑resolution images, 18 k classes). All are de‑duplicated against downstream test sets.

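ViT configurations are based on those used for BERT: “Base” and “Large” are adopted directly from BERT, and a larger “Huge” model is added. Models are denoted, for example, ViT‑B/16, ViT‑L/16, and ViT‑H/14, where the suffix is the patch size in pixels. Patch size controls sequence length: the number of tokens is inversely proportional to the square of the patch size, so smaller patches yield longer sequences and higher compute.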
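To make the effect of patch size concrete, the sketch below computes the input sequence length for a few patch sizes. The function name `sequence_length` is hypothetical, and the $224\times224$ input resolution and the single extra classification token are assumptions for illustration.

```python
def sequence_length(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of tokens for a square image cut into non-overlapping square patches."""
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)


for name, patch in (("/32", 32), ("/16", 16), ("/14", 14)):
    print(f"patch size {name}: {sequence_length(224, patch)} tokens")
```

Halving the patch size from 32 to 16 roughly quadruples the sequence length (50 versus 197 tokens, including the classification token).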

ResNet baselines replace Batch Normalization with Group Normalization and use standardized convolutions, matching the “BiT” variant. Hybrids feed the output of a ResNet stage into ViT with a patch size of one “pixel” of the feature map. The feature map comes either from stage 4 of a regular ResNet50 or from an extended stage 3 (stage 4 removed and its layers moved into stage 3); the second option produces a 4× longer token sequence.

All models are pre‑trained with Adam ($\beta_1=0.9$, $\beta_2=0.999$), a batch size of 4096, and a high weight decay of 0.1. Fine‑tuning uses SGD with momentum and a batch size of 512. For the ImageNet results in Table 2, ViT‑L/16 and ViT‑H/14 are fine‑tuned at higher input resolution (512 px and 518 px respectively) with Polyak averaging (factor 0.9999).

We report two families of downstream metrics: (1) fine‑tuning accuracy after full model adaptation, and (2) few‑shot linear accuracy, obtained by solving a closed‑form regularized least‑squares (ridge) regression that maps frozen representations to $\{-1,1\}^K$ target vectors.
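The few-shot evaluation can be sketched as a closed-form ridge regression on frozen features. This is an illustrative toy, not the paper's evaluation code: the helper names (`solve`, `fit_ridge`, `predict`), the regularization strength `lam`, the $\pm1$ one-vs-rest target encoding, and the two-dimensional features are all assumptions.

```python
def solve(a, b):
    """Solve a @ x = b (a is n x n, b is n x k) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))  # partial pivoting
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_ridge(features, labels, num_classes, lam=0.1):
    """Return W (d x K) minimizing ||X W - T||^2 + lam * ||W||^2 with targets in {-1, +1}."""
    d = len(features[0])
    targets = [[1.0 if k == y else -1.0 for k in range(num_classes)] for y in labels]
    gram = [[sum(f[i] * f[j] for f in features) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]                   # X^T X + lam * I
    xty = [[sum(f[i] * t[k] for f, t in zip(features, targets))
            for k in range(num_classes)] for i in range(d)]          # X^T T
    return solve(gram, xty)


def predict(weights, feature):
    scores = [sum(feature[i] * weights[i][k] for i in range(len(feature)))
              for k in range(len(weights[0]))]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy "frozen representations": two examples per class.
features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
labels = [0, 0, 1, 1]
W = fit_ridge(features, labels, num_classes=2)
print(predict(W, [0.8, 0.2]), predict(W, [0.2, 0.9]))
```

Because the solution is closed-form, the backbone is never updated and the evaluation is cheap enough to run on the fly during pre-training.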

Performance Benchmarks

Vision Transformers replace convolutional biases with patch‑based tokens, enabling large‑scale training.

Vision Transformers replace convolutional inductive biases with patch‑based token sequences, so they can benefit from the same large‑scale pre‑training recipe that drives language models.

ViT‑H/14 reaches 88.55% ImageNet top‑1 accuracy while using only 2.5k TPUv3‑core‑days for pre‑training, a gain of 1.01 percentage points over the BiT‑L baseline, which required 9.9k core‑days.

Table 2 reports 88.55 ± 0.04% vs 87.54 ± 0.02% and compute 2.5k vs 9.9k.
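A quick check of what the Table 2 values quoted above imply, using only the mean accuracies and the rounded core-day figures; the variable names are illustrative.

```python
# Mean ImageNet top-1 accuracy (%) and pre-training cost (TPUv3-core-days)
# as quoted from Table 2 in the text above.
vit_h14 = {"top1": 88.55, "core_days": 2_500}
bit_l = {"top1": 87.54, "core_days": 9_900}

gain = vit_h14["top1"] - bit_l["top1"]                     # percentage points
compute_ratio = bit_l["core_days"] / vit_h14["core_days"]  # how many times more compute BiT-L used

print(f"accuracy gain: {gain:.2f} points, compute ratio: {compute_ratio:.2f}x")
```

With these inputs the accuracy gap is 1.01 percentage points and BiT-L uses roughly four times the pre-training compute.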

BiT pre‑trains a very large ResNet on a massive supervised dataset, then fine‑tunes the learned weights on downstream vision tasks.

**Table 2.** Comparison with state of the art on popular image classification benchmarks. We report mean and standard deviation of the accuracies, averaged over three fine-tuning runs. Vision Transformer models pre-trained on the JFT-300M dataset outperform ResNet-based baselines on all datasets, while taking substantially less computational resources to pre-train. ViT pre-trained on the smaller public ImageNet-21k dataset performs well too. *Slightly improved 88.5% result reported in Touvron et al. (2020).

**Figure 2.** Breakdown of VTAB performance in Natural, Specialized, and Structured task groups

Scaling and Data Requirements

Analyzes how dataset size and compute affect Vision Transformer performance.

We first examine how the amount of pre‑training data influences Vision Transformer (ViT) performance, then conduct a controlled scaling study that isolates compute as the bottleneck.

JFT‑300M is a large proprietary dataset of roughly 300 million labeled images, used here to supply the data volume that large‑scale vision models need.

When pre‑trained on increasingly large datasets (ImageNet → ImageNet‑21k → JFT‑300M) and fine‑tuned on ImageNet, ViT‑Large underperforms ViT‑Base with ImageNet pre‑training, roughly matches it with ImageNet‑21k, and clearly surpasses it only with JFT‑300M, indicating that the larger model’s advantage materialises only with sufficient data.

**Figure 3.** Transfer to ImageNet. While large ViT models perform worse than BiT ResNets (shaded area) when pre-trained on small datasets, they shine when pre-trained on larger datasets. Similarly, larger ViT variants overtake smaller ones as the dataset grows.

We further train models on random subsets of the JFT‑300M corpus (9 M, 30 M, 90 M, full) with identical hyper‑parameters and report few‑shot linear accuracy. ViTs over‑fit more than ResNets of comparable compute on the 9 M subset, yet outperform them on subsets of 90 M images or more.

The scaling study isolates compute by evaluating all models after pre‑training on the full JFT‑300M for a fixed number of epochs. The model family includes seven ResNets, six Vision Transformers, and five hybrids.

**Figure 5.** Performance versus pre-training compute for different architectures: Vision Transformers, ResNets, and hybrids. Vision Transformers generally outperform ResNets with the same computational budget. Hybrids improve upon pure Transformers for smaller model sizes, but the gap vanishes for larger models.

Vision Transformers reach comparable ImageNet transfer accuracy with roughly 2–4× less pre‑training compute than ResNets.

Figure 5 shows ViT models attaining similar accuracy at lower compute across five downstream datasets.

Hybrid models slightly outperform Vision Transformers at low compute budgets, but the advantage disappears as compute increases.

Figure 5’s left panel (average‑5) shows hybrids ahead of ViTs for the smallest compute points, with the curves converging for larger budgets.

Internal Representation Analysis

We examine how ViT’s internal representations and attention behave across layers.

Understanding ViT requires peering inside its layers: how patches are embedded, how position information is encoded, and how self‑attention spreads information across the image.

We render the attention weights from a given output token back onto the input image, revealing which pixels the model relies on for that decision.

**Figure 6.** Representative examples of attention from the output token to the input space. See Appendix D.7 for details.

**Figure 7.** **Left:** Filters of the initial linear embedding of RGB values of ViT-L/32. **Center:** Similarity of position embeddings of ViT-L/32. Tiles show the cosine similarity between the position embedding of the patch with the indicated row and column and the position embeddings of all other patches. **Right:** Size of attended area by head and network depth. Each dot shows the mean attention distance across images for one of 16 heads at one layer. See Appendix D.7 for details.

Additional Analyses

We examine optimizer choices, scaling trade‑offs, and architectural limits of ViT.

Vision Transformers drop most of the convolutional inductive bias of ResNets, so their performance depends more on scale (data, model size, and compute) than on handcrafted priors.

Adam‑pre‑trained ResNets achieve higher average fine‑tuning accuracy than SGD‑pre‑trained counterparts.

Table 7 shows Adam averages 89.33 % versus 88.79 % for SGD across five datasets.

**Table 7.** Fine-tuning ResNet models pre-trained with Adam and SGD.

**Figure 8.** Scaling different model dimensions of the Vision Transformer.

Replacing the class token with a global‑average‑pooled (GAP) representation works similarly well, but only when the learning rate is lowered; with the learning rate tuned for the class token, GAP performs markedly worse.

**Figure 9.** Comparison of class-token and global average pooling classifiers. Both work similarly well, but require different learning-rates.

**Table 8.** Results of the ablation study on positional embeddings with ViT-B/16 model evaluated on ImageNet 5-shot linear.

**Figure 10.** Position embeddings of models trained with different hyperparameters.

**Figure 12.** **Left:** Real wall-clock timings of various architectures across input sizes. ViT models have speed comparable to similar ResNets. **Right:** Largest per-core batch-size fitting on device with various architectures across input sizes. ViT models are clearly more memory-efficient.

**Figure 13.** Performance of Axial-Attention based models, in terms of top-1 accuracy on ImageNet 5-shot linear, versus their speed in terms of number of FLOPs (left) and inference time (right).

**Figure 11.** Size of attended area by head and network depth. Attention distance was computed for 128 example images by averaging the distance between the query pixel and all other pixels, weighted by the attention weight. Each dot shows the mean attention distance across images for one of 16 heads at one layer. Image width is 224 pixels.

**Figure 14.** Further example attention maps as in Figure 6 (random selection).
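The mean attention distance described in the Figure 11 caption can be sketched as an attention-weighted average of spatial distances. This is a simplified illustration: the function name is hypothetical, distances are measured between patch positions on the grid (scaled by the patch size in pixels) rather than between individual pixels, the classification token is excluded, and a single head on a single image is considered.

```python
import math


def mean_attention_distance(attn, grid, patch_px):
    """Average, over query patches, of the attention-weighted distance to all patches.

    attn is an N x N matrix (N = grid * grid) whose rows sum to 1, with patches
    indexed in row-major order.
    """
    n = grid * grid
    total = 0.0
    for q in range(n):
        q_row, q_col = divmod(q, grid)
        for k in range(n):
            k_row, k_col = divmod(k, grid)
            distance = math.hypot(q_row - k_row, q_col - k_col) * patch_px
            total += attn[q][k] * distance
    return total / n


grid, patch_px = 2, 16
n = grid * grid
local_head = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]   # attends only to itself
global_head = [[1.0 / n] * n for _ in range(n)]                               # attends uniformly

print(mean_attention_distance(local_head, grid, patch_px))
print(mean_attention_distance(global_head, grid, patch_px))
```

A head that attends only to its own patch has distance zero, while a uniformly attending head has a distance set by the image geometry; this quantity plays a role similar to receptive field size in a CNN.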

Self-Attention Mechanism

Technical details of self‑attention and hyper‑parameter settings.

The authors thank colleagues at Google for infrastructure support and discussions that enabled the large‑scale experiments.


**Table 3.** Training hyperparameters for different models and datasets.

Self‑attention maps each input token $z_i$ to a weighted combination of all value vectors $v_j$, where the weights derive from the similarity of the corresponding query $q_i$ and key $k_j$.

Multi‑head self‑attention (MSA) runs $k$ independent attention heads in parallel, then concatenates their outputs and projects back to the model dimension.
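Written out, the two definitions above take the following form. Here $z\in\mathbb{R}^{N\times D}$ is the input sequence of $N$ tokens; the projection matrices $U_{qkv}$ and $U_{msa}$ and the per‑head dimension $D_h$ (typically $D/k$) are notation assumed for this sketch.

$$
[q, k, v] = z\,U_{qkv}, \qquad U_{qkv}\in\mathbb{R}^{D\times 3D_h}
$$

$$
A = \mathrm{softmax}\!\left(\frac{q k^{\top}}{\sqrt{D_h}}\right), \qquad A\in\mathbb{R}^{N\times N}
$$

$$
\mathrm{SA}(z) = A\,v
$$

$$
\mathrm{MSA}(z) = [\mathrm{SA}_1(z);\ \mathrm{SA}_2(z);\ \dots;\ \mathrm{SA}_k(z)]\,U_{msa}, \qquad U_{msa}\in\mathbb{R}^{k\cdot D_h\times D}
$$

Row $i$ of $A$ holds the weights with which token $i$ combines the value vectors, so each row sums to one.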

Training Configuration

Details of training, fine‑tuning, and self‑supervision setups for ViT models.

Table 3 lists the training configurations for each model, highlighting that strong regularization is essential when training from scratch on ImageNet. Dropout, when used, is applied in two places: after every dense layer except the qkv‑projections, and directly after adding positional embeddings to patch embeddings. Hybrid models use exactly the same setup as their ViT counterparts, and all training runs use a $224 \times 224$ resolution.

All ViT models are fine‑tuned using SGD with momentum $0.9$ and a small grid search over learning rates (see Table 4). For this search we hold out $10\%$ of Pets and Flowers, $2\%$ of CIFAR, and $1\%$ of ImageNet from the training set as development splits and train on the remaining data; final results are obtained by training on the entire training set and evaluating on the test set. ResNet and hybrid fine‑tuning mirrors this setup, adding an extra learning‑rate value $0.06$ for ImageNet. For ResNets we additionally run the Kolesnikov et al. (2020) protocol and keep the better result. All fine‑tuning experiments run at $384 \times 384$ resolution unless noted otherwise.

When adapting ViT to a new dataset we drop the original classification head (two linear layers) and insert a single zero‑initialized linear layer matching the target class count. This simple replacement proves more robust than merely re‑initializing the final layer.
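One consequence of zero initialization can be seen in a small sketch: before any fine-tuning step, the new head outputs zero logits for every class, so the predicted distribution is uniform regardless of the input feature. The function names, the inclusion of a bias term, and the toy dimensions are assumptions for illustration.

```python
import math


def make_zero_head(dim: int, num_classes: int):
    """A single linear layer with all weights (and an assumed bias) set to zero."""
    return {"weight": [[0.0] * num_classes for _ in range(dim)],
            "bias": [0.0] * num_classes}


def head_logits(head, cls_feature):
    return [sum(x * row[k] for x, row in zip(cls_feature, head["weight"])) + head["bias"][k]
            for k in range(len(head["bias"]))]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


cls_feature = [0.3, -1.2, 0.7, 2.0]          # stand-in for the final class-token vector
head = make_zero_head(dim=4, num_classes=5)  # target dataset with 5 classes
probs = softmax(head_logits(head, cls_feature))
print(probs)
```

Every class receives probability 1/5 at the start of fine-tuning, so the pre-trained backbone is not initially driven by arbitrary random logits.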

For VTAB we adopt the Kolesnikov et al. (2020) protocol, using a uniform learning rate of $0.01$ and training for $2\,500$ steps as shown in Table 4. The hyperparameters were chosen after a brief sweep over two learning rates and two schedules, selecting the configuration with the highest VTAB score on $200$‑example validation sets. All tasks benefit from a high input resolution of $384 \times 384$, and we omit task‑specific resolution adjustments.

**Table 4.** Hyperparameters for fine-tuning. All models are fine-tuned with cosine learning rate decay, a batch size of 512, no weight decay, and grad clipping at global norm 1. If not mentioned otherwise, fine-tuning resolution is 384.

For the self‑supervision experiments we employ masked patch prediction, corrupting $50\%$ of patch embeddings. Each corrupted embedding is either replaced with a learnable mask token ($80\%$ of cases), replaced with a random other patch embedding ($10\%$), or left unchanged ($10\%$). The model then predicts the $3$‑bit mean color of each corrupted patch (3 bits per RGB channel, hence $2^{9}=512$ possible colors) from its representation.
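The corruption scheme and the prediction target can be sketched as follows. This is an illustration under assumptions, not the authors' code: each patch is selected independently with probability 0.5 (the paper may select an exact fraction instead), the mean color is quantized by uniform binning of 8-bit channel values, and the names `choose_corruptions` and `mean_color_class` are hypothetical.

```python
import random


def choose_corruptions(num_patches, rng, corrupt_rate=0.5):
    """Assign each patch one action: 'keep', 'mask', 'random', or 'unchanged'.

    'keep' means the patch was not selected for prediction; the other three
    are the 80/10/10 split applied to selected patches.
    """
    actions = []
    for _ in range(num_patches):
        if rng.random() >= corrupt_rate:
            actions.append("keep")
            continue
        r = rng.random()
        if r < 0.8:
            actions.append("mask")        # replace with the learnable mask token
        elif r < 0.9:
            actions.append("random")      # replace with another patch's embedding
        else:
            actions.append("unchanged")   # selected as a target but left as is
    return actions


def mean_color_class(patch_pixels, bits=3):
    """Map a patch of (r, g, b) pixels in 0..255 to one of 2**(3*bits) color classes."""
    levels = 1 << bits
    index = 0
    for channel in range(3):
        mean = sum(p[channel] for p in patch_pixels) / len(patch_pixels)
        quantized = min(int(mean * levels / 256), levels - 1)
        index = index * levels + quantized
    return index


rng = random.Random(0)
actions = choose_corruptions(196, rng)
print({a: actions.count(a) for a in ("keep", "mask", "random", "unchanged")})
print(mean_color_class([(255, 0, 0)] * 256))   # a pure-red 16x16 patch
```

With 3 bits per channel the target is a 512-way classification, which is far cheaper to predict than the raw pixel values of the patch.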

The self‑supervised model is trained for $1\,\text{M}$ steps (≈ 14 epochs) on JFT with a batch size of $4\,096$ using Adam, a base learning rate of $2 \times 10^{-4}$, a $10\,\text{k}$‑step warmup, and cosine decay.
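The schedule described above can be sketched as a function of the training step. The linear shape of the warmup and the decay to exactly zero at the final step are assumptions; the text only specifies the base learning rate, the warmup length, and cosine decay. The epoch estimate uses the 303 M image count quoted earlier for JFT.

```python
import math


def learning_rate(step, base_lr=2e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup (assumed) followed by cosine decay to zero (assumed end value)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, f"{learning_rate(step):.2e}")

# Rough epoch count: steps * batch size / dataset size.
epochs = 1_000_000 * 4_096 / 303_000_000
print(f"approx. epochs: {epochs:.1f}")
```

The epoch estimate comes out at about 13.5, consistent with the roughly 14 epochs stated above.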

We explored three prediction target options: (1) predicting the $3$‑bit mean color, (2) predicting a $4 \times 4$ downscaled version of the $16 \times 16$ patch with $3$‑bit colors (16 predictions), and (3) regressing the full patch with an L2 loss (256 regressions). All performed well, though the L2 variant was slightly inferior, so we report results for option 1, which yielded the best few‑shot performance.

A $15\%$ corruption rate, as used by Devlin et al. (2019), was also tested but produced marginally lower few‑shot scores.

Masked patch prediction does not require massive pre‑training on JFT to be useful: downstream performance shows diminishing returns after roughly $100\,\text{k}$ pre‑training steps, and comparable gains are achieved when pre‑training on ImageNet alone.

Supplementary Results

Additional tables expand the transfer and scaling results referenced in the main text.

We provide the full numeric tables that underpin the summary figures in the paper, allowing precise comparison of Vision Transformer (ViT) variants and baselines across multiple downstream tasks.

**Table 5.** Top1 accuracy (in %) of Vision Transformer on various datasets when pre-trained on ImageNet, ImageNet-21k or JFT300M. These values correspond to Figure 3 in the main text. Models are fine-tuned at 384 resolution. Note that the ImageNet results are computed without additional techniques (Polyak averaging and 512 resolution images) used to achieve results in Table 2.

Table 6 expands the analysis to include model‑scaling experiments, reporting both transfer accuracy on the same downstream datasets and the associated pre‑training compute measured in exaFLOPs for each ViT and ResNet configuration.

The compute budget for ViT ranges from 55 exaFLOPs for ViT‑B/32 up to 4 262 exaFLOPs for the largest model, ViT‑H/14 (ViT‑L/16 trained for 14 epochs uses 1 567 exaFLOPs), while ResNet variants span 50 exaFLOPs (ResNet50×1) to 3 306 exaFLOPs (ResNet200×3), illustrating the trade‑off between model size, pre‑training effort, and downstream performance.
