Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick

A promptable segmentation model and 1.1B-mask dataset that enables zero-shot transfer to diverse vision tasks.

How can we build a foundation model for image segmentation that generalizes to unseen objects and tasks via prompting?

Computer vision lacks a general-purpose segmentation model because existing datasets are too small and task-specific to support broad generalization. The authors introduce the Segment Anything Model (SAM): a promptable architecture that decouples image encoding from a lightweight, real-time mask decoder, trained on a massive, automatically generated dataset. This approach achieves strong zero-shot performance across diverse tasks, often rivaling or exceeding fully supervised models on unseen data distributions.

Paper Primer

The core move is the "data engine": a three-stage loop where the model assists human annotators, then generates its own masks, and finally scales to 11 million images. By training on this massive, automatically curated dataset (SA-1B), the model learns to output valid masks for any prompt—points, boxes, or text—without needing task-specific fine-tuning.

SAM achieves high-quality zero-shot segmentation across diverse, unseen image distributions.

Evaluated on a suite of 23 datasets, SAM consistently produces masks rated higher by humans than those from the strongest interactive segmentation baselines. The SA-1B dataset contains 1.1 billion masks, representing a 400x increase in scale over previous segmentation datasets.

The model is ambiguity-aware: it predicts multiple potential masks for a single prompt (e.g., a point on a shirt could refer to the shirt, the person, or the whole figure) and ranks them by confidence, ensuring it remains useful even when the user's intent is underspecified.

Why is this model "promptable" rather than just a standard segmenter?

Promptability allows the model to act as a modular component in larger systems; by accepting points or boxes as inputs, it can be composed with other tools (like object detectors) to solve new tasks without retraining.

Does the reliance on automatically generated masks hurt performance?

No; the authors found that training on the fully automatic masks alone yields performance nearly identical to training on the combined manual and semi-automatic data, validating the quality of the data engine.

Segmentation is now a foundation model task: researchers can treat SAM as a plug-and-play module for downstream vision systems, bypassing the need for task-specific training data.

Introduction

We expose why segmentation needs a promptable foundation model and a massive dataset.

Segmentation models today are shackled to narrow, task‑specific training pipelines, which prevents them from generalizing to new objects or domains without costly re‑annotation.

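Consider storing masks naively as uncompressed bitmaps at one bit per pixel. For a toy 16-pixel image (for example, 4 × 4) the mask occupies 2 bytes.

For 1 billion such images, with one mask each, the total storage is 2 GB.

If we used a higher‑resolution 1024 × 1024 image (≈1 M pixels), a single mask would need ~125 KB, and 1 B masks would exceed 100 TB.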

This scaling argument shows that naïvely collecting masks at full resolution is infeasible, motivating a data‑engine that leverages a model to generate masks on the fly.
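The storage arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes uncompressed binary masks at one bit per pixel and one mask per image; the helper name `mask_bytes` and the 4 × 4 toy size are illustrative assumptions rather than anything specified by the paper.

```python
def mask_bytes(height: int, width: int) -> int:
    """Bytes needed for an uncompressed binary mask at 1 bit per pixel."""
    bits = height * width
    return (bits + 7) // 8  # round up to whole bytes


toy = mask_bytes(4, 4)          # 16 pixels -> 2 bytes
full = mask_bytes(1024, 1024)   # 1,048,576 pixels -> 131,072 bytes

n_masks = 1_000_000_000
toy_total_gb = toy * n_masks / 1e9      # decimal gigabytes
full_total_tb = full * n_masks / 1e12   # decimal terabytes

print(toy, full, toy_total_gb, round(full_total_tb, 1))  # 2 131072 2.0 131.1
```

The exact figure for a 1024 × 1024 mask (131,072 bytes) is slightly above the rounded ~125 KB estimate, which only strengthens the "more than 100 TB" conclusion.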

To break this bottleneck the paper proposes a promptable foundation model—Segment Anything Model (SAM)—trained on a massive, automatically generated dataset, enabling zero‑shot generalization to arbitrary prompts and image distributions.

The solution rests on three tightly coupled components: (1) a promptable segmentation task that can be expressed with points, boxes, or text; (2) an efficient SAM architecture that produces masks in real time; and (3) a data engine that iteratively uses the model to annotate images and then retrains on the newly collected masks.

**Figure 1.** We aim to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.

The core shift is moving from narrowly trained, task‑specific segmenters to a promptable foundation model that learns from a billion‑mask dataset and generalizes zero‑shot.

Model Architecture

We describe a three-component model (image encoder, prompt encoder, mask decoder) whose fast, ambiguity-aware mask decoder merges image and prompt information.

The promptable segmentation task demands a model that can ingest arbitrary prompts, answer instantly for interactive use, and gracefully handle ambiguous cues such as a point on a shirt that could belong to the shirt or the person wearing it.

Think of the mask decoder as a tiny mixer that quickly blends the image’s “flavor” (its embedding) with the prompt’s “recipe” (its embedding) to bake a segmentation mask, and it can bake several candidate masks at once when the prompt is vague.

How does this mask decoder differ from a conventional decoder that processes the full image for every prompt?

A conventional decoder would re-run a heavy backbone on the image each time a new prompt arrives, incurring the full image-to-feature cost repeatedly. Our decoder reuses the pre-computed image embedding $E_I$, so only the lightweight prompt encoder and mask decoder run per prompt, which is far cheaper than re-encoding the image.

Concatenate $E_I$ and $E_P$ → $[2,\,-1,\,3]$.

Linear transform with $W = \begin{bmatrix} 1 & 0 & 2 \\ -1 & 1 & 0 \end{bmatrix}$ and $b = [0,\,1]^{\top}$: $z = W\,[2,\,-1,\,3]^{\top} + b = [2\cdot1 + (-1)\cdot0 + 3\cdot2,\; 2\cdot(-1) + (-1)\cdot1 + 3\cdot0]^{\top} + [0,\,1]^{\top} = [8,\,-3]^{\top} + [0,\,1]^{\top} = [8,\,-2]^{\top}$.

Apply the sigmoid to these logits to obtain foreground probabilities for mask 1: $\sigma(8)\approx0.9997$, $\sigma(-2)\approx0.119$ → coarse mask $M_1$.

For mask 2 we use a second linear head $W^{(2)}$ (here identical for illustration) producing the same logits; in practice the heads differ, yielding a second plausible mask.

The decoder can output multiple masks without recomputing $E_I$, so ambiguity is handled by cheap parallel heads rather than expensive re‑encoding of the image.
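The worked example can be reproduced in a few lines. The sketch below is a toy under stated assumptions: `W` and `b` are read off the expansion in the example, the second head is identical to the first (as in the illustration), and the helper names `linear_head` and `sigmoid` are hypothetical. It is not SAM's actual decoder.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear_head(W, b, x):
    """Compute W @ x + b for small Python lists."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


x = [2.0, -1.0, 3.0]                      # concatenated [E_I, E_P] from the example
W = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]   # weights read off the expansion
b = [0.0, 1.0]

heads = [(W, b), (W, b)]  # two heads; identical here purely for illustration
logits = [linear_head(Wk, bk, x) for Wk, bk in heads]
probs = [[sigmoid(z) for z in zk] for zk in logits]

print(logits[0])                         # [8.0, -2.0]
print([round(p, 4) for p in probs[0]])   # [0.9997, 0.1192]
```

Because `x` is computed once and shared by every head, adding more candidate masks only adds more small matrix-vector products.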

Compute the image embedding $E_I$ once with the image encoder.

Encode the incoming prompt (point, box, mask, or text) into $E_P$.

Concatenate $E_I$ and $E_P$, feed through the mask decoder MLP.

Obtain $K$ mask logits; apply a sigmoid and threshold the result to produce binary masks.

Select the most confident mask or present all $K$ masks to the user.
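The steps above amount to a compute-once, query-many pattern. The following sketch illustrates only that control flow: `ToySegmenter`, its methods, the row-mean "embedding", and the placeholder confidence scores are all hypothetical stand-ins, not SAM's real API or model.

```python
class ToySegmenter:
    """Toy encoder/decoder split. Hypothetical; this is not SAM's real API."""

    def __init__(self):
        self.encoder_calls = 0
        self._embedding = None

    def set_image(self, image):
        # Expensive step: run once per image and cache the result.
        self.encoder_calls += 1
        self._embedding = [sum(row) / len(row) for row in image]

    def predict(self, prompt, k=3):
        # Cheap step: every prompt reuses the cached embedding.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        row, _col = prompt
        base = self._embedding[row]
        # K candidate masks as (mask_id, confidence); the scores are placeholders.
        candidates = [(i, base / (i + 1)) for i in range(k)]
        best = max(candidates, key=lambda c: c[1])
        return best, candidates


seg = ToySegmenter()
seg.set_image([[0.2, 0.4], [0.6, 0.8]])
results = [seg.predict(p) for p in [(0, 0), (1, 1), (0, 1)]]
best, candidates = results[1]
print(seg.encoder_calls, len(candidates), best[0])  # 1 3 0
```

Three prompts are answered with a single encoder call, and the caller can either take `best` or show all `k` candidates.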

By structuring SAM as an image encoder plus a fast, prompt‑aware mask decoder, we satisfy all three design constraints: flexible prompting, real‑time amortized inference, and built‑in ambiguity handling.

The Promptable Segmentation Task

Defines the promptable segmentation task and its pre‑training scheme for a universal segmentation foundation.

The core obstacle is that existing segmentation models are trained on narrow, task‑specific data, which prevents them from handling arbitrary prompts at inference.

Like next‑token prediction in language models, the task asks the model to produce a segmentation mask given any user‑provided prompt that indicates what to segment.

How is this task different from traditional interactive segmentation?

Interactive segmentation expects the model to converge to a correct mask after a sequence of user corrections; promptable segmentation must produce a plausible mask *immediately* for any single prompt, even when the cue is ambiguous.

A prompt can be any piece of information—points, boxes, masks, or text—that tells the model which region of the image to extract.

Why can a single model handle such diverse prompt types without separate heads?

Sparse prompts (points, boxes, text) are first embedded as tokens in a common prompt space; the mask decoder attends to these tokens alongside the image tokens, so the same attention machinery interprets them uniformly. Dense mask prompts are instead embedded with convolutions and added element-wise to the image embedding.

Encode the image and the three prompt tokens (two points, one box) into a shared latent space.

Each prompt token attends to the image tokens; the points strongly activate the four pixels of the red square, while the box activates a $3\times3$ region.

The mask decoder aggregates the attended features and produces a binary mask covering the $2\times2$ red square (the intersection of point and box cues).

Because the prompt is unambiguous, the loss is low; if the box were larger, the decoder would still output a plausible mask (e.g., the whole $3\times3$ region) satisfying the validity rule.

The example shows that the model can fuse heterogeneous cues—points and a box—into a single coherent mask without needing separate processing pipelines.

**Figure 3.** Each column shows 3 valid masks generated by SAM from a single ambiguous point prompt (green circle).

These definitions establish the Segment Anything Task as a universal, prompt‑driven segmentation problem that can be pre‑trained once and applied zero‑shot to any downstream segmentation need.

The Segment Anything Model (SAM)

SAM unifies a single image embedding with flexible prompts to produce masks in real‑time.

Task‑specific segmentation models require retraining for each new prompt type, preventing interactive use. SAM solves this by decoupling image processing from prompt handling, enabling on‑the‑fly queries.

Compute a dense image embedding once; any combination of point, box, text, or mask prompts is then embedded by a lightweight prompt encoder and turned into masks by a lightweight two-layer mask decoder.

Image encoder produces the 4×4 embedding matrix $E$.

Prompt encoder maps the point to vector $p$ and adds the same positional encoding used for $E$.

Mask decoder cross-attends $p$ to $E$, updating both the prompt token and the image tokens.

After two attention layers, the decoder upsamples the refined embedding and applies a linear classifier to each spatial location, yielding a binary mask that lights up only the (2, 2) pixel.

The same pre-computed embedding $E$ can be reused for any number of prompts, so each additional prompt costs only a pass through the lightweight prompt encoder and mask decoder, not another pass through the image encoder.

How does SAM’s prompt encoder differ from simply concatenating raw prompt inputs to the image features?

Instead of raw pixels, SAM maps each prompt type to a learned embedding and adds a positional encoding, allowing the decoder’s cross‑attention to fuse prompt and visual information efficiently. A naïve concatenation would explode dimensionality and prevent the model from learning joint attention patterns.

Initialize prompt tokens from the prompt encoder and image tokens from the cached embedding.

Prompt‑to‑image cross‑attention: each prompt token attends to all image tokens, updating the prompt representation.

Image‑to‑prompt cross‑attention: each image token attends to the updated prompt tokens, refining the visual representation.

Repeat the two cross‑attention steps for a second decoder block.

Upsample the refined image tokens to the original resolution.

Apply a dynamic linear classifier (derived from the output token) to each pixel, producing foreground probabilities for each mask.

Optionally compute an IoU confidence score for each mask.

Inference of SAM given a cached image embedding and a set of prompts.
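The decoder steps listed above can be sketched in plain Python. This is a heavily simplified illustration, not the paper's implementation: it uses single-head attention with identity projections, omits self-attention, MLPs, positional encodings, and upsampling, and has no trained weights, so the resulting probabilities are not meaningful. The function names `cross_attend` and `decode` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Single-head attention with identity projections and a residual add."""
    scale = math.sqrt(len(keys_values[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended = [sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(len(q))]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def decode(prompt_tokens, image_tokens, num_blocks=2):
    for _ in range(num_blocks):
        prompt_tokens = cross_attend(prompt_tokens, image_tokens)  # prompt -> image
        image_tokens = cross_attend(image_tokens, prompt_tokens)   # image -> prompt
    # Dynamic linear classifier: the output token scores every image location.
    out_token = prompt_tokens[0]
    logits = [dot(out_token, t) for t in image_tokens]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # 4 locations, dim 2
prompt_tokens = [[1.0, 0.0]]                                       # one point prompt
probs = decode(prompt_tokens, image_tokens)
print(len(probs))  # 4
```

The structural point is that the classifier weights come from the updated prompt token rather than from a fixed layer, which is what lets one decoder serve arbitrary prompts.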

**Figure 4.** Segment Anything Model (SAM) overview. A heavyweight image encoder outputs an image embedding that can then be efficiently queried by a variety of input prompts to produce object masks at amortized real-time speed. For ambiguous prompts corresponding to more than one object, SAM can output multiple valid masks and associated confidence scores.

**Figure 14.** Details of the lightweight mask decoder. A two-layer decoder updates both the image embedding and prompt tokens via cross-attention. Then the image embedding is upscaled, from which the updated output tokens are used to dynamically predict masks. (Not illustrated for figure clarity: At every attention layer, positional encodings are added to the image embedding, and the entire original prompt token (including position encoding) is re-added to the token queries and keys.)

Because the image embedding is cached, the prompt encoder and mask decoder together run in roughly 50 ms on a CPU, enabling seamless, real‑time interactive prompting directly in a web browser.

The Data Engine

Three-stage pipeline that turns manual clicks into billions of masks.

Segmentation research stalls because high‑quality masks are scarce; the authors therefore built a Data Engine that harvests masks at scale while continuously improving the model.

The engine works like an assembly line: objects start in a fully manual station, then move to a semi‑automatic station where machines pre‑fill easy parts, and finally to a fully automatic station that runs without human hands.

Stage 1: three manual clicks → three initial masks; brush corrects boundary errors.

Stage 2: detector auto‑adds mask for C; annotator only edits A and B.

Stage 3: grid point (row 1, col 2) yields three nested masks; stability test keeps all three.

NMS removes duplicate masks, leaving a final set of five masks for the two images.

The three-stage flow turns a fully manual bottleneck into a largely autonomous pipeline, and the stability test keeps only masks that barely change when the probability threshold is shifted slightly.

How does the Data Engine differ from a standard interactive‑segmentation loop?

In a classic loop the human draws every mask from scratch; here the human only corrects or adds masks while the model continuously proposes and improves masks, and later stages eliminate the human entirely.

In Stage 1, SAM's precomputed image embeddings let annotators see mask updates in real time, and as the model improved, average annotation time per mask fell from 34 s to 14 s (about 2.4× faster). At 14 s per mask, annotation is 6.5× faster than COCO mask labeling.

Stage 2’s bounding‑box detector, trained on the first‑stage masks, supplies “confident” masks that focus annotators on less obvious objects, raising the average masks per image from 44 to 72.

Stage 3's ambiguity-aware model is prompted with a 32 × 32 grid of points, predicts a set of nested masks for each point, filters them by predicted IoU confidence and the $0.5 \pm \delta$ stability criterion, applies NMS to remove duplicates, and processes overlapping zoomed-in crops to retain high-quality small masks.
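The stability criterion can be illustrated with a small sketch: threshold the probability map slightly below and slightly above 0.5 and keep the mask only if the two binarizations agree. The values of `delta` and `min_iou`, the cell-centered grid coordinates, and all function names below are assumptions for illustration, not values taken from the text.

```python
def threshold(prob_map, t):
    """Binarize a 2D probability map at threshold t."""
    return [[p > t for p in row] for row in prob_map]


def iou(mask_a, mask_b):
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += int(a and b)
            union += int(a or b)
    return inter / union if union else 1.0


def is_stable(prob_map, delta=0.05, min_iou=0.95):
    # delta and min_iou are illustrative values, not taken from the text.
    low = threshold(prob_map, 0.5 - delta)
    high = threshold(prob_map, 0.5 + delta)
    return iou(low, high) >= min_iou


# 32 x 32 grid of normalized point prompts (cell centers are an assumption).
grid = [((col + 0.5) / 32, (row + 0.5) / 32)
        for row in range(32) for col in range(32)]

sharp = [[0.99, 0.98], [0.01, 0.02]]  # confident mask: same at both thresholds
soft = [[0.52, 0.48], [0.51, 0.47]]   # borderline mask: flips between thresholds
print(len(grid), is_stable(sharp), is_stable(soft))  # 1024 True False
```

A mask whose probabilities sit near 0.5 changes drastically between the two thresholds and is discarded, while a confident mask passes unchanged.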

The SA-1B Dataset

SA-1B provides unprecedented scale and diversity for promptable segmentation.

SA-1B dwarfs existing segmentation datasets.

Figure 6 (legend) shows SA-1B contains $11\text{M}$ images and $1.1\text{B}$ masks—$11\times$ more images and $400\times$ more masks than Open Images.

A massive, automatically annotated segmentation dataset (11 M images, 1.1 B masks) that fuels a promptable foundation model.

How does SA-1B differ from earlier large segmentation datasets such as Open Images?

SA-1B is an order of magnitude larger (11 M vs 1 M images) and provides 400 × more masks, with an average of 100 masks per image versus < 10 for Open Images. Moreover, its masks are generated automatically by SAM; in the authors' quality check, 94 % of sampled automatic masks had > 90 % IoU with their professionally corrected versions. Prior datasets, by contrast, rely on manual annotation.
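The scale comparison follows from simple ratios. In this short sketch the Open Images counts are not independently sourced; they are back-computed from the stated 11× and 400× factors, so treat them as implied values only.

```python
sa1b_images = 11_000_000
sa1b_masks = 1_100_000_000

masks_per_image = sa1b_masks / sa1b_images   # 100.0

# Implied Open Images counts, derived only from the stated 11x and 400x ratios.
oi_images = sa1b_images / 11                  # 1,000,000
oi_masks = sa1b_masks / 400                   # 2,750,000
oi_masks_per_image = oi_masks / oi_images     # 2.75

print(masks_per_image, oi_masks_per_image)    # 100.0 2.75
```

The implied figure of under 3 masks per image is consistent with the "< 10" bound quoted for Open Images.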

**Figure 2.** Example images with overlaid masks from our newly introduced dataset, **SA-1B**. SA-1B contains 11M diverse, high-resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks. These masks were annotated *fully automatically* by SAM, and as we verify by human ratings and numerous experiments, are of high quality and diversity. We group images by number of masks per image for visualization (there are ~100 masks per image on average).

**Figure 5.** Image-size normalized mask center distributions.

**Figure 6.** Dataset mask properties. The legend references the number of images and masks in each dataset. Note, that SA-1B has 11x more images and 400x more masks than the largest existing segmentation dataset Open Images [60].

**Figure 7.** Estimated geographic distribution of SA-1B images. Most of the world’s countries have more than 1000 images in SA-1B, and the three countries with the most images are from different parts of the world.

The scale (11 M images, 1.1 B masks) and diversity of SA-1B far surpass prior segmentation datasets, enabling foundation‑level training for promptable segmentation.

Responsible AI Analysis

We assess geographic, income, and demographic fairness of SA‑1B and SAM.

We first examine how SA‑1B’s geographic and income distribution compares to two widely used benchmarks, COCO and Open Images.

Table 1 shows that SA‑1B contains at least 28 million masks per region—about ten times more than any prior dataset—while the average number of masks per image (94–108) remains stable across regions and income brackets.

Next we evaluate SAM’s segmentation quality on people across protected attributes using the MIAP dataset for gender and age and a proprietary skin‑tone collection.

All confidence intervals overlap within each attribute group, except for the older vs. middle age comparison where the older group shows a higher mean.

Overall, SAM exhibits comparable performance across gender, age, and skin‑tone categories, suggesting limited demographic bias in the segmentation task itself, though we note the large variance for older individuals.

Zero-Shot Transfer Experiments

Zero-shot single‑point prompting yields high‑quality masks across diverse datasets.

Promptable segmentation lets a model produce a mask from any user prompt; here we test the extreme case of a single foreground point.

Zero‑shot transfer means applying a model to a new task or dataset without any task‑specific training, relying only on the prompts it was trained to understand.

SAM outperforms the strong interactive segmenter RITM on zero‑shot single‑point mask quality, achieving up to +47 IoU improvement on individual datasets and higher human ratings.

SAM yields higher mIoU on 16 of 23 datasets, and annotators give its masks mean ratings between 7 and 9 on a 1 to 10 scale, consistently higher than RITM's (see Fig. 9b).
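For reference, mIoU is the intersection-over-union between each predicted mask and its ground truth, averaged over examples. The sketch below uses tiny hand-made 0/1 masks; the function names and sample data are illustrative assumptions and do not reproduce the paper's evaluation protocol.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as nested lists of 0/1."""
    inter = sum(p and g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p or g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs."""
    return sum(mask_iou(p, g) for p, g in pairs) / len(pairs)


pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # 1 shared pixel of 2 -> IoU 0.5
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # identical masks     -> IoU 1.0
]
print(mean_iou(pairs))  # 0.75
```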

**Figure 8.** Samples from the 23 diverse segmentation datasets used to evaluate SAM's zero-shot transfer capabilities.

Zero-Shot Object Proposals

Zero-shot SAM proposals rival supervised baselines on LVIS.

SAM outperforms ViTDet‑H on rare object proposals, achieving 65.8% mask AR@1000.

Table 4 shows SAM reaches 65.8% AR@1000 on the rare category, surpassing ViTDet‑H’s 58.3%.

SAM also beats ViTDet‑H on medium, large, common, and rare objects, while lagging on small and frequent objects where a supervised detector can learn dataset‑specific biases.
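AR@1000 measures how many ground-truth objects are covered by at least one of the top 1000 proposals. A minimal sketch, assuming COCO-style averaging over IoU thresholds from 0.50 to 0.95 and that each object's best IoU against the proposals has already been computed; the function name and the sample IoUs are made up for illustration.

```python
def average_recall(best_ious, thresholds=None):
    """Mean recall over IoU thresholds, given each ground-truth object's
    best IoU against the top-N proposals."""
    if thresholds is None:
        thresholds = [t / 100 for t in range(50, 100, 5)]  # 0.50, 0.55, ..., 0.95
    recalls = [sum(iou >= t for iou in best_ious) / len(best_ious)
               for t in thresholds]
    return sum(recalls) / len(recalls)


# Hypothetical best IoUs for four ground-truth objects.
best_ious = [0.97, 0.72, 0.58, 0.31]
ar = average_recall(best_ious)
print(round(ar, 3))  # 0.425
```

Averaging over thresholds rewards proposals that are tight as well as present, which is why mask quality matters for this metric and not just coverage.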

Zero-Shot Instance Segmentation

Zero-shot instance segmentation shows SAM narrows the gap to ViTDet and wins human quality ratings.

SAM trails ViTDet by 4.5 AP on COCO but receives higher human quality ratings.

Table 5 reports SAM AP 46.5 versus ViTDet-H 51.0 on COCO; Figure 11 shows SAM’s mean rating exceeds ViTDet’s.

ViTDet is a fully supervised detector with a ViT backbone that predicts bounding boxes, class scores, and its own instance masks; its boxes can also serve as prompts for SAM.

How does ViTDet differ from SAM’s mask decoder?

ViTDet's mask head is trained on the target dataset's annotations; SAM takes only ViTDet's boxes as prompts and synthesizes masks zero-shot via its promptable mask decoder.

**Table 5.** Instance segmentation results. SAM is prompted with ViTDet boxes to do zero-shot segmentation. The fully-supervised ViTDet outperforms SAM, but the gap shrinks on the higher-quality LVIS masks. Interestingly, SAM outperforms ViTDet according to human ratings (see Fig. 11).

**Figure 11.** Mask quality rating distribution from our human study for ViTDet and SAM, both applied to LVIS ground truth boxes. We also report LVIS and COCO ground truth quality. The legend shows rating means and 95% confidence intervals. Despite its lower AP (Table 5), SAM has higher ratings than ViTDet, suggesting that ViTDet exploits biases in the COCO and LVIS training data.

**Figure 16.** Zero-shot instance segmentation on LVIS v1. SAM produces higher quality masks than ViTDet. As a zero-shot model, SAM does not have the opportunity to learn specific training data biases; see top-right as an example where SAM makes a modal prediction, whereas the ground truth in LVIS is amodal given that mask annotations in LVIS have no holes.

Text-to-Mask and Ablations

Zero-shot text prompts are evaluated and the impact of data and model choices is quantified.

We extend SAM to the higher‑level task of segmenting objects from free‑form text. During training we replace the first prompt with a CLIP image embedding extracted from each manually collected mask (area $>100^2$). Because CLIP’s image embeddings are aligned with its text embeddings, at inference time we can feed a CLIP text embedding directly to SAM.

**Figure 12.** Zero-shot text-to-mask. SAM can work with simple and nuanced text prompts. When SAM fails to make a correct prediction, an additional point prompt can help.

Training with only the automatically generated masks lowers mIoU by only about 0.5 relative to the best result obtained with all three data-engine stages combined (manual, semi-automatic, and automatic).

Each data‑engine stage (manual → semi‑automatic → automatic) incrementally improves mIoU. Oversampling the scarce manual and semi‑automatic masks by a factor of ten yields the best results, but the authors adopt the automatic‑only configuration by default to keep training simple.

**Figure 13.** Ablation studies of our data engine stages, image encoder scaling, and training data scaling. (Left) Each data engine stage leads to improvements on our 23 dataset suite, and training with only the automatic data (our default) yields similar results to using data from all three stages. (Middle) SAM trained with ~10% of SA-1B and full SA-1B is comparable. We train with all 11M images by default, but using 1M images is a reasonable practical setting. (Right) Scaling SAM’s image encoder shows meaningful, yet saturating gains. Nevertheless, smaller image encoders may be preferred in certain settings.

Training on a reduced subset of SA‑1B (0.1 M images) causes a pronounced mIoU drop across all settings, whereas using 1 M images (≈10 % of the full dataset) yields results comparable to training on the full 11 M images. Scaling the image encoder from ViT‑B to ViT‑L improves performance noticeably, but further scaling to ViT‑H provides only marginal gains, suggesting limited benefit from larger encoders at this stage.

Experiments Overview

We evaluate SAM on many datasets and announce its public release.

We evaluate SAM across a new benchmark of 23 segmentation datasets, measuring its ability to generate high‑quality masks from a single foreground point. The masks are typically only marginally worse than manually annotated ground truth. Under a zero‑shot transfer protocol, SAM also delivers strong quantitative and qualitative performance on downstream tasks such as edge detection, object proposal generation, instance segmentation, and a preliminary text‑to‑mask prediction.

We release the SA-1B dataset for research and make SAM available under the Apache 2.0 license at https://segment-anything.com, accompanied by an online demo showcasing its capabilities.

Discussion

We examine SAM’s place among foundation models, its compositional potential, and its current limits.

Foundation models have become the dominant paradigm for transferring knowledge from large-scale pre-training to downstream tasks. While this trend extends to image segmentation, SAM illustrates that a foundation model for segmentation occupies only a fractional slice of the broader computer-vision landscape. Moreover, unlike approaches that emphasize self-supervised learning, SAM's image encoder is initialized with self-supervised masked-autoencoder (MAE) pre-training, but the model derives most of its capability from massive supervised training on the SA-1B dataset.

Beyond pure performance, SAM is designed as a composable component that can be wired into larger systems. By predicting a valid mask for a wide variety of prompts, it offers a reliable interface for downstream modules such as MCC, which leverages SAM to segment objects for single‑view 3‑D reconstruction, and wearable‑device pipelines that use gaze points as prompts for on‑the‑fly segmentation of egocentric video.

Despite its breadth, SAM has notable limitations: it can miss fine structures, occasionally hallucinate small disconnected components, and produces less crisp boundaries than dedicated zoom‑in methods. Interactive segmentation techniques that receive many user points typically achieve higher IoU, and SAM’s heavy image encoder prevents true real‑time performance even though prompt inference is fast. The emerging text‑to‑mask capability remains exploratory, and designing simple prompts for semantic or panoptic segmentation is still an open problem; domain‑specific tools are expected to retain an advantage in their specialized settings.

Model Architecture Details

We detail the encoder, prompt handling, mask decoder, loss, and training pipeline that enable real‑time, promptable segmentation.

The image encoder is a ViT‑H/16 pretrained by MAE [47] and adapted for high‑resolution inputs. It uses 14×14 windowed attention interleaved with four global‑attention blocks [62] and outputs a feature map that is downscaled by a factor of 16 relative to the input image.
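The 16× downscaling determines how many image tokens the mask decoder sees. A quick check, assuming a square 1024 × 1024 input (the input size is an assumption here; only the factor of 16 is stated above):

```python
input_size = 1024   # assumed square input resolution, in pixels
downscale = 16      # stated downscaling factor of the image encoder

grid = input_size // downscale   # side length of the embedding grid
num_tokens = grid * grid         # spatial locations in the embedding

print(grid, num_tokens)  # 64 4096
```

Under that assumption the embedding is a 64 × 64 grid, so the per-prompt decoder attends over 4,096 image tokens regardless of how many prompts are issued.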
