LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

LocateAnything accelerates VLM visual grounding by decoding bounding boxes as atomic parallel units instead of serial token streams.

How can we make vision-language models perform visual grounding and detection faster and more accurately by treating bounding boxes as atomic units rather than serial token streams?

Vision-language models typically generate bounding boxes as a sequence of individual coordinate tokens, creating a serial bottleneck that slows down inference and ignores the coupled nature of 2D geometry. LocateAnything introduces Parallel Box Decoding (PBD), which treats each bounding box as an atomic unit and predicts all four coordinates in a single parallel step. This approach achieves up to 2.5× higher decoding throughput than existing methods while simultaneously improving localization accuracy across diverse benchmarks.

Paper Primer

Standard next-token prediction forces models to serialize 2D boxes into 1D streams, which is both slow and structurally mismatched to the task. LocateAnything replaces this with a block-based formulation where the model predicts the full coordinate set for a box simultaneously, preserving geometric coherence.

To handle complex scenes where parallel decoding might falter, the authors implement a Hybrid Mode. The model defaults to parallel generation but monitors for format violations or spatial ambiguity, falling back to standard autoregressive decoding only for problematic blocks.

LocateAnything significantly advances the speed-accuracy frontier for visual grounding.

Evaluations on the LVIS and COCO benchmarks show consistent F1-score gains at high IoU thresholds over state-of-the-art VLM-based detectors. LocateAnything also achieves up to 2.5× higher decoding throughput than quantized-token models and 10× higher than standard textual VLMs.

The model maintains high precision in dense, cluttered environments.

On dense detection benchmarks such as VisDrone and Dense200, LocateAnything outperforms the related Rex-Omni model by 4.1 points on VisDrone mean F1.

Why is this approach better than simply using existing multi-token prediction methods?

Existing methods treat token streams as generic sequences, leading to arbitrary chunking that ignores the structure of bounding boxes. LocateAnything aligns its prediction blocks specifically with the atomic units of a bounding box, preventing the model from learning unreliable cross-boundary patterns.

What is the primary trade-off when using the different inference modes?

Fast Mode maximizes throughput for compute-constrained settings but may struggle with spatial ambiguity in dense scenes. Hybrid Mode mitigates this by dynamically switching to autoregressive decoding for unreliable blocks, preserving most speed gains while ensuring robust, high-precision output.

Researchers and engineers can now treat visual grounding as a high-throughput parallel task without sacrificing the structural precision required for dense or complex object detection.

Motivation and Problem Framing

We expose the bottleneck of serial coordinate tokenization and motivate Parallel Box Decoding as a solution.

Vision‑language models (VLMs) treat visual grounding as a generative task: they emit the coordinates of each target box as a sequence of tokens. Under the standard Next Token Prediction (NTP) paradigm, a model predicts one token at a time, so a 2‑D box is broken into a series of 1‑D coordinate tokens. This “coordinate‑token generation” couples a spatial object with a linear stream, but it forces strictly sequential decoding, which creates a latency bottleneck and ignores the inherent geometric coherence of a box.

Textual Digits emit each decimal digit separately: 4 coordinates × 4 digits = 16 tokens, plus delimiters → 21 tokens total.

Quantized Tokens treat each coordinate as a single quantized token, yielding 4 tokens (one per $x_1, y_1, x_2, y_2$).

Parallel Box Decoding emits the entire box as a single atomic block, requiring only 1 decoding step because all tokens of the block are predicted together.

Serial tokenization inflates the decoding length dramatically, turning a simple box into a long autoregressive chain that dominates inference time.
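The token counts above can be turned into a quick estimate of how decoding length grows with the number of boxes. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the figure of 5 delimiter tokens is only inferred from the stated total of 21 minus 16 digit tokens.

```python
# Sequential decoding steps needed per box under each representation.
# Assumption: 5 delimiter tokens, inferred from 21 total - 16 digit tokens.
DIGITS_PER_COORD = 4
COORDS_PER_BOX = 4
TEXTUAL_DELIMITERS = 5

STEPS_PER_BOX = {
    "textual": DIGITS_PER_COORD * COORDS_PER_BOX + TEXTUAL_DELIMITERS,  # 21
    "quantized": COORDS_PER_BOX,  # one token per coordinate
    "parallel": 1,  # the whole block in one step
}


def total_steps(scheme: str, n_boxes: int) -> int:
    """Return the number of sequential decoding steps for n_boxes boxes."""
    if n_boxes < 0:
        raise ValueError("n_boxes must be non-negative")
    return STEPS_PER_BOX[scheme] * n_boxes


if __name__ == "__main__":
    for scheme in STEPS_PER_BOX:
        print(scheme, total_steps(scheme, n_boxes=20))
```

For a scene with 20 boxes, this counts 420 sequential steps for textual digits, 80 for quantized tokens, and 20 for parallel blocks, ignoring any semantic or terminator tokens.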

**Figure 1.** Versatile tasks of LocateAnything with parallel box decoding. Top: LocateAnything supports diverse localization tasks under a unified vision-language model. Bottom: Textual digit decoding spells coordinates digit by digit, and quantized coordinate decoding predicts coordinate tokens sequentially. In contrast, Parallel Box Decoding predicts each geometric unit (e.g., a bounding box) in a single forward pass.

Our contributions are threefold: (1) we introduce LocateAnything, a unified framework that applies Multi‑Token Prediction to VLM‑based detection by treating each box as an atomic unit (Parallel Box Decoding); (2) we design a hybrid decoding policy that falls back to NTP only when the parallel output is unreliable; (3) we assemble LocateAnything‑Data, a 138‑million‑sample dataset that fuels high‑precision localization across many benchmarks.

The serial coordinate‑token approach is a fundamental inefficiency that Parallel Box Decoding eliminates.

Parallel Box Decoding Mechanism

LocateAnything replaces sequential coordinate tokens with atomic box blocks for fast, consistent visual grounding.

Instead of sending each coordinate token one after another, the model parcels the whole box as a fixed‑size block—like mailing a complete parcel rather than listing every item separately.

Tokenize each coordinate into a discrete token (e.g., 100 → “t100”).

Insert the opening structural token <box> at the start of the block.

Place the four coordinate tokens in order.

Add the closing structural token </box> as the sixth position.

If the model processes a batch of three objects, each block is padded with <null> tokens where an object is missing, preserving the 6‑slot shape.

This toy example shows how a fixed‑size block lets the model predict all six tokens in one parallel step, eliminating the need for a token‑by‑token chain.
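The block layout from the toy example can be sketched in a few lines. This is an illustrative sketch under assumptions: the helper names are hypothetical, the `t<value>` token spelling follows the "t100" example, coordinates are assumed to lie in the normalized [0, 1000] range, and a missing object is assumed to be represented by a block made entirely of `<null>` tokens.

```python
BOX_OPEN, BOX_CLOSE, NULL = "<box>", "</box>", "<null>"
BLOCK_SIZE = 6  # <box> + 4 coordinate tokens + </box>


def coord_token(value: int) -> str:
    """Map a normalized coordinate to a discrete token, e.g. 100 -> 't100'."""
    if not 0 <= value <= 1000:
        raise ValueError(f"coordinate {value} outside [0, 1000]")
    return f"t{value}"


def box_block(x1: int, y1: int, x2: int, y2: int) -> list:
    """Build one fixed-size, 6-slot box block."""
    return [BOX_OPEN, *(coord_token(v) for v in (x1, y1, x2, y2)), BOX_CLOSE]


def pad_blocks(boxes: list, n_slots: int) -> list:
    """Return n_slots blocks, filling missing objects with all-<null> blocks."""
    if len(boxes) > n_slots:
        raise ValueError("more boxes than slots")
    blocks = [box_block(*b) for b in boxes]
    blocks += [[NULL] * BLOCK_SIZE for _ in range(n_slots - len(blocks))]
    return blocks


if __name__ == "__main__":
    for block in pad_blocks([(100, 200, 300, 400), (10, 20, 30, 40)], n_slots=3):
        print(block)
```

Because every block has the same length, a batch of blocks forms a regular grid that can be predicted slot by slot in parallel.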

How does Parallel Box Decoding differ from the standard Next Token Prediction of coordinates?

In NTP each coordinate token is generated autoregressively, so the model must wait for the previous token before predicting the next. PBD treats the six tokens that form a box as a single atomic unit and predicts them simultaneously, removing the sequential dependency and enabling parallel computation.

The output stream is organized as a sequence of fixed‑length blocks, each dedicated to a specific role—semantic, spatial, negative, or termination.

Why does the model need four distinct block types instead of a single generic block?

Each block type conveys a different kind of information that the downstream task consumes differently: semantic tokens guide language understanding, box coordinates provide spatial grounding, a negative block explicitly tells the model that no object matches the query, and the end block cleanly terminates generation. Keeping them separate lets the model learn specialized patterns for each role while still sharing the same parallel decoding infrastructure.

The model learns both the traditional token‑level NTP sequence and the block‑wise MTP representation at once, using a mask that isolates the two streams while still sharing the visual and textual context.

Encode the input image ℐ with Moon‑ViT to obtain visual tokens Z.

Tokenize the text query E into a sequence of language tokens.

Form the standard NTP token sequence `x_ntp` by concatenating the query tokens with coordinate tokens.

Convert `x_ntp` into the block sequence `x_blk` by grouping every six tokens; keep the first token of each block and replace the remaining five with [mask] tokens.

Concatenate `x_vis`, `x_q`, `x_ntp`, and `x_blk` into `x_all`.

Apply the block‑causal attention mask that implements causal NTP, causal block flow, and bidirectional intra‑block attention.

Compute cross‑entropy losses for both the NTP and MTP streams and back‑propagate the summed loss ℒ.
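The attention pattern in the steps above can be illustrated with a small boolean mask. This is a simplified sketch, not the paper's implementation: it covers only the shared context and the MTP block stream, omits the separate NTP stream and the isolation between the two streams, and the function name is hypothetical. A `True` entry at `[i, j]` means position `i` may attend to position `j`.

```python
import numpy as np


def block_causal_mask(n_ctx: int, n_blocks: int, block_size: int = 6) -> np.ndarray:
    """Causal over context, block-causal across blocks, bidirectional inside a block."""
    # Each context token gets its own group id, so comparing ids gives plain
    # causal attention. All tokens of block b share the id n_ctx + b.
    ctx_groups = np.arange(n_ctx)
    blk_groups = n_ctx + np.repeat(np.arange(n_blocks), block_size)
    groups = np.concatenate([ctx_groups, blk_groups])
    # Position i may attend to position j when j's group is not later than i's.
    return groups[None, :] <= groups[:, None]


if __name__ == "__main__":
    mask = block_causal_mask(n_ctx=2, n_blocks=2, block_size=3)
    print(mask.astype(int))
```

Tokens inside one block see each other in both directions, which is what lets all slots of a box be predicted in a single step, while later blocks remain hidden from earlier ones.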

Why are mask tokens inserted into `x_blk` instead of feeding the full block directly?

The mask tokens hide the future tokens of each block during training, forcing the model to infer them from the first token and the shared context. This teaches the decoder to predict an entire block in one step at inference time, rather than learning to copy the tokens verbatim.
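The conversion from the NTP token sequence to the masked block sequence can be sketched as follows. This is a minimal illustration with a hypothetical helper name; it assumes the input already contains only block tokens and that its length is a multiple of the block size.

```python
MASK = "[mask]"


def to_masked_blocks(tokens: list, block_size: int = 6) -> list:
    """Keep the first token of every block and replace the rest with [mask]."""
    if len(tokens) % block_size != 0:
        raise ValueError("token count must be a multiple of block_size")
    masked = []
    for start in range(0, len(tokens), block_size):
        masked.append(tokens[start])  # first token stays visible
        masked.extend([MASK] * (block_size - 1))  # remaining slots are hidden
    return masked


if __name__ == "__main__":
    x_ntp = ["<box>", "t100", "t200", "t300", "t400", "</box>"]
    print(to_masked_blocks(x_ntp))
```

The training targets for the masked slots are the original tokens at the same positions, so the loss on this stream rewards predicting the full block from its first token and the shared context.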

At runtime the model can operate in three modes: fully autoregressive (slow), fully parallel (fast), or a hybrid that defaults to parallel but falls back to autoregressive when the parallel output looks unreliable.

Fast mode predicts both blocks in parallel.

The model evaluates the second block’s confidence and spread, detecting a violation of the hybrid trigger thresholds.

The second block is discarded; the decoder reverts to NTP, generating the four coordinate tokens sequentially.

After NTP finishes the second block, the model resumes Fast mode for any subsequent blocks.

This walk‑through shows how the hybrid mode preserves speed for easy cases while automatically invoking the safe autoregressive path when the parallel prediction is dubious.
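The walk-through can be expressed as a short control loop. This is a sketch only: `predict_parallel`, `is_reliable`, and `decode_ntp` are hypothetical callables standing in for the model's parallel step, the reliability check, and the autoregressive re-decoding, none of which are specified here.

```python
from typing import Callable, List, Tuple

Block = List[str]


def hybrid_decode(
    n_blocks: int,
    predict_parallel: Callable[[int, List[Block]], Block],
    is_reliable: Callable[[Block], bool],
    decode_ntp: Callable[[int, List[Block]], Block],
) -> Tuple[List[Block], List[bool]]:
    """Decode blocks in parallel mode, re-decoding unreliable ones with NTP."""
    blocks: List[Block] = []
    used_fallback: List[bool] = []
    for i in range(n_blocks):
        candidate = predict_parallel(i, blocks)  # one parallel step
        ok = is_reliable(candidate)
        if not ok:
            # Discard the parallel output and re-decode this block token by token.
            candidate = decode_ntp(i, blocks)
        blocks.append(candidate)
        used_fallback.append(not ok)
    return blocks, used_fallback


if __name__ == "__main__":
    fast = lambda i, prev: ["bad"] if i == 1 else [f"fast{i}"]
    slow = lambda i, prev: [f"ntp{i}"]
    print(hybrid_decode(3, fast, lambda b: b != ["bad"], slow))
```

Only the block that fails the check pays the sequential cost; the loop returns to the parallel path for the next block.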

What exactly causes the hybrid mode to switch back to NTP for a block?

The switch is triggered when either (i) the highest‑probability coordinate token has a confidence below 0.7, or (ii) the spread between the largest and smallest coordinate values among the top‑5 candidate tokens exceeds 80 in the normalized [0, 1000] coordinate range. Either condition signals that the parallel prediction is unreliable, prompting a fallback to the more conservative NTP process.
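The two trigger conditions can be written as a single predicate. This is a hedged sketch: the function name and input format are hypothetical, and the second condition is read here as the spread of candidate coordinate values in the [0, 1000] range, which is an interpretation of the description rather than a confirmed detail.

```python
from typing import List, Tuple


def needs_ntp_fallback(
    candidates: List[Tuple[int, float]],
    min_confidence: float = 0.7,
    max_spread: int = 80,
) -> bool:
    """Decide whether one coordinate slot should be re-decoded with NTP.

    candidates: top-k (coordinate_value, probability) pairs for the slot.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    top_prob = max(prob for _, prob in candidates)
    values = [value for value, _ in candidates]
    spread = max(values) - min(values)
    # Fallback if the best candidate is weak or the candidates disagree spatially.
    return top_prob < min_confidence or spread > max_spread


if __name__ == "__main__":
    print(needs_ntp_fallback([(100, 0.9), (101, 0.05), (99, 0.03)]))
    print(needs_ntp_fallback([(100, 0.8), (300, 0.1)]))
```

A confident, tightly clustered slot passes the check, while a slot whose candidates sit far apart is flagged even when its top probability is high.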

**Figure 2.** Comparison of Token Decoding Methods. The NTP generates coordinate values one-by-one. The standard MTP method results in irregular distributions and non-coherent, unstructured patterns. Our proposed PBD generates a single atomic box (or point) unit in a parallel step, ensuring box-aligned and structured output.

**Figure 3.** Architecture and Block-Based Output Representation. LocateAnything formulates localization as generating a sequence of fixed-length, box-aligned atomic blocks. Four functional block types—Semantic, Box, Negative, and End blocks—are defined to jointly specify predicted entities or termination states.

**Figure 4.** Attention Mask for Joint NTP–MTP Training. The shared context and NTP stream use causal attention, the MTP blocks follow a block-causal pattern across blocks, and tokens within the same block share bidirectional attention. The two streams are isolated to prevent leakage while jointly attending to the shared context.

**Figure 5.** Corrected NTP Re-decoding. When parallel decoding encounters Format Irregularity or Spatial Ambiguity, the model discards the erroneous block and reverts to standard NTP to ensure robust predictions.

Empirical Evaluation and Ablations

LocateAnything delivers >10× faster throughput and higher detection accuracy.

LocateAnything achieves 12.7 boxes per second (BPS) throughput, over 10× faster than textual NTP baselines, while also improving detection accuracy.

Table 1 shows 12.7 BPS for LocateAnything‑3B versus 1.1 BPS for Qwen3‑VL and 5.0 BPS for Rex‑Omni; the same table reports a +3.8 % LVIS F1 gain and +1.8 % COCO F1 gain over Rex‑Omni.
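The headline speedups follow directly from the reported throughput figures. The short check below only restates the numbers quoted above; the dictionary and function names are illustrative.

```python
# Boxes per second as reported in the text above.
BPS = {"LocateAnything-3B": 12.7, "Rex-Omni": 5.0, "Qwen3-VL": 1.1}


def speedup(model: str, baseline: str) -> float:
    """Throughput ratio of model over baseline."""
    return BPS[model] / BPS[baseline]


if __name__ == "__main__":
    for baseline in ("Rex-Omni", "Qwen3-VL"):
        print(baseline, round(speedup("LocateAnything-3B", baseline), 2))
```

The ratios come out at roughly 2.5× over Rex-Omni and roughly 11.5× over Qwen3-VL, consistent with the "up to 2.5×" and "over 10×" claims.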

LocateAnything‑Data packs an object's class label and its four box coordinates into a single fixed‑length token block, allowing the model to emit whole boxes in parallel.

How does LocateAnything‑Data differ from traditional coordinate‑token generation?

Traditional NTP emits each coordinate as a separate token, forcing a step‑by‑step generation; LocateAnything‑Data emits the entire box as one block, so all four coordinates and the class are produced together, enabling parallel decoding.

**Table.** Comparison of object detection performance on LVIS and COCO datasets across different model categories, including Open-set Specialized Detectors, Closed-set Specialized Detectors, and Vision-Language Models.

**Table.** Comparison of various models on Dense200 and VisDrone datasets using F1@IoU metrics.

**Table 4.** Comparison of document understanding performance on DocLayNet, M6Doc, and TotalText datasets.

**Table 4.** Performance comparison on document layout grounding and OCR tasks.

**Table 1.** (a) Coordinate Representations, (b) MTP Formulations, and (c) Decoding Modes & Losses.

**Table 6.** Ablation Studies on the COCO dataset. We decouple the analysis into three aspects: (a) coordinate representation, (b) block-based MTP Formulation, and (c) effectiveness of our on-demand decoding modes and loss design. Throughput is measured in boxes per second. For brevity, we report the Average metric across IoU thresholds for Recall (R), Precision (P), and F1 Score. “B” indicates block size in MTP.
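For readers unfamiliar with the metrics in these tables, the sketch below shows the standard definitions of box IoU and of precision, recall, and F1 from match counts. It is a generic illustration with hypothetical function names; the exact matching protocol used to count true positives at each IoU threshold is not described here and is not reproduced.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```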

**Figure 7.** Ablation Study on Box Ordering and Decoding Speed. Left: Effect of different box sorting strategies on the F1-score. Right: Comparison of Generation Time (bars) and Throughput (lines) across varying numbers of predicted boxes for Textual, Quantized, and Parallel box decoding.

**Figure.** Examples of object detection with varying numbers of bounding boxes, categorized into 1-10, 10-20, and 20+ boxes.

**Figure 12.** Qualitative comparison on Referring Expression Comprehension (REC). Compared to Qwen3-VL and Rex-Omni, LocateAnything demonstrates superior compositional grounding capabilities. It accurately aligns nuanced, free-form human intents (e.g., spatial or attribute-based queries) with correct visual regions.

**Figure 13.** Qualitative comparison on Dense Object Detection (DOD). This figure illustrates performance in highly dense and heavily overlapping environments, such as stacked logs and abacus beads. While traditional token-by-token generation models (Qwen3-VL) and point-based models (Rex-Omni) suffer from severe omissions or spatial ambiguity (blurring boundaries between adjacent objects), LocateAnything maintains compact, well-separated, and highly accurate bounding boxes. This confirms the effectiveness of our block-level intra-attention and dense-aware Stage-2 training.

**Figure 14.** Qualitative comparison on Optical Character Recognition (OCR). For scene text (e.g., magazine covers) and structured documents (e.g., tables), LocateAnything yields tightly bounded boxes around text elements. The baseline models frequently exhibit format irregularities or merge distinct text blocks. Our parallel decoding, combined with the Hybrid Mode fallback for complex spatial layouts, ensures high-precision localization without sacrificing structural coherence.

Summary and Future Directions

LocateAnything unifies visual grounding with Parallel Box Decoding, delivering SOTA accuracy and real‑time speed.

We introduced LocateAnything, a framework that treats bounding boxes as atomic blocks through Parallel Box Decoding, aligning supervision with the inherently coupled spatial coordinates and delivering state‑of‑the‑art accuracy across a range of vision‑language tasks.

With 138 M text‑image training queries and an on‑demand inference mechanism, the system attains up to 2.5× speedup over prior approaches, offering a practical path to real‑time visual perception for latency‑sensitive embodied robotics and interactive agents.

Current limitations stem from reliance on supervised fine‑tuning; future work will explore reinforcement learning to refine block‑level decoding policies, reduce fallback frequency, and improve robustness and worst‑case speed.

We thank numerous collaborators and supporting teams for discussions, data, and infrastructure that made this work possible.

Training Configurations and Data Details

Extended training, data, and evaluation details for LocateAnything.

This appendix expands the training pipeline, inference settings, data construction, and extra experiments that support LocateAnything.

Base VLM training proceeds in two stages: Stage 1 learns visual concepts from caption datasets, and Stage 2 adds a broad multimodal mixture covering math, science, OCR, VQA, and counting.

LocateAnything fine‑tuning also uses two stages: Stage 3 trains on 138 M queries with full model unfreezing (LR = 4 × 10⁻⁵, seq len = 25 600), while Stage 4 emphasizes dense detection by raising the proportion of multi‑object data and lowering the LR to 1 × 10⁻⁵.

Stream Packing mitigates variable‑length waste by weighted sampling, best‑fit buffering, and big‑rocks‑first seeding, achieving over 95 % packing efficiency for a target token budget.
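The idea behind best-fit buffering with big-rocks-first ordering can be illustrated with a classic best-fit-decreasing packer. This is a generic sketch, not the Stream Packing implementation: it leaves out weighted sampling and streaming, and the function names are hypothetical.

```python
from typing import List


def pack_best_fit(lengths: List[int], budget: int) -> List[List[int]]:
    """Pack sequence lengths into bins of size `budget`, longest first."""
    remaining: List[int] = []  # free space left in each pack
    packs: List[List[int]] = []
    for n in sorted(lengths, reverse=True):  # big rocks first
        if n > budget:
            raise ValueError(f"sequence of length {n} exceeds budget {budget}")
        # Best fit: the pack with the least free space that still fits n.
        fits = [i for i, free in enumerate(remaining) if free >= n]
        if fits:
            best = min(fits, key=lambda i: remaining[i])
        else:
            remaining.append(budget)
            packs.append([])
            best = len(packs) - 1
        remaining[best] -= n
        packs[best].append(n)
    return packs


def packing_efficiency(packs: List[List[int]], budget: int) -> float:
    """Fraction of the total token budget filled with real tokens."""
    return sum(sum(p) for p in packs) / (len(packs) * budget)


if __name__ == "__main__":
    packs = pack_best_fit([7, 5, 3, 3, 2], budget=10)
    print(packs, packing_efficiency(packs, budget=10))
```

Placing the longest sequences first leaves the short ones to fill the gaps, which is why this ordering tends to waste little of the token budget.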

MagiAttention handles the heterogeneous attention masks produced by the NTP + MTP dual formulation, enabling efficient distributed computation on ultra‑long packed sequences.

Inference uses nucleus sampling (temperature 0.7, top‑p 0.9), a repetition penalty of 1.1, and a block size of 6 for MTP, generating up to 8 192 tokens per request in BF16 with batch 1.
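As a reminder of what the temperature and top-p settings do, the sketch below applies temperature scaling followed by nucleus filtering to a vector of logits. It is a generic NumPy illustration of the standard technique with a hypothetical function name, not the system's decoding code, and it omits the repetition penalty.

```python
import numpy as np


def nucleus_probs(logits, temperature: float = 0.7, top_p: float = 0.9) -> np.ndarray:
    """Return the renormalized distribution after temperature and top-p filtering."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()  # subtract the max for numerical stability
    probs = np.exp(z)
    probs /= probs.sum()
    order = np.argsort(-probs)  # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()


if __name__ == "__main__":
    logits = np.log([0.6, 0.3, 0.1])
    print(nucleus_probs(logits, temperature=1.0, top_p=0.8))
```

Sampling a token then amounts to drawing from the returned distribution, for example with `numpy.random.default_rng().choice`.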

After each MTP step the KV cache is truncated to retain only the committed prefix, discarding mask and anchor tokens to keep the cache consistent with the causal training prefix.

Data construction draws from open‑source detection and grounding sources, augments GUI labels via Qwen3‑VL to produce richer queries, and injects negative samples to curb hallucination.

GroundCUA annotations are enriched by rendering each bbox, feeding the crop and metadata to Qwen3‑VL, and extracting appearance, spatial, and functional descriptions.

The multi‑target grounding engine first uses Qwen3‑VL prompts on detection boxes, then Molmo to predict points that are kept if they fall inside the ground‑truth boxes; for unlabeled images it generates queries, predicts points, and converts them to boxes via SAM 3 or Rex‑Omni, followed by a final Qwen3‑VL verification.
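The point-filtering step for labeled images reduces to a point-in-box test. The sketch below is illustrative only: the function name is hypothetical, and treating points on the box boundary as inside is an assumption.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def points_inside_box(points: List[Point], box: Box) -> List[Point]:
    """Keep predicted points that fall inside the ground-truth box."""
    x1, y1, x2, y2 = box
    # Assumption: boundary points count as inside.
    return [(x, y) for x, y in points if x1 <= x <= x2 and y1 <= y <= y2]


if __name__ == "__main__":
    predicted = [(50, 60), (300, 20), (120, 80)]
    print(points_inside_box(predicted, box=(40, 40, 150, 100)))
```

Points that land outside the corresponding box are dropped, so only predictions that agree with the existing annotation are kept as supervision.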

Task‑specific prompts map each visual task to a fixed output format (box or point) and a natural‑language template, with placeholders for free‑form phrases and category lists.

The assembled LocateAnything‑Data contains over 139 M queries, 22 M negatives, and exhibits a long‑tailed distribution of target counts per query.

Additional experiments evaluate point‑level accuracy, decoding‑mode trade‑offs, and backbone independence.

LocateAnything‑3B in Hybrid mode attains state‑of‑the‑art F1@Point scores (e.g., 91.0 on RefCOCOg test, 87.6 on Dense200), surpassing larger baselines.

Fast mode reaches 15.3 BPS with modest accuracy loss, Slow mode yields the highest F1 but lowest throughput, and Hybrid adapts between them to achieve 12.7 BPS while preserving precision.

On the Qwen3‑VL‑4B backbone, adding Parallel Box Decoding improves F1 from 50.8 to 52.0 (Hybrid) and boosts BPS from 2.8 to 9.4, confirming that the speed–accuracy gains are backbone‑agnostic.

Slow mode provides the upper accuracy bound, Fast mode maximizes throughput, and Hybrid selectively falls back to autoregressive steps when spatial ambiguity is detected.

Throughput benchmarks (Boxes Per Second) are measured on COCO; image short sides are set to 840 px for COCO/LVIS and left unchanged for other datasets.

**Table 7.** Detailed configuration for each training stage of LocateAnything.

**Table 1.** Categorization of datasets used in the study.

**Figure 9.** Data engine for multi-targets grounding. Top: For detection datasets with gt boxes, we use each box category as a prompt to Qwen3-VL (Bai et al., 2025a) to synthesize detailed object-centric queries, including attributes, spatial relations, and reasoning cues. These queries are then fed to Molmo (Deitke et al., 2025) to predict candidate points, from which we retain the points falling inside the corresponding gt boxes as reliable supervision. Bottom: For a large collection of high-quality unlabeled images, Qwen3-VL directly generates diverse queries from the image. Such queries can be used to prompt Molmo for point prediction, followed by SAM 3 (Carion et al., 2025) to produce boxes, or directly prompt Rex-Omni (Jiang et al., 2025a) to generate boxes. All generated boxes are finally post-verified by Qwen3-VL.

**Table 1.** Task, Output, and Question Template mapping.

**Figure 10.** Distribution of the number of targets per query across different domains. The x-axis shows the number of targets associated with a query, while the y-axis (log scale) indicates the number of queries.

**Table 11.** Performance evaluation for the object pointing task across a diverse range of benchmarks (COCO, LVIS, Dense200, VisDrone, RefCOCOg, HumanRef). F1-scores are used as the primary metric. The results of the Hybrid Mode are reported here.

**Table 13.** Backbone generalization.

**Table 12.** Comprehensive performance of our Fast, Hybrid, and Slow configurations across multiple visual tasks. Throughput (measured in Boxes Per Second, BPS) is reported in the header for each mode. For general detection (COCO, LVIS), we report Average Precision (AP), Average Recall (AR), and F1@mIoU. For other tasks, we report the primary comprehensive metric (e.g., F1@mIoU, Avg Acc).
