Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Domino accelerates LLM inference by injecting causal dependencies into parallel draft blocks via a lightweight correction head.

How can we decouple the draft model's parallel token generation from the sequential causal dependencies required for high-quality speculative decoding?

Speculative decoding is bottlenecked by a trade-off: autoregressive drafters produce high-quality sequences but incur high sequential costs, while parallel drafters are efficient but lack the causal dependencies needed for high acceptance rates. Domino decouples these concerns by using a parallel backbone to generate preliminary draft distributions, then applying a lightweight head to inject causal information through a low-rank residual correction. This approach achieves up to 5.8× throughput speedup on Qwen-3 models, outperforming both autoregressive and existing parallel drafting baselines.

Paper Primer

The core mechanism is a two-stage drafting process. A parallel backbone produces base logits for an entire block in one pass, and a Domino head—comprising a Gated Recurrent Unit (GRU) causal encoder and a low-rank correction head—refines these logits by injecting prefix-dependent information without re-running the full language model head.

Domino significantly improves end-to-end speedup and acceptance length compared to parallel-only drafting.

On Qwen3-4B, Domino improves average speedup from 4.70× to 5.47× (greedy) and increases average acceptance length by 16.6% over the DFlash baseline. In high-concurrency serving environments, it reaches up to 5.8× throughput speedup.

To stabilize training, the authors employ a base-anchored curriculum that linearly anneals the loss weight from the base logits to the final corrected logits. This forces the parallel backbone to learn a strong foundation before the Domino head takes over the residual correction, preventing the backbone from collapsing.

Why is a low-rank correction head used instead of a full hidden-space update?

A hidden-space correction would require re-applying the full language model head after each causal update, which would reintroduce the expensive sequential computation the method aims to avoid. Logit-space correction allows the base computation to remain parallel.

Why does the paper use teacher forcing for the causal encoder instead of training-time testing (TTT)?

Teacher forcing uses ground-truth prefixes, which focuses the model on the regime where draft tokens are actually accepted. TTT uses self-generated prefixes that are often noisy and incorrect, which can degrade the causal representations learned by the encoder.

Domino demonstrates that causal dependency modeling does not require sequential execution; lightweight residual correction is sufficient to bridge the quality gap between parallel and autoregressive drafting.

Introduction

Speculative decoding speeds up LLM inference but is limited by a draft‑quality versus drafting‑cost trade‑off.

Large language models achieve state‑of‑the‑art results, yet their standard autoregressive decoding proceeds token by token, making inference latency memory‑bound and under‑utilizing modern GPUs.

Speculative decoding addresses this: instead of waiting for the target model to generate each token, a lightweight draft model proposes a short block of tokens, and the target model verifies the whole block in parallel.
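To make the draft-then-verify step concrete, here is a minimal sketch of greedy verification, assuming the target model's own token choice at each draft position is already available from the parallel verification pass. The function name `accepted_length` and the token IDs are hypothetical; sampling-based verification would use an acceptance test rather than exact matching.

```python
from typing import List

def accepted_length(draft: List[int], target_choice: List[int]) -> int:
    """Count the leading draft tokens that the target model agrees with.

    draft[i] is the i-th proposed token; target_choice[i] is the token the
    target model would emit at that position given the draft prefix.
    """
    n = 0
    for d, t in zip(draft, target_choice):
        if d != t:
            break  # first mismatch: this and all later draft tokens are discarded
        n += 1
    return n

draft_block = [11, 42, 7, 99]    # hypothetical draft token IDs
target_block = [11, 42, 8, 99]   # hypothetical target-model choices
print(accepted_length(draft_block, target_block))  # 2
```

Note that the fourth draft token matches the target but is still discarded, because it sits behind a rejected token.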

Generating a draft block of $k=16$ tokens sequentially would invoke the draft model 16 times, each call allocating a $1024\times d$ activation matrix (~16 MB) and a full‑vocabulary LM‑head projection.

Summed over the 16 calls, the temporary memory for the draft phase becomes $16\times16\text{ MB}=256\text{ MB}$, on top of a $4.2\text{ GB}$ attention buffer.

If the draft block were produced in parallel, only one activation matrix is needed, cutting the draft‑phase memory to $16\text{ MB}$ while keeping the same $4.2\text{ GB}$ attention cost.

This toy calculation shows why the draft‑phase memory and compute dominate the speed‑up budget: parallelizing the draft eliminates the linear $k$‑fold overhead.
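The arithmetic of the toy calculation can be reproduced in a few lines. This is only a restatement of the numbers above; the constant names are illustrative, and the attention buffer is left out because it is the same in both cases.

```python
ACTIVATION_MB = 16   # one 1024 x d activation matrix, as in the toy example
BLOCK_SIZE = 16      # k, the number of draft tokens per block

sequential_mb = BLOCK_SIZE * ACTIVATION_MB   # one activation per draft call
parallel_mb = ACTIVATION_MB                  # a single pass over the whole block

print(sequential_mb, parallel_mb, sequential_mb // parallel_mb)  # 256 16 16
```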

The bottleneck therefore is a trade‑off: improving draft quality (longer acceptance length) forces more expensive draft computation, while cheaper parallel drafts sacrifice intra‑block causal modeling and hurt acceptance length.

Domino resolves this tension by keeping draft generation fully parallel (the Parallel Draft Backbone) and then applying a lightweight Domino Head that injects the missing causal information without re‑running the full draft model.

**Figure 1.** Latency breakdown and performance comparison on Qwen3-8B under a 16-token speculative decoding budget. Left: per-step latency breakdown measured on an A100 GPU with context length 1024, where Verify denotes the target-model verification latency and Draft denotes the draft-model forward latency. LM Head, DHead, and Tree denote the output projection, Domino head, and tree construction/sampling overheads, respectively. Right: acceptance length and end-to-end speedup evaluated on GSM8K. All three draft models are trained on the same dataset.

The essential challenge is balancing draft quality against drafting cost.

Related Work

Speculative decoding trades draft quality for cost; Domino separates speed and causal correction.

Speculative decoding speeds up large‑language‑model inference by letting a small draft model propose tokens that a larger target model verifies in parallel. This section surveys how prior work has shaped that pipeline.

Introduced the basic two‑stage workflow: a lightweight draft model generates candidate tokens autoregressively, then the full‑size target model checks them in parallel.

Extended the basic pipeline with hierarchical verification trees and lighter draft architectures to cut both verification overhead and draft cost.

Generates drafts sequentially, letting each draft token condition on its predecessors, which better matches the target model’s autoregressive distribution.

Attaches multiple lightweight parallel decoding heads to a single draft model, each proposing a token for a different future position.

Enhances parallel head drafting by feeding each head a sequentially‑dependent context, re‑injecting causal information into otherwise independent heads.

Static vocabulary selection that restricts the draft model’s output space to a fixed subset of tokens, cutting projection cost.

Dynamic vocabulary selection that adapts the candidate token set per step based on the draft’s confidence.

Uses discrete diffusion models as parallel drafters, generating a block of tokens in a single forward pass.

Training‑free diffusion drafter that leverages pretrained diffusion language models to propose draft blocks.

Transforms an autoregressive model into a parallel draft model by predicting multiple future tokens in one forward pass.

Predicts token distributions for several future positions in parallel and prunes a draft tree to select viable candidates.

Block‑diffusion drafter that emits an entire draft block in a single forward pass, avoiding repeated calls to both the draft model and the full language‑model head.

DFlash treats a draft block as a single diffusion step, producing all tokens at once instead of iterating token‑by‑token.

EAGLE‑3 extends the autoregressive EAGLE line with training‑time test and fusion of features from multiple target‑model layers.

Speculative Decoding Foundations

Speculative decoding balances draft quality against drafting cost, creating a fundamental trade-off between sequential and parallel generation.

The efficiency of speculative decoding depends on the drafting strategy. Autoregressive drafters generate tokens sequentially, while parallel drafters generate them simultaneously.

Autoregressive drafters generate tokens one by one, conditioning each new token on all previously drafted ones, which yields high-quality drafts but incurs a sequential cost.

Parallel drafters predict the entire block of $\gamma$ tokens at once, eliminating the sequential bottleneck of autoregressive generation at the cost of weaker intra-block causal dependencies.

The Domino Architecture

Domino decouples draft generation and correction to speed speculative decoding.

Speculative decoding stalls because improving draft quality forces a slower, more expensive draft model, while a cheap draft hurts acceptance.

**Figure 3.** Overview of Domino. The parallel backbone produces hidden states for the whole draft block in one forward pass. The Domino head sequentially updates a causal state from previously sampled draft tokens and generates correction logits $c_i$, which refine the base logits $l_i$. Each draft token is sampled from the final logits $l_i + c_i$.

The Parallel Draft Backbone turns the draft‑generation bottleneck into a single parallel forward pass, so the whole block’s base distribution is ready before any token is sampled.

Embed the block: $[A,\text{[MASK]},\text{[MASK]}]\rightarrow [(0.1,0.0), (0,0), (0,0)]$.

Backbone processes the three embeddings together and outputs hidden states $H = [(0.2,0.3), (0.15,0.25), (0.12,0.22)]$.

Apply the frozen LM head (a $2\times V$ matrix) to each hidden state, yielding base logits $L^{\text{base}}_1$, $L^{\text{base}}_2$, $L^{\text{base}}_3$.

The three base logit vectors are ready before any token is sampled, so the only sequential work left for the next stage is the lightweight correction.
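The projection step in this walkthrough can be sketched with NumPy. The hidden states are the ones from the example; the vocabulary size `V = 4` and the random `W_lm` are stand-ins chosen for illustration, not values from the paper.

```python
import numpy as np

# Hidden states from the worked example: 3 block positions, hidden size 2.
H = np.array([[0.20, 0.30],
              [0.15, 0.25],
              [0.12, 0.22]])

V = 4                            # hypothetical toy vocabulary size
rng = np.random.default_rng(0)
W_lm = rng.normal(size=(2, V))   # stand-in for the frozen LM head (random values)

base_logits = H @ W_lm           # every position is projected in one call
print(base_logits.shape)         # (3, 4)
```

Each row of `base_logits` equals the projection of the corresponding hidden state on its own, so batching the positions changes the cost profile but not the result.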

The parallel backbone collapses an $O(B)$ sequence of autoregressive forward passes into a single forward pass over the block, cutting draft generation latency.

How does this differ from a standard autoregressive draft model?

A standard draft model would generate token 1, then feed it back to generate token 2, and so on, incurring $B$ separate forward passes. The Parallel Draft Backbone produces all $B$ hidden states in one pass, eliminating that sequential cost.
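The difference in call counts can be illustrated with toy stand-ins for a draft model. Both helper functions and the "next token is previous plus one" rule are hypothetical; only the number of forward passes is meant to carry over.

```python
def autoregressive_draft(first_token, block_size, step_fn):
    """Draft tokens one at a time; returns (tokens, forward_passes)."""
    tokens, prev, calls = [], first_token, 0
    for _ in range(block_size):
        prev = step_fn(prev)   # one draft-model forward pass per token
        calls += 1
        tokens.append(prev)
    return tokens, calls

def parallel_draft(first_token, block_size, block_fn):
    """Draft the whole block with one call; returns (tokens, forward_passes)."""
    return block_fn(first_token, block_size), 1

# Toy stand-ins for a draft model: the "next token" is just previous + 1.
step = lambda tok: tok + 1
block = lambda tok, n: [tok + i + 1 for i in range(n)]

print(autoregressive_draft(0, 16, step)[1])  # 16
print(parallel_draft(0, 16, block)[1])       # 1
```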

The head walks forward through the draft block, adding a cheap, low‑rank correction that injects causal information without re‑running the full LM head.

Concatenate $[H_i;S_{i-1}] = (0.2,0.1,0.0,0.3,0.05,0.0,0.1,0.0)$.

Project with $W_1$ (example values) to obtain a 2‑dim hidden $z = (0.12, -0.04)$.

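Apply SiLU, $\mathrm{SiLU}(z) = z \cdot \sigma(z)$ with $\sigma$ the logistic sigmoid: $(0.12, -0.04) \mapsto (0.064, -0.020)$ approximately (negative inputs are damped, not zeroed).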

Lift with $W_2$ to a 3‑dim correction $\Delta L_i = (0.03, -0.01, 0.02)$.

Add to base logit $L^{\text{base}}_i = (1.2, 0.8, 0.5)$ → final logit $L_i = (1.23, 0.79, 0.52)$.

The correction nudges the base distribution toward the true next token while costing only a few matrix‑vector products.
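The correction step walked through above can be sketched as a single function. This is a minimal sketch under stated assumptions: the causal state `s_prev` is taken as given (the GRU update that produces it is not shown), the toy sizes and random weights are hypothetical, and the paper's actual layer shapes may differ.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def low_rank_correction(h_i, s_prev, W1, W2, base_logits):
    """Add a low-rank residual to the base logits of one block position."""
    x = np.concatenate([h_i, s_prev])   # [H_i ; S_{i-1}]
    z = x @ W1                          # down-project to rank r
    delta = silu(z) @ W2                # lift to vocabulary size
    return base_logits + delta          # final logits

rng = np.random.default_rng(0)
d, r, V = 4, 2, 3                       # toy sizes (assumptions)
h_i, s_prev = rng.normal(size=d), rng.normal(size=d)
W1 = rng.normal(size=(2 * d, r)) * 0.1  # hypothetical weights
W2 = rng.normal(size=(r, V)) * 0.1
base = np.array([1.2, 0.8, 0.5])        # base logits from the example

print(low_rank_correction(h_i, s_prev, W1, W2, base))
```

The base logits enter only through the final addition, which is why they can be computed once for the whole block before the sequential loop starts.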

Why not apply the correction directly to the hidden state $H_i$ instead of the logits?

Correcting $H_i$ would require feeding the updated hidden vector through the full LM head again, re‑incurring the $O(V)$ projection for every position. Logit‑space correction reuses the already‑computed $L^{\text{base}}_i$ and adds a cheap residual, preserving the parallel speedup.
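A rough multiply-accumulate count shows the size of the gap per block position. All three dimensions below are hypothetical round numbers chosen for illustration, and the input to the down-projection is assumed to be the concatenation of two vectors of size `d`.

```python
d = 4096        # hidden size (hypothetical)
V = 150_000     # vocabulary size (hypothetical)
r = 256         # correction rank (hypothetical)
in_dim = 2 * d  # assumed size of the concatenated [H_i; S_{i-1}] input

full_head_macs = d * V               # re-running the LM head for one position
low_rank_macs = in_dim * r + r * V   # down-projection plus lift to the vocabulary

print(full_head_macs, low_rank_macs, round(full_head_macs / low_rank_macs, 1))
```

Under these assumed sizes the low-rank path is more than an order of magnitude cheaper, and the gap widens as the rank shrinks relative to the hidden size.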

During training we feed the GRU the ground‑truth draft prefixes so it learns clean causal representations aligned with the acceptance regime.

Does teacher forcing limit the model’s ability to recover from early mistakes at inference time?

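Not in the regime that matters. A draft token can only be accepted if every draft token before it was accepted, so once the prefix contains a wrong token, everything after it is discarded at verification regardless of how well it was predicted. Training on clean prefixes therefore targets exactly the cases that contribute to acceptance length.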

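We jointly supervise the parallel backbone and the final output with a weighted combination of a base‑logit loss and a final‑logit loss. The weight $\lambda_t$ on the base‑logit loss is annealed linearly over training, gradually shifting emphasis from the base logits to the corrected logits.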

What would happen if $\lambda_t$ were kept at 0 throughout training?

The model would only see the final‑logit loss, allowing the correction branch to absorb most of the predictive work while the parallel backbone receives no direct supervision, leading to a weak base distribution and loss of the intended speedup.
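One way to picture the base-anchored curriculum is as a convex combination of the two losses with a linearly decaying weight. The sketch below assumes that form and assumes the weight runs from 1 to 0 over the whole run; the exact endpoints and annealing window are not given here, and the function names are illustrative.

```python
def base_weight(step: int, total_steps: int) -> float:
    """Weight lambda_t on the base-logit loss, annealed linearly from 1 to 0 (assumed endpoints)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - frac

def curriculum_loss(loss_base: float, loss_final: float, step: int, total_steps: int) -> float:
    """lambda_t * loss_base + (1 - lambda_t) * loss_final (assumed form)."""
    lam = base_weight(step, total_steps)
    return lam * loss_base + (1.0 - lam) * loss_final

for step in (0, 500, 1000):
    print(step, base_weight(step, 1000), curriculum_loss(2.0, 1.0, step, 1000))
```

Early in training the backbone's own predictions carry the loss, and only later does the corrected output take over, which matches the motivation of building a strong base before the head learns its residual.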

For deployment we fuse the correction loop into a single Triton kernel and capture the whole rollout with CUDA Graphs, cutting kernel‑launch overhead and halving the Domino‑head latency.

Experimental Results

Domino’s speedup and acceptance length exceed those of prior speculative decoding baselines.

We evaluate Domino on two Qwen3 model sizes across math, code, and dialogue benchmarks, comparing against autoregressive and speculative baselines.

Domino attains a peak 7.92× speedup over vanilla autoregressive decoding on the GSM8K benchmark, its highest recorded factor.

Figure 2 shows Domino’s speedup exceeding that of all baselines.

**Figure 2.** Speedup comparison of Domino, DFlash, and EAGLE-3 relative to autoregressive decoding on Qwen3-8B using the Transformers backend.

Table 1 (low‑concurrency) and Table 2 (high‑concurrency) further confirm Domino’s superiority: it delivers the highest speedup and acceptance length across Qwen3‑4B/8B models, and scales to higher serving throughput under SGLang.

Ablation Studies

We isolate which components drive Domino’s gains through targeted ablations.

We run ablations on Qwen3‑8B with a fixed 16‑token drafting budget to see whether the training data, the training strategy, or the lightweight Domino head is responsible for the observed improvements.

Training‑data ablation keeps the architecture identical while varying only the source corpus.

EAGLE‑3 reaches an average acceptance length of 5.01 on GSM8K, but its throughput is limited to roughly 1.90× speedup.

Table 3 shows the 5.01 acceptance length and the corresponding low speedup for EAGLE‑3.

DFlash attains a shorter acceptance length of 3.90 on GSM8K while delivering a higher 2.84× speedup.

Table 3 reports the 3.90 acceptance length and the 2.84× speedup for DFlash.

Training‑strategy ablation isolates the causal correction branch, comparing training‑time testing (TTT), teacher forcing (TF), and the combined TF + Curriculum regime.

Switching from TTT to teacher forcing raises the average acceptance length from 3.80 to 3.96.

Figure 4 (right panel) reports the 3.80 → 3.96 improvement.

Adding the base‑anchored curriculum (TF + Curr) further lifts acceptance length to 4.19, a gain of +0.23 over plain TF.

Figure 4 (right panel) shows the 4.19 value for TF + Curr.

Finally, we ablate the Domino head itself by disabling the causal correction branch.

Enabling the Domino head improves average acceptance length by +0.70 (3.49 → 4.19) and speedup by +0.47× (2.84× → 3.31×).

Table 4 contrasts “w/o Domino Head” and “w/ Domino Head”.

**Figure 4.** Left: parallel backbone loss with and without the base-anchored curriculum. Right: average acceptance length under TTT, TF, and TF+Curr. TTT denotes training-time testing, TF denotes teacher forcing, and Curr denotes the base-anchored curriculum. The gray dashed line denotes the DFlash reference.

These ablations demonstrate that the lightweight Domino head, teacher‑forced training, and the base‑anchored curriculum are the primary sources of Domino’s superior speed‑accuracy trade‑off.

Training Details

The appendix provides training configurations and full ablation results for the Domino head.

We train the Domino draft module while keeping the target model frozen, using the regenerated PerfectBlend data. Input sequences are truncated to 3072 tokens and the draft block size is set to 16. All draft modules are trained for 3 epochs on eight NVIDIA A100‑SXM4‑80GB GPUs with a per‑GPU batch size of 2 (global batch size 16).

Optimization uses AdamW with a learning rate of $6 \times 10^{-4}$, zero weight decay, and gradient clipping at a maximum norm of 1.0. The learning‑rate schedule is cosine with a warmup ratio of 0.04, and training runs in bfloat16 precision with FSDP and gradient sharding.
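The learning-rate schedule described above can be sketched in plain Python. The peak rate and warmup ratio come from the text; the linear shape of the warmup, the decay floor of zero, and the step count in the usage loop are assumptions made for illustration.

```python
import math

def learning_rate(step: int, total_steps: int, peak_lr: float = 6e-4,
                  warmup_ratio: float = 0.04) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (floor is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total = 10_000   # hypothetical number of optimizer steps
for s in (0, 399, 400, 5_200, 10_000):
    print(s, f"{learning_rate(s, total):.2e}")
```

With a warmup ratio of 0.04, the first 4% of steps ramp up to the peak rate and the remaining 96% follow the cosine curve.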

Table 5 lists the baseline draft model checkpoints used for Qwen3‑4B and Qwen3‑8B (EAGLE‑3, DART, DFlash). Table 6 reports benchmark‑level results for the Domino head ablation, showing both the “without Domino Head” and “with Domino Head” configurations across seven benchmarks, with the corresponding acceptance lengths and speedup factors.
