
How far can we push large language model speed by reusing “free” GPU compute, without giving up autoregressive level output quality? NVIDIA researchers propose TiDAR, a sequence level hybrid language model that drafts tokens with diffusion and samples them autoregressively in a single forward pass. The main goal of this research is to reach autoregressive quality while significantly increasing throughput by exploiting free token slots on modern GPUs.

Systems motivation, free token slots and the quality problem
Autoregressive transformers decode one token per step. At realistic batch sizes, decoding is usually memory bound: latency is dominated by loading weights and the KV cache, not by floating point operations. As long as decoding stays in this memory bound region, processing more tokens per forward pass barely changes latency, since the same parameters and cache are reused.
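To make the memory bound intuition concrete, here is a rough roofline style estimate of one decoding step. All hardware and model numbers below are illustrative assumptions, not measurements from the paper:

```python
# Illustrative roofline-style estimate of one decoding step for a
# hypothetical 8B-parameter model on an H100-class GPU. The constants
# are rough assumptions for illustration, not measurements.

BYTES_PER_PARAM = 2          # BF16 weights
PARAMS = 8e9                 # 8B parameters
HBM_BW = 3.35e12             # ~3.35 TB/s memory bandwidth
PEAK_FLOPS = 1e15            # ~1 PFLOP/s dense BF16 compute

def step_time(num_tokens: int) -> float:
    """Estimated forward time: max of weight-loading time and compute time."""
    mem_time = PARAMS * BYTES_PER_PARAM / HBM_BW          # load weights once
    flop_time = 2 * PARAMS * num_tokens / PEAK_FLOPS      # ~2 FLOPs/param/token
    return max(mem_time, flop_time)

# While decoding is memory bound, extra token slots are nearly free:
t1, t16 = step_time(1), step_time(16)
print(f"{t16 / t1:.2f}x")  # ~1.00x: 16 tokens cost about the same as 1
```

Under these assumptions the step stays memory bound far past 16 tokens, which is exactly the regime the free token slots exploit.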
Masked diffusion language models already exploit this. Given a prefix, they can append multiple masked positions and predict several tokens in parallel in one denoising step. The research team calls these additional positions free token slots, because profiling shows that sending more tokens in this regime barely changes the forward time.
However, diffusion LLMs like Dream and LLaDA still underperform strong autoregressive baselines on quality. When these models decode multiple tokens in the same step, they sample each token independently from its marginal distribution given a noised context. This intra step token independence hurts sequence level coherence and factual correctness, and the best quality is usually obtained when decoding only 1 token per step. In practice, this removes much of the theoretical speed advantage of diffusion decoding.
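A toy example makes the independence problem concrete. The two-token continuations and probabilities below are invented for illustration, not taken from the paper:

```python
# Toy illustration of intra-step independence. Suppose two coherent
# continuations of a prompt are "City mayor" and "Times reporter".
# Each per-position marginal looks fine, but sampling the positions
# independently also assigns real probability to incoherent mixes.

p_pos1 = {"City": 0.5, "Times": 0.5}      # marginal for the first slot
p_pos2 = {"mayor": 0.5, "reporter": 0.5}  # marginal for the second slot

# Under independent sampling, every pair is equally likely:
joint = {(a, b): pa * pb
         for a, pa in p_pos1.items()
         for b, pb in p_pos2.items()}

print(joint[("City", "reporter")])  # 0.25: an incoherent pair gets real mass
```

Half of the probability mass lands on pairs that no coherent joint distribution would produce, which is why quality degrades as more tokens are decoded per step.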
TiDAR is designed to preserve the compute efficiency of diffusion while recovering autoregressive quality, using a single backbone and standard transformer infrastructure.
Architecture, dual mode backbone and attention masks
At a high level, TiDAR partitions the sequence at each generation step into three sections:
A prefix of accepted tokens.
Tokens drafted in the previous step.
Mask tokens that will hold pre drafted candidates for the next step.
The model applies a structured attention mask across this sequence. Prefix tokens attend causally, which supports chain factorized next token prediction, as in a standard autoregressive transformer. Tokens in the drafting region and mask region attend bidirectionally within a block, which enables diffusion style marginal predictions over many positions in parallel. This layout is a modification of the Block Diffusion mask, where only the decoding block is bidirectional and the rest of the sequence remains causal.
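The mask layout can be sketched in a few lines. This is a simplified reading of the structure described above, with toy block sizes, not the paper's exact implementation:

```python
import numpy as np

# Simplified sketch of a TiDAR-style structured attention mask.
# P = accepted prefix, D = drafted block, M = mask-token block (toy sizes).
P, D, M = 4, 2, 2
N = P + D + M
mask = np.zeros((N, N), dtype=bool)  # mask[i, j]: may position i attend to j?

# Prefix: standard causal attention.
mask[:P, :P] = np.tril(np.ones((P, P), dtype=bool))

# Drafted block: sees the whole prefix, bidirectional within itself.
mask[P:P + D, :P] = True
mask[P:P + D, P:P + D] = True

# Mask-token block: sees prefix and drafted block, bidirectional within itself.
mask[P + D:, :P + D] = True
mask[P + D:, P + D:] = True

print(mask.astype(int))
```

The causal upper-left region supports next token prediction, while the bidirectional blocks allow diffusion style parallel predictions, matching the Block Diffusion style layout described above.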


To enable both modes in the same backbone, TiDAR doubles the sequence length at training time. The original input occupies the causal section, and a corrupted copy occupies the diffusion section. In the causal section, labels are shifted by 1 token to match the next token prediction objective. In the diffusion section, labels are aligned with the input positions.
Crucially, TiDAR uses a full mask strategy. All tokens in the diffusion section are replaced by a special mask token, rather than sampling a sparse corruption pattern. This makes the diffusion loss dense, keeps the number of loss terms in diffusion and autoregressive parts equal to the sequence length, and simplifies balancing the two losses with a single weighting factor. The research team set this weighting factor to 1 in most experiments.
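The doubled training sequence can be sketched as follows. Token IDs and the sentinel values are invented for illustration; the label-ignoring convention follows common practice rather than the paper's exact code:

```python
# Sketch of the doubled training sequence. MASK = -1 is a stand-in for the
# special mask token, IGNORE = -100 marks positions without a loss term
# (a common convention, assumed here).
MASK, IGNORE = -1, -100
x = [11, 12, 13, 14]                  # original input sequence

causal_input = x
causal_labels = x[1:] + [IGNORE]      # next-token prediction: shift by 1

diffusion_input = [MASK] * len(x)     # full-mask strategy: every slot masked
diffusion_labels = x                  # labels aligned with input positions

model_input = causal_input + diffusion_input  # doubled sequence length
labels = causal_labels + diffusion_labels

# With the full mask, both objectives contribute about len(x) loss terms,
# so a single weighting factor (set to 1 in the paper) balances them.
lam = 1.0
print(model_input)
```

Note how the full mask makes the diffusion loss dense: every position in the diffusion section carries a loss term, mirroring the autoregressive section.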


Self speculative generation in one forward pass
Generation is formulated as a self speculative process that runs in a single network function evaluation per step.
Step 1, given the prompt, TiDAR encodes the prefix causally and performs one step diffusion over the mask positions, producing a block of drafted tokens.
Step 2 and later steps, each forward pass performs two operations at once:
Verification of drafted tokens using autoregressive logits over the extended prefix with a rejection sampling rule, similar in spirit to speculative decoding.
Pre drafting of the next block using diffusion, conditioned on all possible acceptance outcomes of the current step.
Accepted tokens are added to the prefix, and their KV cache entries are retained. Rejected tokens are discarded, and their cache entries are evicted. The drafting and verification share the same backbone and attention mask, so diffusion computation uses the free token slots in the same forward pass.
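The accept/reject logic of a single step can be sketched as below. For simplicity this uses greedy matching as a stand-in for the paper's rejection sampling rule, and `ar_next` is a hypothetical stand-in for the model's autoregressive prediction:

```python
# Minimal sketch of per-step draft verification, with greedy matching as a
# simplification of rejection sampling. `ar_next` is a hypothetical stand-in
# for the backbone's autoregressive head.
def verify(prefix, draft, ar_next):
    """Accept the longest draft prefix the AR head agrees with, then one AR token."""
    accepted = []
    for tok in draft:
        expected = ar_next(prefix + accepted)  # AR prediction over extended prefix
        if tok != expected:                    # mismatch: reject tok and the rest
            accepted.append(expected)          # emit the AR token instead
            return accepted
        accepted.append(tok)                   # match: keep the drafted token
    accepted.append(ar_next(prefix + accepted))  # bonus token after a full accept
    return accepted

# Toy "model": always predicts previous token + 1.
ar_next = lambda seq: seq[-1] + 1
print(verify([1, 2], [3, 4, 9], ar_next))  # [3, 4, 5]: 9 rejected, 5 emitted
```

In TiDAR, this verification and the diffusion pre-draft of the next block happen in the same forward pass, so even a fully rejected draft still yields one guaranteed autoregressive token per step.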
The model supports two sampling modes, trusting autoregressive predictions or trusting diffusion predictions, which control how strongly the final sample follows each head. Experiments show that for the 8B model, trusting diffusion predictions is often beneficial, especially on math benchmarks, while retaining autoregressive quality through rejection sampling.
On the systems side, the attention layout and number of tokens per step are fixed. TiDAR pre initialises a block attention mask and reuses slices of this mask across decoding steps using Flex Attention. The architecture supports exact KV cache, like Block Diffusion. The implementation never recomputes KV entries for accepted tokens and introduces no extra inference time hyperparameters.
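Because the token layout per step is fixed, the mask reuse pattern looks roughly like this. This is a toy sketch of the slicing idea, not the Flex Attention implementation itself:

```python
import numpy as np

# The per-step token layout is fixed, so one large mask can be built once
# and sliced per step instead of rebuilt (toy sketch of that reuse pattern).
MAX_LEN, STEP = 16, 4                                    # context limit, new tokens per step
big = np.tril(np.ones((MAX_LEN, MAX_LEN), dtype=bool))   # causal base, built once

def step_mask(prefix_len: int) -> np.ndarray:
    """Slice rows for this step's STEP new tokens against the visible context."""
    rows = slice(prefix_len, prefix_len + STEP)
    cols = slice(0, prefix_len + STEP)
    m = big[rows, cols].copy()
    m[:, prefix_len:] = True   # the new block is bidirectional within itself
    return m

m = step_mask(8)
print(m.shape)  # (4, 12): 4 query tokens over 12 visible positions
```

Avoiding per-step mask construction keeps the decode loop free of extra host-side work, which matters when each forward pass is only a few milliseconds.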
Training recipe and model sizes
TiDAR is instantiated by continual pretraining from Qwen2.5 1.5B and Qwen3 4B and 8B base models. The 1.5B variant is trained on 50B tokens with block sizes 4, 8 and 16. The 8B variant is trained on 150B tokens with block size 16. Both use maximum sequence length 4096, cosine learning rate schedule, distributed Adam, BF16, and a modified Megatron LM framework with Torchtitan on NVIDIA H100 GPUs.
Evaluation covers coding tasks (HumanEval, HumanEval Plus, MBPP, MBPP Plus), math tasks (GSM8K and Minerva Math), and factual and commonsense tasks (MMLU, ARC, Hellaswag, PIQA, and Winogrande), all implemented via lm_eval_harness.
Quality and throughput results
On generative coding and math tasks, TiDAR 1.5B is highly competitive with its autoregressive counterpart while generating an average of 7.45 tokens per model forward. TiDAR 8B incurs only minimal quality loss relative to Qwen3 8B while increasing generation efficiency to 8.25 tokens per forward pass.
On knowledge and reasoning benchmarks evaluated by likelihood, TiDAR 1.5B and 8B match the overall behaviour of comparable autoregressive models, because likelihood is computed with a pure causal mask. Diffusion baselines such as Dream, LLaDA and Block Diffusion require Monte Carlo based likelihood estimators, which are more expensive and less directly comparable.
In wall clock benchmarks on a single H100 GPU with batch size 1, TiDAR 1.5B reaches an average 4.71 times speedup in decoding throughput relative to Qwen2.5 1.5B, measured in tokens per second. TiDAR 8B reaches 5.91 times speedup over Qwen3 8B, again while maintaining comparable quality.
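Note that tokens per forward do not translate one to one into wall clock speedup, because each TiDAR forward processes more positions than a single autoregressive decode step and therefore costs somewhat more. The implied per-forward cost ratio below is inferred from the reported numbers, not reported in the paper:

```python
# Relating reported tokens-per-forward to reported wall-clock speedup.
# The implied cost ratio is a back-of-the-envelope inference, not a
# number stated in the paper.
def implied_forward_cost(tokens_per_forward: float, wallclock_speedup: float) -> float:
    """TiDAR forward time relative to one AR decode step, implied by the two metrics."""
    return tokens_per_forward / wallclock_speedup

print(round(implied_forward_cost(7.45, 4.71), 2))  # 1.5B model: ~1.58x per forward
print(round(implied_forward_cost(8.25, 5.91), 2))  # 8B model:   ~1.4x per forward
```

In other words, each TiDAR forward appears to cost roughly 1.4 to 1.6 autoregressive steps, which the 7 to 8 tokens it produces more than repay.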
Compared with diffusion LLMs, TiDAR consistently outperforms Dream and LLaDA in both efficiency and accuracy, under the constraint that diffusion models decode 1 token per forward pass for best quality. Compared with speculative frameworks such as EAGLE-3 and training matched Block Diffusion, TiDAR dominates the efficiency quality frontier by converting more tokens per forward into real tokens per second, thanks to the unified backbone and parallel drafting and verification.
Key Takeaways
TiDAR is a sequence level hybrid architecture that drafts tokens with diffusion and samples them autoregressively in a single model pass, using a structured attention mask that mixes causal and bidirectional regions.
The design explicitly exploits free token slots on GPUs, it appends diffusion drafted and masked tokens to the prefix so that many positions are processed in one forward pass with almost unchanged latency, improving compute density during decoding.
TiDAR implements self speculative generation, the same backbone both drafts candidate tokens with one step diffusion and verifies them with autoregressive logits and rejection sampling, which avoids the separate draft model overhead of classic speculative decoding.
Continual pretraining from Qwen2.5 1.5B and Qwen3 4B and 8B with a full mask diffusion objective allows TiDAR to reach autoregressive level quality on coding, math and knowledge benchmarks, while keeping exact likelihood evaluation through pure causal masking when needed.
In single GPU, batch size 1 settings, TiDAR delivers about 4.71 times more tokens per second for the 1.5B model and 5.91 times for the 8B model than their autoregressive baselines, while outperforming diffusion LLMs like Dream and LLaDA and closing the quality gap with strong autoregressive models.
Comparison
TiDAR is a useful step toward bridging autoregressive decoding and diffusion language models using one unified backbone. By exploiting free token slots and self speculative generation, it raises tokens per network function evaluation without degrading GSM8K, HumanEval, or MMLU performance relative to Qwen baselines. The full mask diffusion objective and exact KV cache support also make it practical for production style serving on H100 GPUs. Overall, TiDAR shows that diffusion drafting and autoregressive verification can coexist in a single efficient LLM architecture.
Check out the paper.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.

