TiDAR: Hybrid Diffusion & Autoregressive LLM for Faster Generation

[HPP] Yannic KilcherDecember 27, 202547 min

26 connections·38 entities in this video→

Capture as you watch

Save any video to veridive in one click.

The free veridive Chrome extension pulls the transcript from any YouTube video or podcast you're watching — ready to ask, cite, and connect.

TiDAR: Leveraging Unused GPU Capacity

💡 Autoregressive (AR) language models often underutilize GPU capacity during inference, leading to idle compute cycles due to being memory-bound.
🎯 Existing methods like speculative decoding or diffusion models face tradeoffs, either sacrificing quality or efficiency in their attempts to speed up generation.
🔑 TiDAR proposes a "free lunch" approach by smartly using this extra GPU capacity for parallel computation without the typical compromises.

Understanding Language Model Paradigms

🧠 Autoregressive models generate one token at a time, ensuring high quality by conditioning on all preceding tokens, but are inherently slow.
⚡ Diffusion language models generate multiple future tokens simultaneously, offering speed but often at the cost of quality due to sampling from independent marginal distributions.
⚠️ Causal attention masking in AR models, while enabling parallel training, restricts attention patterns during inference, limiting theoretical possibilities.

TiDAR's Hybrid Architecture

🚀 TiDAR operates as a sequence-level hybrid architecture, "thinking" in diffusion to draft tokens and "talking" autoregressively to sample final outputs.
🛠️ It performs parallel drafting and sampling within a single forward pass, using specially designed structured attention masks to check current drafts and pre-draft for future steps.
✅ The model computes multiple potential drafts for the next step, covering all possible acceptance outcomes from the current rejection sampling, ensuring a valid draft is always ready.

Ensuring Quality and Efficiency

📊 TiDAR employs rejection sampling to ensure the final output sequence is mathematically equivalent to what a pure autoregressive model would produce, thus closing the quality gap.
📈 The model is trained with a combined loss function (autoregressive and diffusion losses), allowing it to perform both tasks effectively within a single architecture.
🔥 TiDAR achieves 4.71x to 5.91x higher tokens per second compared to AR models, while maintaining comparable quality and outperforming speculative decoding and other diffusion variants in both speed and quality.

Ask, don't scrub

Discover the spoken web.

veridive answers questions with exact timestamps and citations — across every podcast, video, and article you've saved.

Knowledge graph38 entities · 26 connections

How they connect

An interactive map of every person, idea, and reference from this conversation. Hover to trace connections, click to explore.

Hover · drag to explore

38 entities

Chapters17 moments

Key Moments

Transcript171 segments

Full Transcript

Follow the thread

Find every place these ideas show up.

veridive maps the same people, claims, and topics across thousands of sources — so you can trace an idea from one conversation to the next.

Topics15 themes

What’s Discussed

TiDARDiffusion language modelsAutoregressive modelsGPU utilizationSpeculative decodingCausal attention maskingRejection samplingKV cacheParallel generationLanguage modelingNext token predictionLoss functionThroughputQuality gapStructured attention masks

Smart Objects38 · 26 links

Medias· 3

Concepts· 30

Company· 1

Products· 4

Hours of content, seconds to the answer.

Save what you listen to. Ask it anything. Watch the threads between sources surface on their own.

Get started free