TiDAR: Hybrid Diffusion & Autoregressive LLM for Faster Generation
[HPP] Yannic Kilcher · December 27, 2025 · 47 min
TiDAR: Leveraging Unused GPU Capacity
- Autoregressive (AR) language models are memory-bound during inference, so much of the GPU's compute capacity sits idle.
- Existing methods such as speculative decoding and diffusion models face a tradeoff: in trying to speed up generation they sacrifice either quality or efficiency.
- TiDAR proposes a "free lunch" approach that uses this spare GPU capacity for parallel computation without the usual compromises.
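The memory-bound claim above can be made concrete with a back-of-the-envelope estimate. The sketch below compares the time to stream the model weights from GPU memory with the time to do the arithmetic for one forward pass. All figures (8B parameters, 2 bytes per parameter, 2 TB/s bandwidth, 400 TFLOP/s peak) are hypothetical round numbers chosen for illustration, not values from the video, and the estimate ignores the KV cache and activations.

```python
def decode_step_times(n_params, bytes_per_param, mem_bandwidth, peak_flops,
                      tokens_per_pass=1):
    """Rough per-pass cost: streaming the weights vs. doing the math.

    Assumes every weight is read once per forward pass and that a forward
    pass costs about 2 FLOPs per parameter per token (a common rule of thumb).
    """
    t_memory = n_params * bytes_per_param / mem_bandwidth
    t_compute = 2 * n_params * tokens_per_pass / peak_flops
    return t_memory, t_compute


# Hypothetical hardware and model sizes, for illustration only.
t_mem, t_comp = decode_step_times(
    n_params=8e9, bytes_per_param=2, mem_bandwidth=2e12, peak_flops=4e14
)
print(f"memory: {t_mem * 1e3:.2f} ms, compute: {t_comp * 1e3:.3f} ms")
print(f"tokens per pass before compute catches up: {t_mem / t_comp:.0f}")
```

Under these assumed numbers, a single-token pass spends roughly 200 times longer moving weights than computing, which is the slack that processing extra token positions in the same pass can absorb.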
Understanding Language Model Paradigms
- Autoregressive models generate one token at a time. Conditioning each token on all preceding tokens gives high quality, but generation is inherently slow.
- Diffusion language models generate multiple future tokens simultaneously. This is fast, but quality often suffers because each token is sampled from an independent marginal distribution.
- Causal attention masking in AR models enables parallel training, but it also restricts the attention patterns available during inference, which limits what a single forward pass could otherwise do.
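Why sampling from independent marginals hurts quality can be seen in a two-token toy example. The distribution below is made up for illustration: the only valid continuations are "New York" and "Los Angeles". Sampling each position independently from its marginal puts probability on pairs the joint distribution never produces, while the autoregressive chain rule recovers the joint exactly.

```python
from collections import defaultdict
from itertools import product

# Hypothetical joint distribution over two-token continuations.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint):
    """Per-position marginal distributions of a two-token joint."""
    first, second = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return first, second


def independent_product(joint):
    """Parallel sampling: each position drawn independently from its marginal."""
    first, second = marginals(joint)
    return {(a, b): first[a] * second[b] for a, b in product(first, second)}


def autoregressive(joint):
    """Chain rule p(a) * p(b | a), which reproduces the joint."""
    first, _ = marginals(joint)
    return {(a, b): first[a] * (p / first[a]) for (a, b), p in joint.items()}


parallel = independent_product(joint)
invalid_mass = sum(p for pair, p in parallel.items() if pair not in joint)
print(parallel)
print("mass on pairs the joint never produces:", invalid_mass)
```

In this toy case half of the probability mass lands on "New Angeles" and "Los York", which is the kind of incoherence that token-by-token conditioning avoids.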
TiDAR's Hybrid Architecture
- TiDAR is a sequence-level hybrid architecture: it "thinks" in diffusion to draft tokens and "talks" autoregressively to sample the final output.
- It performs drafting and sampling in parallel within a single forward pass, using specially designed structured attention masks to check the current draft and pre-draft for future steps.
- The model computes multiple candidate drafts for the next step, one for each possible acceptance outcome of the current rejection-sampling step, so a valid draft is always ready.
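One way to picture the structured attention mask described above is sketched below. It is a simplified layout assumed for illustration, not the exact mask from the paper: the prefix and the draft tokens being checked use causal attention, and there is one pre-draft block per acceptance outcome. Block `k` assumes the first `k` draft tokens were accepted, so it sees the prefix plus those `k` tokens and attends bidirectionally within itself. The function name `hybrid_mask` and its parameters are hypothetical.

```python
def hybrid_mask(n_prefix, n_draft, block_len):
    """Build a boolean mask where mask[i][j] means position i may attend to j.

    Layout (an assumption of this sketch):
      [prefix | draft tokens | pre-draft block 0 | ... | pre-draft block n_draft]
    """
    n_causal = n_prefix + n_draft
    n_blocks = n_draft + 1  # outcomes: 0, 1, ..., n_draft tokens accepted
    total = n_causal + n_blocks * block_len
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < n_causal:
                # Prefix and draft tokens: ordinary causal attention.
                mask[i][j] = j <= i
            else:
                k = (i - n_causal) // block_len  # accepted-draft count for this block
                start = n_causal + k * block_len
                sees_context = j < n_prefix + k
                sees_own_block = start <= j < start + block_len
                mask[i][j] = sees_context or sees_own_block
    return mask


mask = hybrid_mask(n_prefix=2, n_draft=2, block_len=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because each pre-draft block is isolated from the others, whichever acceptance outcome occurs, the matching block already holds a draft conditioned on exactly the accepted tokens.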
Ensuring Quality and Efficiency
- TiDAR uses rejection sampling so that the final output sequence is mathematically equivalent to what a pure autoregressive model would produce, which closes the quality gap.
- The model is trained with a combined loss function (autoregressive plus diffusion losses), so a single architecture can perform both tasks effectively.
- TiDAR achieves 4.71x to 5.91x more tokens per second than AR models while maintaining comparable quality, and it outperforms speculative decoding and other diffusion variants in both speed and quality.
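The summary does not spell out the acceptance rule, so the sketch below uses the standard speculative-sampling rule as a stand-in: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p - q. This is the textbook construction that makes the accepted output follow the target distribution p. The function name `verify_token` and the example distributions are hypothetical.

```python
import random


def verify_token(draft_token, p_target, q_draft, rng):
    """Accept or replace one drafted token so the result is distributed as p_target.

    p_target: autoregressive (target) probabilities over the vocabulary.
    q_draft:  drafter probabilities; draft_token is assumed sampled from it,
              so q_draft[draft_token] > 0.
    Returns (token, accepted).
    """
    accept_prob = min(1.0, p_target[draft_token] / q_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # Rejected: resample from the leftover mass where the target exceeds the draft.
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


# Made-up three-token vocabulary: the drafter always proposes token 0.
rng = random.Random(0)
p_target = [0.5, 0.5, 0.0]
q_draft = [1.0, 0.0, 0.0]
counts = [0, 0, 0]
for _ in range(4000):
    token, _ = verify_token(0, p_target, q_draft, rng)
    counts[token] += 1
print([c / 4000 for c in counts])
```

Even though the drafter in this example is badly wrong, the empirical frequencies approach the target distribution. That is the sense in which a rejection step can keep output quality tied to the autoregressive model rather than to the drafter.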
Topics
TiDAR, Diffusion language models, Autoregressive models, GPU utilization, Speculative decoding, Causal attention masking, Rejection sampling, KV cache, Parallel generation, Language modeling, Next token prediction, Loss function, Throughput, Quality gap, Structured attention masks