ALiBi: Attention with Linear Biases for Transformer Extrapolation

[HPP] Ashish Vaswani · January 1, 2026 · 11 min

The Challenge of Long Sequence Training

  • 💡 Training models with longer sequences is computationally expensive and complex.
  • 🎯 The goal is to extrapolate to longer sequences at inference time, beyond what was seen during training.
  • ⚠️ Traditional position embeddings, such as sinusoidal and rotary methods, are insufficient for efficient extrapolation.

How ALiBi Works

  • 🧠 ALiBi (Attention with Linear Biases) is introduced as a simpler and more efficient positional method.
  • 🚫 It does not add positional embeddings to word embeddings directly.
  • 🔑 Instead, ALiBi biases query-key attention scores with a negative penalty that is proportional to their distance.
  • ✨ This bias is static and non-learned, introducing no additional parameters or runtime overhead.
  • 🚀 ALiBi incorporates an inductive bias towards recency, meaning closer keys have a greater impact on the query.
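
The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the attention is causal (each query sees only itself and earlier keys), and the per-head slopes use the geometric sequence starting at 2^(-8/n) for n heads as described in the ALiBi paper, which this summary does not state. The sketch also assumes the number of heads is a power of two.

```python
import numpy as np


def alibi_slopes(num_heads: int) -> np.ndarray:
    """Fixed, non-learned per-head slopes (assumes num_heads is a power of two).

    Geometric sequence whose first term and ratio are both 2^(-8/num_heads).
    """
    start = 2.0 ** (-8.0 / num_heads)
    return start ** np.arange(1, num_heads + 1)


def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return bias[h, i, j] = -slope[h] * (i - j) for keys j <= query i.

    Future keys (j > i) are set to -inf so the same tensor doubles as the causal mask.
    """
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]  # i - j, non-negative for past keys
    bias = -alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]
    return np.where(distance[None, :, :] >= 0, bias, -np.inf)


def causal_attention_with_alibi(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention with the ALiBi bias added to the scores.

    q, k, v have shape (num_heads, seq_len, head_dim). No positional embeddings
    are added to the inputs; position enters only through the bias.
    """
    num_heads, seq_len, head_dim = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores = scores + alibi_bias(num_heads, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the bias depends only on the distance between query and key, calling `alibi_bias` with a longer `seq_len` simply extends the same pattern: the block for the first positions is unchanged, which is what lets a trained model be run on longer inputs without new parameters.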

Performance and Extrapolation Benefits

  • 📈 ALiBi models demonstrate superior extrapolation capabilities for longer inference sequences.
  • 📊 They achieve lower perplexity compared to sinusoidal and rotary position embeddings when extrapolating beyond trained sequence lengths.
  • ⚡ A 1.3-billion-parameter ALiBi model trained on 1024 tokens achieved the same perplexity as a sinusoidal model trained on 2048 tokens, while training 11% faster and using 11% less memory.
  • ✅ ALiBi outperforms other strong positional methods, including the T5 bias method, which suffers from slow training and inference speeds.
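
Since the comparisons above are stated in terms of perplexity, a short sketch of how that metric is usually computed may help. It assumes the model's natural-log probabilities for each target token are already available; the function name is hypothetical and this is the standard definition rather than the evaluation code used in the talk.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.

    token_log_probs holds natural-log probabilities the model assigned to each
    target token. Lower perplexity means the model was less "surprised".
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every target token has a perplexity of about 4, which can be read as being as uncertain as a uniform choice among four options.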

Efficiency and Popularity

  • 🛠️ ALiBi introduces no additional parameters and requires only a negligible amount of extra memory (e.g., 100 MB for 1024-length sequences).
  • 🏆 ALiBi models are shown to be better than sinusoidal models even when trained on shorter sequences (fewer tokens per input), offering improved quality for extrapolation.
  • 🌐 The ALiBi method has become very popular and is now used in several large language models to encode positional signals.
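
A rough back-of-the-envelope sketch of where the extra memory could come from, assuming the bias is materialized once as a dense heads × length × length tensor and shared across layers. The head count and element size below are illustrative assumptions, not figures from the talk, and the helper name is hypothetical; real overhead depends on the implementation.

```python
def alibi_bias_bytes(num_heads: int, seq_len: int, bytes_per_element: int = 4) -> int:
    """Size in bytes of a dense (num_heads, seq_len, seq_len) bias tensor."""
    return num_heads * seq_len * seq_len * bytes_per_element


# Assumed configuration: 16 heads, 1024-token sequences, 32-bit floats.
size_mib = alibi_bias_bytes(num_heads=16, seq_len=1024) / 2**20
print(f"{size_mib:.0f} MiB")
```

For this assumed configuration the tensor takes 64 MiB, the same order of magnitude as the figure quoted above. Storing the bias in 16-bit floats would halve that.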

Topics

Attention with Linear Biases (ALiBi), Transformer models, Positional embeddings, Sequence extrapolation, Query-key attention, Perplexity, Sinusoidal embeddings, Rotary position embeddings, Inductive bias, Recency bias, Large language models, WikiText-103, Training efficiency, Memory usage, T5 bias method