SwiGLU: Why Modern LLMs Ditch GELU/ReLU
[HPP] Noam Shazeer · October 9, 2025 · 5 min
SwiGLU's Impact on LLMs
- Modern Large Language Models (LLMs) like PaLM and LLaMA have standardized on SwiGLU because it delivers lower loss and better perplexity than GELU at nearly identical computational cost.
- The advantage is not just theoretical: it shows up as measured improvements in model performance, with no loss of efficiency.
Limitations of Traditional Activations
- Activation functions are crucial components within the Feedforward Network (FFN) block of a Transformer, located between attention layers.
- Older activations like ReLU and GELU are fixed pointwise functions that apply the same nonlinearity to every input dimension independently, so they cannot selectively gate features or choose, based on the input, which information to pass through.
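To make "pointwise" concrete, here is a minimal sketch in plain Python. The helper name `apply_pointwise` and the sample vector are illustrative assumptions, not something taken from the video; GELU is written in its exact form using the standard normal CDF.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def apply_pointwise(fn, hidden):
    # Hypothetical helper: each output depends only on the matching input,
    # so no other projection of the token can modulate a channel.
    return [fn(h) for h in hidden]

hidden = [-2.0, -0.5, 0.0, 1.5]  # made-up pre-activation values
relu_out = apply_pointwise(relu, hidden)
gelu_out = apply_pointwise(gelu, hidden)
```

In both cases the same scalar function is reused for every channel, which is the property the gated variants relax.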
The Gating Mechanism of GLU
- The Gated Linear Unit (GLU) family, which includes SwiGLU, addresses these limitations by splitting the input into two paths: a gate and a value.
- By applying a function (like sigmoid or SiLU) to the gate and then performing an element-wise multiplication with the value, GLU allows the network to dynamically open or close channels for different features, enhancing expressivity.
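The gate/value split can be sketched in a few lines of dependency-free Python. The names `matvec`, `w_gate`, and `w_value`, and the tiny hand-picked weights, are assumptions chosen so that one channel ends up open and the other closed; a real layer would learn these weights.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def matvec(x, w):
    # x: list of d_in floats; w: d_in rows of d_out floats. Computes x @ w.
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def glu(x, w_gate, w_value):
    gate = [sigmoid(g) for g in matvec(x, w_gate)]  # each gate lies in (0, 1)
    value = matvec(x, w_value)                      # plain linear path
    return [g * v for g, v in zip(gate, value)]     # element-wise product

x = [1.0, -1.0]
w_gate = [[10.0, -10.0], [0.0, 0.0]]   # channel 0 gate near 1, channel 1 near 0
w_value = [[2.0, 3.0], [0.0, 0.0]]     # values before gating: [2.0, 3.0]
out = glu(x, w_gate, w_value)          # channel 0 passes, channel 1 is suppressed
```

Because the gate is computed from the input itself, a different `x` would open and close different channels.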
Understanding SwiGLU's Formula
- SwiGLU specifically uses the SiLU (Swish) function, defined as x * sigmoid(x), for its gating path, which provides smoother gradients and avoids the dead neurons that ReLU can produce.
- The core formula for SwiGLU involves SiLU(X * W1) multiplied element-wise by (X * W2), combining a smooth gating function with a linear value path.
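As a rough illustration of the formula, the sketch below defines SiLU, its derivative, and the gated product. The function names and the arguments `xw1` and `xw2` (standing in for the already-computed projections X * W1 and X * W2) are assumptions made for this example.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def silu(z: float) -> float:
    # SiLU (Swish): z * sigmoid(z)
    return z * sigmoid(z)

def silu_grad(z: float) -> float:
    # d/dz [z * sigmoid(z)] = s * (1 + z * (1 - s)), with s = sigmoid(z)
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

def relu_grad(z: float) -> float:
    return 1.0 if z > 0 else 0.0

def swiglu_gate(xw1, xw2):
    # xw1, xw2: hypothetical stand-ins for the projections X * W1 and X * W2.
    return [silu(a) * b for a, b in zip(xw1, xw2)]

gated = swiglu_gate([0.0, 1.0], [5.0, 2.0])
```

At a negative input such as -1.0, `silu_grad` is small but nonzero while `relu_grad` is exactly zero, which is the "dead neuron" contrast described above.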
Computational Efficiency with the 2/3 Width Trick
- To prevent a 50% increase in computational cost when adopting SwiGLU (which uses three projections instead of two), a clever "two-thirds width trick" is employed.
- The FFN width is set to 8/3 of the model dimension (instead of 4 times the model dimension, as with GELU), so that the total parameter count and FLOPs stay roughly constant despite the additional projection.
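The arithmetic behind the trick can be checked directly. This is a back-of-the-envelope sketch that counts weight matrices only and ignores biases; the function name `ffn_params` and the model dimension 768 (picked because it divides evenly by 3) are assumptions for illustration.

```python
def ffn_params(d_model: int, d_hidden: int, num_projections: int) -> int:
    # Weights only, biases ignored. Every projection is d_model x d_hidden
    # (or its transpose), so each contributes d_model * d_hidden weights.
    return num_projections * d_model * d_hidden

d_model = 768  # hypothetical size, divisible by 3

gelu_params = ffn_params(d_model, 4 * d_model, 2)         # up + down at 4x width
swiglu_naive = ffn_params(d_model, 4 * d_model, 3)        # three projections at 4x
swiglu_scaled = ffn_params(d_model, 8 * d_model // 3, 3)  # three projections at 8/3

print(gelu_params, swiglu_naive, swiglu_scaled)
```

With these numbers the 8/3 variant matches the GELU baseline exactly, while the naive 4x variant carries 1.5 times as many weights. When the model dimension is not divisible by 3, the hidden width has to be rounded, so the match is only approximate.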
Implementation and Best Practices
- Implementing SwiGLU involves an initial projection that maps to twice the GLU width, splitting the result into gate and value halves, applying SiLU to the gate, multiplying element-wise, and then projecting back down.
- Key considerations include avoiding the naive 4x width (stick to 8/3), applying dropout after gating, and carefully initializing the gate bias to prevent gate collapse and ensure effective gating.
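The steps above can be sketched as a forward pass in plain Python. This is an illustrative, dependency-free version under several assumptions: the class name `SwiGLUFFN`, the helpers, and the random Gaussian initialization are hypothetical, biases and dropout are left out, and a real model would use a tensor library instead of nested lists.

```python
import math
import random

def silu(z: float) -> float:
    # z * sigmoid(z), written as z / (1 + exp(-z)); fine for moderate |z|.
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    # x: list of d_in floats; w: d_in rows of d_out floats. Computes x @ w.
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def init_matrix(rng, rows, cols, scale):
    # Assumed initialization for the sketch: zero-mean Gaussian weights.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class SwiGLUFFN:
    def __init__(self, d_model: int, seed: int = 0):
        rng = random.Random(seed)
        self.d_hidden = 8 * d_model // 3  # 8/3 width instead of the naive 4x
        # One fused up-projection to twice the GLU width (gate half + value half).
        self.w_up = init_matrix(rng, d_model, 2 * self.d_hidden, d_model ** -0.5)
        self.w_down = init_matrix(rng, self.d_hidden, d_model, self.d_hidden ** -0.5)

    def forward(self, x):
        up = matvec(x, self.w_up)
        gate, value = up[:self.d_hidden], up[self.d_hidden:]  # split in halves
        hidden = [silu(g) * v for g, v in zip(gate, value)]   # SiLU gate * value
        # If dropout is used, it would be applied to `hidden` here, after gating.
        return matvec(hidden, self.w_down)                    # project back down

ffn = SwiGLUFFN(d_model=6)
y = ffn.forward([0.1, -0.2, 0.3, 0.0, 0.5, -0.4])
```

Fusing the gate and value projections into a single matrix is one common way to implement the "maps to twice the GLU width" step; keeping them as two separate matrices computes the same result.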