SwiGLU: Why Modern LLMs Ditch GELU/ReLU

[HPP] Noam Shazeer · October 9, 2025 · 5 min


SwiGLU's Impact on LLMs

🚀 Modern Large Language Models (LLMs) such as PaLM and LLaMA use SwiGLU in their feedforward blocks because it delivers lower loss and better perplexity than GELU at nearly identical computational cost.
💡 The gain is empirical, not just theoretical: model quality improves while efficiency is maintained.

Limitations of Traditional Activations

🧠 Activation functions sit inside the Feedforward Network (FFN) block of a Transformer, which follows the attention sublayer in each layer.
⚠️ Older activations like ReLU and GELU are fixed pointwise functions: every hidden unit passes through the same nonlinearity, which sees only that unit's own pre-activation value (ReLU is monotonic; GELU dips slightly below zero for negative inputs). They have no separate learned path for selectively gating features or dynamically choosing which information to pass through.

The Gating Mechanism of GLU

🔑 The Gated Linear Unit (GLU) family, which includes SwiGLU, addresses these limitations by projecting the input along two separate paths: a gate and a value.
✨ By applying a function (like sigmoid or SiLU) to the gate and then performing an element-wise multiplication with the value, GLU allows the network to dynamically open or close channels for different features, enhancing expressivity.
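The gate-and-value idea can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a sigmoid gate and two separate weight matrices; the names `glu`, `w_gate`, and `w_value` and the toy shapes are chosen here for the example and do not come from the video.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, w_gate, w_value):
    """Sigmoid-gated linear unit: sigmoid(x @ w_gate) * (x @ w_value)."""
    gate = sigmoid(x @ w_gate)   # values in (0, 1): how far each channel is open
    value = x @ w_value          # linear value path, no nonlinearity
    return gate * value          # element-wise product

# Toy shapes (illustrative assumptions): 2 tokens, model dimension 4, hidden width 6.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
w_gate = rng.normal(size=(4, 6))
w_value = rng.normal(size=(4, 6))

out = glu(x, w_gate, w_value)
print(out.shape)  # (2, 6)
```

Because the sigmoid gate lies strictly between 0 and 1, each output channel is a scaled-down copy of the corresponding value channel, and the scale depends on the input.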

Understanding SwiGLU's Formula

🔬 SwiGLU specifically uses the SiLU (Swish) function, defined as x · sigmoid(x), for its gating path. SiLU is smooth and has a nonzero gradient for negative inputs, which makes permanently inactive ("dead") units less likely than with ReLU.
📝 The core formula for SwiGLU is SiLU(X W1) multiplied element-wise by (X W2), where X W1 and X W2 are matrix products: a smooth gating function combined with a linear value path. A third projection then maps the result back to the model dimension.
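To make the gradient claim concrete, here is a small pure-Python sketch of SiLU and its derivative next to ReLU's. The helper names (`silu_grad`, `relu_grad`) are illustrative, and the derivative follows from the product rule applied to x · sigmoid(x).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """SiLU (Swish): x * sigmoid(x)."""
    return x * sigmoid(x)

def silu_grad(x):
    """d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def relu_grad(x):
    """ReLU passes no gradient for non-positive inputs."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, -1.0, 0.0, 1.0):
    print(x, silu(x), silu_grad(x), relu_grad(x))
```

At x = -1.0, for instance, the SiLU gradient is small but nonzero while the ReLU gradient is exactly zero, so a unit that lands in the negative region can still receive updates.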

Computational Efficiency with the 2/3 Width Trick

📊 At the same hidden width, SwiGLU's three projections (gate, value, and output) would cost about 50% more parameters and compute than the two projections of a standard FFN. A "two-thirds width trick" avoids that increase.
⚙️ The FFN width is set to 8/3 of the model dimension d (instead of 4 times for GELU), which is two-thirds of the usual width. Since 3 · d · (8/3)d = 2 · d · 4d = 8d², the total parameter count and FLOPs stay essentially constant despite the additional projection.
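The arithmetic behind the trick can be checked directly. The sketch below counts weight-matrix parameters only (biases ignored); `d_model = 4096` is an illustrative value chosen here, and the 8/3 width is rounded down to an integer, which is why the match is approximate rather than exact.

```python
def ffn_params(d_model, d_ff, num_projections):
    """Weight count for an FFN whose projections each hold d_model * d_ff weights."""
    return num_projections * d_model * d_ff

d_model = 4096                     # illustrative value, not from the source
gelu_width = 4 * d_model           # standard 4x FFN width
swiglu_width = 8 * d_model // 3    # 8/3 width, rounded down to an integer

baseline = ffn_params(d_model, gelu_width, 2)     # up + down projections
naive = ffn_params(d_model, gelu_width, 3)        # gate + value + down at 4x width
adjusted = ffn_params(d_model, swiglu_width, 3)   # gate + value + down at 8/3 width

print(naive / baseline)     # 1.5
print(adjusted / baseline)  # very close to 1.0
```

In practice the width is often rounded to a hardware-friendly multiple instead of the exact 8/3 value, so real parameter counts can differ slightly from this calculation.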

Implementation and Best Practices

🛠️ Implementing SwiGLU involves an initial projection that maps to twice the GLU width, splitting the result into gate and value halves, applying SiLU to the gate, multiplying element-wise, and then projecting back down.
✅ Key considerations include avoiding the naive 4x width (stick to 8/3), applying dropout after gating, and carefully initializing the gate bias to prevent gate collapse and ensure effective gating.
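The implementation steps above can be sketched in NumPy as follows. This is a minimal forward pass under stated assumptions: a single fused input matrix `w_in` of width twice the GLU width, no biases, and no dropout (so the dropout and gate-bias advice above is not modeled). All names and sizes are chosen for the example.

```python
import numpy as np

def silu(z):
    """SiLU (Swish): z * sigmoid(z), written as z / (1 + exp(-z))."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w_in, w_out):
    """x: (tokens, d_model); w_in: (d_model, 2 * d_ff); w_out: (d_ff, d_model)."""
    projected = x @ w_in                           # one matmul yields gate and value
    gate, value = np.split(projected, 2, axis=-1)  # two halves, each of width d_ff
    hidden = silu(gate) * value                    # element-wise gating
    return hidden @ w_out                          # project back down to d_model

d_model = 48                 # illustrative size
d_ff = 8 * d_model // 3      # 8/3 width rule gives 128 here

rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))
w_in = rng.normal(size=(d_model, 2 * d_ff)) / np.sqrt(d_model)
w_out = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)

y = swiglu_ffn(x, w_in, w_out)
print(y.shape)  # (5, 48)
```

Fusing the gate and value projections into one matrix is equivalent to using two separate matrices W1 and W2 (the left and right halves of `w_in`); it simply replaces two matrix multiplications with one larger one.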


What's Discussed

SwiGLU, Large Language Models (LLMs), GELU, ReLU, Transformer Architecture, Feedforward Network (FFN), Gated Linear Units (GLU), SiLU (Swish) function, Computational Cost, Model Perplexity, FLOPs, Dynamic Feature Selection, Gradients, Hyperparameters, Dropout
