Practical Remedies for Vanishing & Exploding Gradients

[HPP] Kaiming HeFebruary 9, 202632 min

23 connections·40 entities in this video→

Capture as you watch

Save any video to veridive in one click.

The free veridive Chrome extension pulls the transcript from any YouTube video or podcast you're watching — ready to ask, cite, and connect.

Understanding Gradient Instability

📉 Vanishing gradients occur when gradients shrink to zero as they propagate to earlier layers, causing these layers to learn very slowly or not at all.
💥 Exploding gradients happen when gradients grow exponentially large, leading to chaotic, destabilized training with loss spikes, huge parameters, and numerical overflow.
📊 Visualizations show that vanishing gradients cause norms to drop to zero in early layers, while exploding gradients result in an exponential increase in norms in earlier layers.

Core Concepts of Gradient Flow

🧠 Simply adding more layers to a neural network (the scaling hypothesis) does not guarantee better performance; gradient health is a more crucial factor for effective learning.
🔬 In both scalar and matrix-based models, the gradient magnitude scales with depth, determined by the product of layer-wise norms (weight matrix times Jacobian).
✅ The goal is to ensure that the norm of each layer's weight matrix times its Jacobian is approximately one, preventing gradients from collapsing or exploding.

Practical Remedies for Unstable Gradients

🔑 Principled initialization schemes like Xavier and He/Kaiming control the variance of weight matrices, ensuring forward activations and backward gradients remain in healthy ranges across layers.
⚖️ Normalization layers such as BatchNorm and LayerNorm re-center and rescale activations layer by layer, controlling their means and variances to enable more stable optimization and flexible learning rates.
🚀 Residual connections allow gradients to flow along an identity path, which is crucial for training very deep networks like ResNets by mitigating gradient degradation.

Diagnosing and Mitigating Issues

📈 Diagnosing unstable training involves observing training curve loss for smoothness, tracking per-layer gradient norms (zero for vanishing, large for exploding), and assessing learning rate sensitivity.
⚠️ The standard deviation of initial weights significantly impacts gradient norms; very small values lead to vanishing, very large values lead to exploding, while a moderate value (e.g., 0.1 in an example) can lead to stable norms.
🛠️ General strategies for stable training include using appropriate activation functions, good initializations, adding residual connections, considering normalization layers, and employing robust optimizers.

Ask, don't scrub

Discover the spoken web.

veridive answers questions with exact timestamps and citations — across every podcast, video, and article you've saved.

Knowledge graph40 entities · 23 connections

How they connect

An interactive map of every person, idea, and reference from this conversation. Hover to trace connections, click to explore.

Hover · drag to explore

40 entities

Chapters15 moments

Key Moments

Transcript118 segments

Full Transcript

Follow the thread

Find every place these ideas show up.

veridive maps the same people, claims, and topics across thousands of sources — so you can trace an idea from one conversation to the next.

Topics13 themes

What’s Discussed

Vanishing gradientsExploding gradientsNeural network trainingGradient flowWeight initializationBatch NormalizationLayer NormalizationResidual connectionsActivation functionsLearning ratesOptimizersJacobian matricesTraining stability

Smart Objects40 · 23 links

Concepts· 23

Locations· 2

Products· 6

Companies· 3

Medias· 6

Hours of content, seconds to the answer.

Save what you listen to. Ask it anything. Watch the threads between sources surface on their own.

Get started free