Practical Remedies for Vanishing & Exploding Gradients
[HPP] Kaiming HeFebruary 9, 202632 min
23 connectionsΒ·40 entities in this videoβUnderstanding Gradient Instability
- π Vanishing gradients occur when gradients shrink to zero as they propagate to earlier layers, causing these layers to learn very slowly or not at all.
- π₯ Exploding gradients happen when gradients grow exponentially large, leading to chaotic, destabilized training with loss spikes, huge parameters, and numerical overflow.
- π Visualizations show that vanishing gradients cause norms to drop to zero in early layers, while exploding gradients result in an exponential increase in norms in earlier layers.
Core Concepts of Gradient Flow
- π§ Simply adding more layers to a neural network (the scaling hypothesis) does not guarantee better performance; gradient health is a more crucial factor for effective learning.
- π¬ In both scalar and matrix-based models, the gradient magnitude scales with depth, determined by the product of layer-wise norms (weight matrix times Jacobian).
- β The goal is to ensure that the norm of each layer's weight matrix times its Jacobian is approximately one, preventing gradients from collapsing or exploding.
Practical Remedies for Unstable Gradients
- π Principled initialization schemes like Xavier and He/Kaiming control the variance of weight matrices, ensuring forward activations and backward gradients remain in healthy ranges across layers.
- βοΈ Normalization layers such as BatchNorm and LayerNorm re-center and rescale activations layer by layer, controlling their means and variances to enable more stable optimization and flexible learning rates.
- π Residual connections allow gradients to flow along an identity path, which is crucial for training very deep networks like ResNets by mitigating gradient degradation.
Diagnosing and Mitigating Issues
- π Diagnosing unstable training involves observing training curve loss for smoothness, tracking per-layer gradient norms (zero for vanishing, large for exploding), and assessing learning rate sensitivity.
- β οΈ The standard deviation of initial weights significantly impacts gradient norms; very small values lead to vanishing, very large values lead to exploding, while a moderate value (e.g., 0.1 in an example) can lead to stable norms.
- π οΈ General strategies for stable training include using appropriate activation functions, good initializations, adding residual connections, considering normalization layers, and employing robust optimizers.
Knowledge graph40 entities Β· 23 connections
How they connect
An interactive map of every person, idea, and reference from this conversation. Hover to trace connections, click to explore.
Hover Β· drag to explore
40 entities
Chapters15 moments
Key Moments
Transcript118 segments
Full Transcript
Topics13 themes
Whatβs Discussed
Vanishing gradientsExploding gradientsNeural network trainingGradient flowWeight initializationBatch NormalizationLayer NormalizationResidual connectionsActivation functionsLearning ratesOptimizersJacobian matricesTraining stability
Smart Objects40 Β· 23 links
ConceptsΒ· 23
LocationsΒ· 2
ProductsΒ· 6
CompaniesΒ· 3
MediasΒ· 6