Weight Initialization: Solving Vanishing & Exploding Gradients in Deep Learning

[HPP] Kaiming He · September 21, 2025 · 6 min

The Challenge of Deep Learning Training

  • ⚠️ Many deep learning models fail to learn effectively, even with correct architecture and clean data, often due to poor initial weight settings.
  • 📉 Training can stall completely (loss flatlines) or break down (loss becomes 'not a number') if weights are not initialized properly.
  • 🎯 Weight initialization is crucial, setting the starting point for millions of network parameters, akin to a hiker's starting position on a map.

Understanding Gradient Problems

  • 📉 Vanishing gradients occur when the error signal weakens with each layer, causing early layers to stop learning and model accuracy to stall.
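  • 💡 This issue is common with older activation functions like sigmoid and tanh, which saturate: for inputs of large magnitude their derivatives approach zero, so the backpropagated signal shrinks at every layer.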
  • 💥 Exploding gradients happen when the error signal amplifies excessively, leading to chaotic weight updates and the model wildly overshooting solutions.
  • 📈 Loss values can become NaN (Not a Number), indicating a complete breakdown of the mathematical process.
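
Both failure modes can be reproduced in a few lines. The following is a minimal sketch, not taken from the video: it assumes a stack of purely linear layers (no activation function) and uses NumPy, and the function name, depth, width, and weight scales are illustrative choices. It backpropagates a random error signal and records its norm after each layer.

```python
import numpy as np

def gradient_norm_through_depth(weight_std, depth=50, width=256, seed=0):
    """Backpropagate a random error signal through `depth` linear layers
    and return its norm after each layer (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    grad = rng.standard_normal(width)
    norms = []
    for _ in range(depth):
        # Fresh random weights for each layer, scaled by weight_std.
        W = rng.standard_normal((width, width)) * weight_std
        # Backward pass of y = W @ x: the gradient is multiplied by W^T.
        grad = W.T @ grad
        norms.append(float(np.linalg.norm(grad)))
    return norms

small = gradient_norm_through_depth(weight_std=0.01)                   # vanishes
large = gradient_norm_through_depth(weight_std=1.0)                    # explodes
balanced = gradient_norm_through_depth(weight_std=1.0 / np.sqrt(256))  # stays stable

print(f"std=0.01       final norm: {small[-1]:.3e}")
print(f"std=1.0        final norm: {large[-1]:.3e}")
print(f"std=1/sqrt(n)  final norm: {balanced[-1]:.3e}")
```

Each layer rescales the signal by roughly `weight_std * sqrt(width)`, so a factor below 1 shrinks it geometrically and a factor above 1 grows it geometrically. With lower-precision floats or deeper stacks, the growing case is what eventually overflows into `inf` and `NaN`.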

Principles of Effective Initialization

  • ✅ The goal of good initialization is to create a stable, predictable environment for information flow through the network.
  • 📊 It aims to preserve the statistical properties of the signal, such as its variance, keeping them roughly constant as the signal moves from layer to layer.
  • ⛰️ This ensures the model takes confident, steady steps towards the lowest point in the loss landscape, enabling efficient learning.
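
The variance-preservation idea can be made concrete with a short derivation. This is a sketch under simplifying assumptions that the summary does not state: a single linear layer with no activation, and weights and inputs that are independent, zero-mean, and identically distributed. The symbols n_in (number of inputs to the layer), w, x, and y are introduced here for illustration.

```latex
y_i = \sum_{j=1}^{n_{\text{in}}} w_{ij}\, x_j
\qquad\Longrightarrow\qquad
\operatorname{Var}(y_i) = n_{\text{in}} \, \operatorname{Var}(w) \, \operatorname{Var}(x)

\operatorname{Var}(y_i) = \operatorname{Var}(x)
\quad\Longleftrightarrow\quad
\operatorname{Var}(w) = \frac{1}{n_{\text{in}}}
```

Under these assumptions, a weight variance larger than 1/n_in amplifies the signal at every layer and a smaller one attenuates it, which is why the scale of the initial weights has to depend on the layer width.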

Key Initialization Strategies

  • 🔑 Xavier (Glorot) initialization is well suited to saturating activation functions such as tanh and sigmoid (of the two, only tanh is symmetric about zero). It sets the weight variance to 2 / (fan_in + fan_out).
  • 🚀 He (Kaiming) initialization was developed specifically for the popular ReLU activation function, accounting for its non-symmetric nature.
  • 💡 He initialization keeps the signal from dying out: ReLU zeros every negative pre-activation, roughly half of them at initialization, so He doubles the weight variance to 2 / fan_in to compensate.
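
The two strategies differ only in the variance they assign to the weights. Below is a minimal NumPy sketch of the normal-distribution variants, using the standard formulas Var(w) = 2 / (fan_in + fan_out) for Xavier and Var(w) = 2 / fan_in for He. The function names and layer sizes are illustrative assumptions, not something the video specifies.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Xavier (Glorot) normal init: Var(w) = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_out, fan_in)) * std

def he_normal(fan_in, fan_out, rng):
    """He (Kaiming) normal init: Var(w) = 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_out, fan_in)) * std

rng = np.random.default_rng(0)
W_xavier = xavier_normal(512, 256, rng)  # e.g. for a tanh layer
W_he = he_normal(512, 256, rng)          # e.g. for a ReLU layer

print("Xavier std:", W_xavier.std(), "target:", np.sqrt(2.0 / (512 + 256)))
print("He std:    ", W_he.std(), "target:", np.sqrt(2.0 / 512))
```

Deep learning frameworks ship equivalents of both; in PyTorch, for instance, they live in `torch.nn.init` as `xavier_normal_` and `kaiming_normal_`.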

Impact of Proper Initialization

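  • 📉 Poor initialization causes training to fail: all-zero weights make every unit in a layer compute the same output and receive the same gradient, so the units never differentiate, while random weights with a badly chosen scale lead directly to vanishing or exploding gradients.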
  • ✅ Using the correct initialization strategy (e.g., He for ReLU) ensures stable signal flow and allows the model to converge quickly and reliably.
  • ⚡ Proper weight initialization is a fundamental technique that can markedly speed up training, in some cases doubling it, and lead to a more accurate final model.
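
The effect of matching the initialization to the activation can be checked numerically. The sketch below is an illustration under assumed settings (a 30-layer, 512-unit fully connected ReLU stack, forward pass only, no training); the helper name and its parameter are hypothetical. It compares the He variance 2 / fan_in with the smaller variance 1 / fan_in that would suit a purely linear layer.

```python
import numpy as np

def relu_forward_signal(variance_numerator, depth=30, width=512, seed=0):
    """Push a random input through `depth` ReLU layers whose weights have
    variance `variance_numerator / width`; return the final mean square."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    std = np.sqrt(variance_numerator / width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * std
        x = np.maximum(W @ x, 0.0)  # ReLU zeros the negative half
    return float(np.mean(x ** 2))

he_signal = relu_forward_signal(2.0)     # He: Var(w) = 2 / fan_in
naive_signal = relu_forward_signal(1.0)  # Var(w) = 1 / fan_in, no ReLU correction

print(f"He init    mean square after 30 layers: {he_signal:.3e}")
print(f"1/fan_in   mean square after 30 layers: {naive_signal:.3e}")
```

With the He scale the signal keeps roughly its original magnitude, while the uncorrected scale loses about half of its mean square at every layer, leaving it around nine orders of magnitude smaller after 30 layers.
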