Gradient descent methods are crucial in optimization, but they can be slow or unstable. Momentum and adaptive learning rate techniques aim to fix these issues. They speed up convergence and handle different parameter scales more effectively.
These advanced methods build on basic gradient descent. Momentum helps damp oscillations and push through shallow local minima, while adaptive techniques like AdaGrad, RMSprop, and Adam adjust learning rates for each parameter. They're essential tools for tackling complex optimization problems in machine learning.
Momentum-based Techniques
Understanding Momentum in Optimization
- Momentum technique accelerates gradient descent by adding a fraction of the previous update vector to the current update
- Helps overcome small local minima and speeds up convergence in ravines
- Uses an exponentially weighted moving average of past gradients
- Momentum parameter $\gamma$ controls the contribution of past gradients, typically set between 0.9 and 0.99
- Update rule for momentum: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta_{t-1})$
- Parameter update: $\theta_t = \theta_{t-1} - v_t$ (a NumPy sketch follows this list)
- Reduces oscillations in directions with high curvature
- Momentum can overshoot the minimum in some cases, requiring careful tuning
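A minimal NumPy sketch of the momentum update above, run on a toy ill-conditioned quadratic; the quadratic, the helper name `momentum_step`, and the hyperparameter values are illustrative assumptions, not values prescribed by these notes.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """One momentum update: decay the previous velocity, add the scaled gradient, then step."""
    velocity = gamma * velocity + lr * grad        # v_t = gamma * v_{t-1} + eta * g_t
    return theta - velocity, velocity              # theta_t = theta_{t-1} - v_t

# Toy ill-conditioned quadratic f(theta) = 0.5 * theta^T A theta, with gradient A @ theta
A = np.diag([1.0, 100.0])                          # a narrow "ravine": curvatures differ by 100x
theta, velocity = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, A @ theta, lr=0.005, gamma=0.9)
print(theta)                                       # moves toward the minimum at [0, 0]
```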
Nesterov Accelerated Gradient (NAG)
- Variation of momentum that provides a correction factor to the standard momentum approach
- Calculates the gradient at the "looked-ahead" position instead of the current position
- Update rule: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta_{t-1} - \gamma v_{t-1})$
- Parameter update: $\theta_t = \theta_{t-1} - v_t$ (sketched in code after this list)
- Provides increased responsiveness by considering the approximate future position
- Often results in faster convergence and improved performance compared to standard momentum
- Particularly effective for problems with high curvature or multiple local minima
- Requires slightly more computation per iteration than standard momentum
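The same toy setup as in the momentum sketch, changed only so the gradient is evaluated at the look-ahead point; the function name and hyperparameters are again assumptions made for the example.

```python
import numpy as np

def nag_step(theta, velocity, grad_fn, lr=0.01, gamma=0.9):
    """Nesterov update: evaluate the gradient at the look-ahead point theta - gamma * v."""
    grad = grad_fn(theta - gamma * velocity)       # gradient at the approximate future position
    velocity = gamma * velocity + lr * grad
    return theta - velocity, velocity

A = np.diag([1.0, 100.0])                          # same toy quadratic as before
grad_fn = lambda th: A @ th
theta, velocity = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, velocity = nag_step(theta, velocity, grad_fn, lr=0.005, gamma=0.9)
print(theta)                                       # converges toward [0, 0]
```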
Adaptive Learning Rate Methods
AdaGrad: Adaptive Gradient Algorithm
- Adapts learning rates for each parameter individually based on historical gradients
- Accumulates squared gradients for each parameter: $G_t = G_{t-1} + g_t^2$ (element-wise)
- Update rule: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \, g_t$ (see the sketch after this list)
- $\epsilon$ serves as a small constant to avoid division by zero (typically $10^{-8}$)
- Effectively gives larger updates for infrequent parameters and smaller updates for frequent ones
- Works well for sparse data and non-stationary objectives
- Can lead to premature stopping of learning for some parameters, because the squared-gradient accumulator only grows and drives the effective learning rate toward zero
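A short sketch of the AdaGrad accumulator and update on the same toy quadratic; the learning rate and helper name are illustrative. Note how the two coordinates shrink at a similar pace even though their raw gradients differ by a factor of 100.

```python
import numpy as np

def adagrad_step(theta, accum, grad, lr=0.5, eps=1e-8):
    """AdaGrad: per-parameter step sizes scaled by accumulated squared gradients."""
    accum = accum + grad ** 2                      # G_t = G_{t-1} + g_t^2, element-wise
    theta = theta - lr * grad / np.sqrt(accum + eps)
    return theta, accum

A = np.diag([1.0, 100.0])                          # per-coordinate gradients differ by 100x
theta, accum = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, accum = adagrad_step(theta, accum, A @ theta)
print(theta)                                       # both coordinates decrease at a similar rate
```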
RMSprop: Root Mean Square Propagation
- Addresses the diminishing learning rates problem of AdaGrad
- Uses an exponentially weighted moving average of squared gradients
- Update rule for accumulator: $E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2$
- Parameter update: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$ (illustrated after this list)
- Decay factor $\rho$ typically set to 0.9
- Adapts learning rates based on recent gradients, allowing for continued learning
- Performs well in online and non-stationary settings
- Effectively deals with different scales of gradients across parameters
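A sketch of the RMSprop step, differing from the AdaGrad example only in how squared gradients are averaged; the decay factor and learning rate are illustrative choices.

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, lr=0.01, rho=0.9, eps=1e-8):
    """RMSprop: replace AdaGrad's growing sum with an exponential average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2  # E[g^2]_t
    theta = theta - lr * grad / np.sqrt(avg_sq + eps)
    return theta, avg_sq

A = np.diag([1.0, 100.0])
theta, avg_sq = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(300):
    theta, avg_sq = rmsprop_step(theta, avg_sq, A @ theta)
print(theta)                                       # ends up near [0, 0], within a small band set by lr
```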
Adam: Adaptive Moment Estimation
- Combines ideas from momentum and RMSprop
- Maintains both a moving average of gradients (first moment) and squared gradients (second moment)
- First moment update: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
- Second moment update: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
- Applies bias correction to moments: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$
- Parameter update: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$ (the full step is sketched after this list)
- Default values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
- Combines benefits of both momentum and adaptive learning rates
- Works well for a wide range of problems and is often considered a good default choice
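A sketch combining the moment estimates and bias correction above into one Adam step, reusing the toy quadratic; the demo loop passes a learning rate larger than the usual $10^{-3}$ default purely so the example converges in a few hundred iterations.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates plus bias correction (t is 1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad             # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction for the zero-initialized moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

A = np.diag([1.0, 100.0])                          # same toy quadratic as in the earlier sketches
theta, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    # lr=0.01 is larger than the common 1e-3 default, just to speed up this toy example
    theta, m, v = adam_step(theta, m, v, A @ theta, t, lr=0.01)
print(theta)                                       # both coordinates end up close to 0
```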
Learning Rate Scheduling
- Techniques to adjust the learning rate during training
- Step decay reduces learning rate by a factor after a fixed number of epochs
- Exponential decay continuously decreases learning rate: $\eta_t = \eta_0 \, e^{-kt}$ (example schedule functions follow this list)
- Cosine annealing varies learning rate cyclically following a cosine function
- Warm-up phase gradually increases learning rate at the beginning of training
- Learning rate schedules can be combined with adaptive methods for improved performance
- Cyclical learning rates alternate between increasing and decreasing learning rates
- Can help escape local minima and improve generalization
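Minimal sketches of the schedules mentioned above, written as plain functions of the epoch index; the decay constants, drop interval, and warm-up length are arbitrary example values, not recommendations.

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the base rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0, epoch, k=0.05):
    """eta_t = eta_0 * exp(-k * t)."""
    return lr0 * math.exp(-k * epoch)

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Follow a half cosine from lr0 down to lr_min over total_epochs."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

def linear_warmup(lr0, epoch, warmup_epochs=5):
    """Ramp linearly from near 0 up to lr0 during the first warmup_epochs epochs."""
    return lr0 * min(1.0, (epoch + 1) / warmup_epochs)

for epoch in [0, 5, 10, 20, 29]:
    print(epoch,
          round(step_decay(0.1, epoch), 4),
          round(exponential_decay(0.1, epoch), 4),
          round(cosine_annealing(0.1, epoch, total_epochs=30), 4),
          round(linear_warmup(0.1, epoch), 4))
```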
Gradient Control Strategies
Gradient Clipping Techniques
- Addresses the exploding gradient problem in deep neural networks
- Norm clipping scales down gradients when their L2 norm exceeds a threshold
- Gradient norm clipping formula: $g \leftarrow g \cdot \frac{\text{threshold}}{\lVert g \rVert_2}$ when $\lVert g \rVert_2 > \text{threshold}$ (see the code sketch after this list)
- Value clipping directly clips each gradient element to a specified range (element-wise clipping)
- Helps stabilize training, especially in recurrent neural networks
- Allows for larger learning rates without causing divergence
- Can be applied globally to all parameters or separately for each layer
- May introduce bias in the optimization process but often improves training stability
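A sketch of the two clipping variants, assuming gradients are held as a list of NumPy arrays; `clip_by_global_norm` and `clip_by_value` are illustrative helper names, not a specific library's API.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # no-op when the norm is already small
    return [g * scale for g in grads]

def clip_by_value(grads, clip_value):
    """Element-wise clipping of every gradient entry to [-clip_value, clip_value]."""
    return [np.clip(g, -clip_value, clip_value) for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]        # combined L2 norm is 13
print(clip_by_global_norm(grads, max_norm=5.0))         # all entries rescaled by 5/13
print(clip_by_value(grads, clip_value=2.0))             # each entry capped at +/- 2
```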
Second-Order Optimization Methods
- Utilize second derivatives (Hessian matrix) to improve optimization
- Newton's method uses the inverse Hessian: $\theta_{t+1} = \theta_t - H^{-1} \nabla_\theta J(\theta_t)$ (a one-step example follows this list)
- Quasi-Newton methods (BFGS, L-BFGS) approximate the inverse Hessian
- Conjugate gradient method iteratively improves search directions
- Natural gradient descent uses the Fisher information matrix instead of the Hessian
- Second-order methods often converge in fewer iterations than first-order methods
- Computationally expensive for high-dimensional problems
- Limited-memory variants (L-BFGS) reduce memory requirements for large-scale problems
- Can be combined with stochastic approaches for better scalability in machine learning
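A sketch of a single Newton step on a small quadratic, where one step lands exactly on the minimizer; solving the linear system $H d = g$ stands in for forming $H^{-1}$ explicitly, and the matrix and vector are illustrative.

```python
import numpy as np

def newton_step(theta, grad_fn, hess_fn):
    """Newton update: theta <- theta - H^{-1} g, computed by solving H d = g."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    return theta - np.linalg.solve(H, g)                # avoids forming the inverse explicitly

# Quadratic f(theta) = 0.5 theta^T A theta - b^T theta has gradient A theta - b and Hessian A,
# so its minimizer is A^{-1} b and a single Newton step from anywhere lands exactly on it.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
theta = newton_step(np.zeros(2), lambda th: A @ th - b, lambda th: A)
print(theta, np.linalg.solve(A, b))                     # the two vectors match
```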