Optimizers

Adadelta

Be the first to contribute!

Adagrad

Be the first to contribute!

Adam

Adaptive Moment Estimation (Adam) combines ideas from both RMSProp and Momentum. It computes adaptive learning rates for each parameter and works as follows.

  • First, it computes the exponentially weighted average of past gradients (\(v_{dW}\)).
  • Second, it computes the exponentially weighted average of the squares of past gradients (\(s_{dW}\)).
  • Third, these averages are biased towards zero because they are initialized at zero, so a bias correction is applied (\(v_{dW}^{corrected}\), \(s_{dW}^{corrected}\)).
  • Lastly, the parameters are updated using the information from the calculated averages.
\[\begin{split}v_{dW} = \beta_1 v_{dW} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W } \\ s_{dW} = \beta_2 s_{dW} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W })^2 \\ v^{corrected}_{dW} = \frac{v_{dW}}{1 - (\beta_1)^t} \\ s^{corrected}_{dW} = \frac{s_{dW}}{1 - (\beta_2)^t} \\ W = W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \varepsilon}\end{split}\]

Note

  • \(v_{dW}\) - the exponentially weighted average of past gradients
  • \(s_{dW}\) - the exponentially weighted average of past squares of gradients
  • \(\beta_1\) - hyperparameter to be tuned
  • \(\beta_2\) - hyperparameter to be tuned
  • \(\frac{\partial \mathcal{J} }{ \partial W }\) - cost gradient with respect to the current layer's weight tensor
  • \(W\) - the weight matrix (parameter to be updated)
  • \(\alpha\) - the learning rate
  • \(\epsilon\) - very small value to avoid dividing by zero
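
The update above can be written as a small NumPy routine. This is a minimal sketch, not a reference implementation: the name adam_update, the argument dW (the cost gradient with respect to W at step t, with t starting at 1), and the default hyperparameter values are illustrative assumptions.

import numpy as np

def adam_update(W, dW, v_dW, s_dW, t, lr=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Illustrative sketch; dW is the gradient of the cost w.r.t. W at step t (t >= 1)
    # Exponentially weighted average of past gradients
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    # Exponentially weighted average of past squared gradients
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2
    # Bias correction: both averages start at zero, so early estimates are scaled up
    v_corrected = v_dW / (1 - beta1 ** t)
    s_corrected = s_dW / (1 - beta2 ** t)
    # Parameter update
    W = W - lr * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v_dW, s_dW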

BFGS

Be the first to contribute!

Momentum

Used in conjunction with Stochastic Gradient Descent (SGD) or Mini-Batch Gradient Descent, Momentum takes past gradients into account to smooth out the update. This is captured by the variable \(v\), an exponentially weighted average of the gradients from previous steps. The result is fewer oscillations and faster convergence.

\[\begin{split}v_{dW} = \beta v_{dW} + (1 - \beta) \frac{\partial \mathcal{J} }{ \partial W } \\ W = W - \alpha v_{dW}\end{split}\]

Note

  • \(v\) - the exponentially weighted average of past gradients
  • \(\frac{\partial \mathcal{J} }{ \partial W }\) - cost gradient with respect to current layer weight tensor
  • \(W\) - weight tensor
  • \(\beta\) - hyperparameter to be tuned
  • \(\alpha\) - the learning rate
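
As a rough illustration of the two update equations above, a sketch in Python (the name momentum_update and the argument dW, holding the cost gradient for W, are assumptions for the example; W and dW can be NumPy arrays or scalars):

def momentum_update(W, dW, v_dW, lr=0.01, beta=0.9):
    # Exponentially weighted average of past gradients
    v_dW = beta * v_dW + (1 - beta) * dW
    # Update the weights with the smoothed gradient
    W = W - lr * v_dW
    return W, v_dW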

RMSProp

Another adaptive learning rate optimization algorithm, Root Mean Square Prop (RMSProp) works by keeping an exponentially weighted average of the squares of past gradients. RMSProp then divides the learning rate by the square root of this average, which damps oscillations and speeds up convergence.

\[\begin{split}s_{dW} = \beta s_{dW} + (1 - \beta) (\frac{\partial \mathcal{J} }{\partial W })^2 \\ W = W - \alpha \frac{\frac{\partial \mathcal{J} }{\partial W }}{\sqrt{s_{dW}} + \varepsilon}\end{split}\]

Note

  • \(s\) - the exponentially weighted average of past squares of gradients
  • \(\frac{\partial \mathcal{J} }{\partial W }\) - cost gradient with respect to current layer weight tensor
  • \(W\) - weight tensor
  • \(\beta\) - hyperparameter to be tuned
  • \(\alpha\) - the learning rate
  • \(\epsilon\) - very small value to avoid dividing by zero
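
A minimal sketch of the RMSProp update, assuming dW holds the cost gradient for W and s_dW carries the running average over from previous steps (names and defaults are illustrative):

import numpy as np

def rmsprop_update(W, dW, s_dW, lr=0.001, beta=0.9, epsilon=1e-8):
    # Exponentially weighted average of past squared gradients
    s_dW = beta * s_dW + (1 - beta) * dW ** 2
    # Divide the gradient by the root of the running average before stepping
    W = W - lr * dW / (np.sqrt(s_dW) + epsilon)
    return W, s_dW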

SGD

Stochastic Gradient Descent (SGD) updates the parameters using the gradient computed on a single randomly drawn example or, more commonly, a small mini-batch, rather than on the full training set. Each step is much cheaper at the cost of a noisier gradient estimate.

import numpy as np

def SGD(data, batch_size, lr):
    # data is a list of (x, y) training pairs; backprop(X, y, lr) is assumed
    # to compute gradients for one mini-batch and update the model parameters.
    N = len(data)
    np.random.shuffle(data)  # shuffle in place so mini-batches differ each epoch
    for i in range(0, N, batch_size):
        mini_batch = data[i:i + batch_size]
        X = np.array([x for x, _ in mini_batch])
        y = np.array([y for _, y in mini_batch])
        backprop(X, y, lr)
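
A hypothetical call, assuming training_data is a list of (x, y) pairs and backprop is defined elsewhere:

SGD(training_data, batch_size=32, lr=0.01)  # values are illustrative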
