Optimizers
Adadelta
Be the first to contribute!
Adagrad
Be the first to contribute!
Adam
Adaptive Moment Estimation (Adam) combines ideas from both RMSProp and Momentum. It computes adaptive learning rates for each parameter and works as follows.
- First, it computes the exponentially weighted average of past gradients (\(v_{dW}\)).
- Second, it computes the exponentially weighted average of the squares of past gradients (\(s_{dW}\)).
- Third, because both averages start at zero, they are biased towards zero during the first steps; a bias correction is applied to counteract this (\(v_{dW}^{corrected}\), \(s_{dW}^{corrected}\)).
- Lastly, the parameters are updated using the information from the calculated averages.
Note
- \(v_{dW}\) - the exponentially weighted average of past gradients
- \(s_{dW}\) - the exponentially weighted average of past squares of gradients
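- \(\beta_1\) - decay rate for the average of past gradients (hyperparameter to be tuned; 0.9 is a common default)
- \(\beta_2\) - decay rate for the average of past squared gradients (hyperparameter to be tuned; 0.999 is a common default)
- \(\frac{\partial \mathcal{J} }{ \partial W }\) - cost gradient with respect to the current layer's weights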
- \(W\) - the weight matrix (parameter to be updated)
- \(\alpha\) - the learning rate
- \(\epsilon\) - very small value to avoid dividing by zero
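The four steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `adam_update`, the step counter `t` (needed for the bias correction but not listed in the note), the default values, and the toy objective \(\mathcal{J}(W) = \sum W^2\) are all assumptions made for this example.

```python
import numpy as np

def adam_update(W, dW, v_dW, s_dW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for weights W with gradient dW; t is the 1-based step count."""
    # Step 1: exponentially weighted average of past gradients.
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    # Step 2: exponentially weighted average of past squared gradients.
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2
    # Step 3: bias correction, since both averages start at zero.
    v_corrected = v_dW / (1 - beta1 ** t)
    s_corrected = s_dW / (1 - beta2 ** t)
    # Step 4: parameter update scaled per element by the corrected averages.
    W = W - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return W, v_dW, s_dW

# Toy usage (assumed objective): minimize J(W) = sum(W**2), whose gradient is 2 * W.
W = np.array([1.0, -2.0])
v = np.zeros_like(W)
s = np.zeros_like(W)
for t in range(1, 2001):
    dW = 2 * W
    W, v, s = adam_update(W, dW, v, s, t, alpha=0.01)
```

On the very first step the corrected averages equal the gradient and its square, so each weight moves by roughly \(\alpha\) regardless of the gradient's magnitude.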
Conjugate Gradients
Be the first to contribute!
BFGS
Be the first to contribute!
Momentum
Used in conjunction with Stochastic Gradient Descent (SGD) or Mini-Batch Gradient Descent, Momentum takes past gradients into account to smooth out the update. It does this through the variable \(v\), an exponentially weighted average of the gradients from previous steps. The result is reduced oscillation and faster convergence.
Note
- \(v\) - the exponentially weighted average of past gradients
- \(\frac{\partial \mathcal{J} }{ \partial W }\) - cost gradient with respect to the current layer's weight tensor
- \(W\) - weight tensor
- \(\beta\) - decay rate of the average (hyperparameter to be tuned)
- \(\alpha\) - the learning rate
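The update described above can be sketched as follows, using the symbols from the note. The sketch assumes the exponentially-weighted-average form \(v = \beta v + (1 - \beta) \frac{\partial \mathcal{J}}{\partial W}\) followed by \(W = W - \alpha v\). The function name `momentum_update`, the default values, and the toy objective are assumptions made for illustration.

```python
import numpy as np

def momentum_update(W, dW, v, alpha=0.01, beta=0.9):
    """One Momentum step for weights W with gradient dW and running average v."""
    # Exponentially weighted average of past gradients.
    v = beta * v + (1 - beta) * dW
    # Step along the smoothed gradient instead of the raw one.
    W = W - alpha * v
    return W, v

# Toy usage (assumed objective): minimize J(W) = sum(W**2), whose gradient is 2 * W.
W = np.array([1.0, -2.0])
v = np.zeros_like(W)
for _ in range(500):
    dW = 2 * W
    W, v = momentum_update(W, dW, v, alpha=0.1)
```

Some libraries use the variant \(v = \beta v + \frac{\partial \mathcal{J}}{\partial W}\) without the \((1 - \beta)\) factor, which rescales the effective learning rate, so the two forms need different values of \(\alpha\) to behave alike.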
Nesterov Momentum
Be the first to contribute!
Newton’s Method
Be the first to contribute!
RMSProp
Another adaptive learning rate optimization algorithm, Root Mean Square Prop (RMSProp) keeps an exponentially weighted average of the squares of past gradients. RMSProp then divides the learning rate by the square root of this average, which damps updates along directions with consistently large gradients and speeds up convergence.
Note
- \(s\) - the exponentially weighted average of past squares of gradients
- \(\frac{\partial \mathcal{J} }{\partial W }\) - cost gradient with respect to the current layer's weight tensor
- \(W\) - weight tensor
- \(\beta\) - decay rate of the average (hyperparameter to be tuned)
- \(\alpha\) - the learning rate
- \(\epsilon\) - very small value to avoid dividing by zero
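A minimal sketch of the RMSProp update, using the symbols from the note. It assumes \(s = \beta s + (1 - \beta) \left(\frac{\partial \mathcal{J}}{\partial W}\right)^2\) and places \(\epsilon\) outside the square root, as is common for Adam. Some formulations put it inside instead. The function name `rmsprop_update`, the default values, and the toy objective are assumptions made for illustration.

```python
import numpy as np

def rmsprop_update(W, dW, s, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp step for weights W with gradient dW and running average s."""
    # Exponentially weighted average of past squared gradients.
    s = beta * s + (1 - beta) * dW ** 2
    # Scale each element's step by the root of its own average; eps avoids division by zero.
    W = W - alpha * dW / (np.sqrt(s) + eps)
    return W, s

# Toy usage (assumed objective): minimize J(W) = sum(W**2), whose gradient is 2 * W.
W = np.array([1.0, -2.0])
s = np.zeros_like(W)
for _ in range(1000):
    dW = 2 * W
    W, s = rmsprop_update(W, dW, s, alpha=0.01)
```

Because the step is normalized by the gradient's recent magnitude, the iterates typically hover within a distance on the order of \(\alpha\) of the minimum instead of converging exactly, unless the learning rate is decayed.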
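SGD
Stochastic Gradient Descent (SGD) updates the parameters using the gradient computed on a single mini-batch at a time rather than on the full dataset.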
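```python
import numpy as np

def SGD(data, batch_size, lr):
    # Assumes `data` is a list of (X, y) pairs and `backprop` is defined elsewhere.
    N = len(data)
    np.random.shuffle(data)  # shuffle in place before batching
    mini_batches = [data[i:i+batch_size]
                    for i in range(0, N, batch_size)]
    for batch in mini_batches:
        X, y = zip(*batch)  # regroup the pairs into inputs and targets
        backprop(X, y, lr)  # one gradient step on this mini-batch
```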
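Since `backprop` is not defined here, the following self-contained sketch shows the same shuffle, batch, and update loop on a concrete problem. It assumes a linear model trained with mean squared error. The function name `sgd_linear_regression`, the synthetic data, and all default values are hypothetical choices made for this example.

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size=16, lr=0.1, epochs=50, seed=0):
    """Fit y ~ X @ w + b with mini-batch SGD on mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # new sample order every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            error = X_batch @ w + b - y_batch
            # Gradients of the mean squared error on this mini-batch only.
            grad_w = 2 * X_batch.T @ error / len(idx)
            grad_b = 2 * error.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data generated from known weights (assumed for the demo).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + 0.5
w, b = sgd_linear_regression(X, y)
```

With noise-free data every mini-batch agrees on the same minimizer, so `w` and `b` approach the generating values. With noisy data the estimates would fluctuate around the least-squares solution.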