Deeplearning4j Updaters Explained
This page and the explanations that follow assume that readers know how Stochastic Gradient Descent works.
The main difference among the updaters described below is how they treat the learning rate.
Stochastic Gradient Descent
Theta (weights) is changed according to the gradient of the loss with respect to each theta.
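In the usual notation, that update can be written as below. This is the standard textbook form, and the subscript t (the iteration index) is our assumption about notation.

```latex
\theta_{t+1} = \theta_t - \alpha \, \frac{\partial L}{\partial \theta_t}
```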
alpha is the learning rate. If alpha is very small, convergence on an error minimum will be slow. If it is very large, each step overshoots, the model diverges from the error minimum, and learning stalls.
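To make the effect of alpha concrete, here is a minimal sketch in plain Python (not Deeplearning4j code). The toy loss L(theta) = theta², the function names and the chosen alpha values are all assumptions made for illustration.

```python
def sgd_step(theta, grad, alpha):
    """One plain SGD update: theta <- theta - alpha * grad."""
    return theta - alpha * grad


def run_sgd(alpha, steps=20, theta=1.0):
    # Toy loss L(theta) = theta**2, whose gradient is 2 * theta.
    for _ in range(steps):
        theta = sgd_step(theta, 2.0 * theta, alpha)
    return theta


slow = run_sgd(alpha=0.001)    # barely moves away from the start at 1.0
good = run_sgd(alpha=0.1)      # gets close to the minimum at 0
diverged = run_sgd(alpha=1.5)  # overshoots further on every step
```

With these values, the small alpha leaves theta near its starting point, the moderate alpha approaches zero, and the large alpha grows in magnitude at every step.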
Now, the gradient of the loss (L) changes quickly after each iteration due to variance among training examples. Look at the convergence path below. The updater takes small steps, but those steps zig-zag back and forth on their way to an error minimum.
To stop the zig-zagging, we use momentum. Momentum applies its knowledge from previous steps to where the updater should go. To represent it, we use a new hyperparameter μ, or “mu”.
We’ll use the concept of momentum again later. (Don’t confuse it with moment, of which more below.)
The image above represents SGD using momentum.
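A minimal sketch of SGD with momentum follows, assuming the classical formulation in which a velocity term accumulates past gradients. The name `momentum_step`, the toy loss L(theta) = theta² and the hyperparameter values are illustrative assumptions, not the Deeplearning4j implementation.

```python
def momentum_step(theta, velocity, grad, alpha, mu):
    """Classical momentum: the velocity keeps a decayed memory of past steps."""
    velocity = mu * velocity - alpha * grad  # blend previous step with new gradient
    theta = theta + velocity                 # move along the accumulated direction
    return theta, velocity


# Toy loss L(theta) = theta**2, gradient 2 * theta.
theta, velocity = 1.0, 0.0
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, 2.0 * theta, alpha=0.1, mu=0.9)
```

Setting mu to 0 reduces this to the plain SGD update, which is a handy sanity check.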
Adagrad scales alpha for each parameter according to the history of gradients (previous steps) for that parameter. That’s basically done by dividing the current gradient in the update rule by the square root of the sum of squared previous gradients. As a result, when the accumulated gradients for a parameter are very large, its effective alpha is reduced, and vice versa.
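Here is a small sketch of one Adagrad update for a single parameter, assuming the standard formulation that divides by the square root of the accumulated squared gradients. The name `adagrad_step` and the small constant `eps` (added to avoid division by zero) are our assumptions.

```python
import math


def adagrad_step(theta, grad_sq_sum, grad, alpha, eps=1e-8):
    """One Adagrad update for a single parameter."""
    grad_sq_sum = grad_sq_sum + grad * grad  # running sum of squared gradients
    theta = theta - alpha * grad / (math.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum


theta, acc = 1.0, 0.0
theta1, acc = adagrad_step(theta, acc, grad=4.0, alpha=0.1)
theta2, acc = adagrad_step(theta1, acc, grad=4.0, alpha=0.1)
first_step = theta - theta1    # about 0.1
second_step = theta1 - theta2  # smaller, because the accumulated history grew
```

Even with an identical gradient, the second step is shorter than the first, because the sum only ever grows.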
The only difference between RMSProp and Adagrad is that the g_t term is calculated as an exponentially decaying average, not a sum, of squared gradients.
g_t is called the second-order moment of δL. Additionally, a first-order moment m_t can also be introduced.
Adding momentum, as in the first case…
…and finally collecting a new theta as we did in the first example.
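The sketch below covers only the basic RMSProp step, with g_t as a decaying average of squared gradients. It leaves out the optional first-order moment and the momentum term. The name `rmsprop_step`, the default `decay` of 0.9 and the constant `eps` are assumptions for illustration.

```python
import math


def rmsprop_step(theta, g, grad, alpha, decay=0.9, eps=1e-8):
    """Basic RMSProp update for a single parameter."""
    g = decay * g + (1.0 - decay) * grad * grad  # decaying average of squared gradients
    theta = theta - alpha * grad / (math.sqrt(g) + eps)
    return theta, g


# Feed a constant gradient to see how g behaves over time.
theta, g = 0.0, 0.0
for _ in range(200):
    theta, g = rmsprop_step(theta, g, grad=4.0, alpha=0.01)
```

Unlike the Adagrad sum, g settles near the squared gradient (16 here) instead of growing without bound, so the effective learning rate does not shrink toward zero.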
AdaDelta also uses an exponentially decaying average of g_t, which was our second moment of gradient. But instead of the alpha we typically use as the learning rate, it introduces x_t, which is the second moment of the parameter updates.
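As a rough sketch, one AdaDelta update can be written as follows, assuming the standard formulation in which the ratio of two running averages takes the place of alpha. Here `g` is the decaying average of squared gradients and `x` is the decaying average of squared parameter updates. The name `adadelta_step` and the defaults for `rho` and `eps` are assumptions.

```python
import math


def adadelta_step(theta, g, x, grad, rho=0.95, eps=1e-6):
    """One AdaDelta update for a single parameter; note there is no alpha."""
    g = rho * g + (1.0 - rho) * grad * grad                   # second moment of gradients
    update = -math.sqrt(x + eps) / math.sqrt(g + eps) * grad  # step scaled by past updates
    x = rho * x + (1.0 - rho) * update * update               # second moment of updates
    theta = theta + update
    return theta, g, x


# Toy loss L(theta) = theta**2, gradient 2 * theta.
theta, g, x = 1.0, 0.0, 0.0
for _ in range(10):
    theta, g, x = adadelta_step(theta, g, x, 2.0 * theta)
```

The first steps are small because `x` starts at zero, so only `eps` sets the initial scale.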
ADAM uses both the first-order moment m_t and the second-order moment g_t, but they both decay over time. Step size is approximately ±α. Step size will decrease as we approach the error minimum.
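A minimal sketch of one ADAM update is shown below. It assumes the standard formulation, including bias-correction terms that the text above does not mention. The name `adam_step`, the default values of `beta1`, `beta2` and `eps`, and the 1-based step counter `t` are assumptions.

```python
import math


def adam_step(theta, m, g, t, grad, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a single parameter; t counts steps starting at 1."""
    m = beta1 * m + (1.0 - beta1) * grad         # decaying first-order moment
    g = beta2 * g + (1.0 - beta2) * grad * grad  # decaying second-order moment
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for zero initialisation
    g_hat = g / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(g_hat) + eps)
    return theta, m, g


# The first step has magnitude close to alpha regardless of the gradient's scale.
big, _, _ = adam_step(1.0, 0.0, 0.0, t=1, grad=5.0)
small, _, _ = adam_step(1.0, 0.0, 0.0, t=1, grad=-0.01)
```

Both calls move theta by roughly alpha, in opposite directions, which illustrates the "approximately ±α" step size.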