Deeplearning4j Updaters Explained

This page and the explanations that follow assume that readers know how Stochastic Gradient Descent works.

The main difference among the updaters described below is how they treat the learning rate.

Stochastic Gradient Descent

θ_{t+1} = θ_t − α · ∂L/∂θ_t

Theta (the weights) is updated by stepping against the gradient of the loss L with respect to each theta.

alpha is the learning rate. If alpha is very small, convergence on an error minimum will be slow. If it is very large, the model will diverge away from the error minimum, and learning will cease.

Now, the gradient of the loss (L) changes quickly after each iteration due to variance among training examples. Look at the convergence path below. The updater takes small steps, but those steps zig-zag back and forth on their way to an error minimum.

[Figure: the zig-zag convergence path of plain SGD toward an error minimum]
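
Below is a minimal sketch of the update rule above, applied to a toy one-dimensional quadratic loss L(θ) = 0.5·θ², whose gradient is simply θ. The toy loss and the hyperparameter values are illustrative assumptions, not Deeplearning4j defaults; the same loop is reused in the sketches that follow so the updaters can be compared.

```java
// Plain SGD on a toy 1-D loss L(theta) = 0.5 * theta^2, so dL/dtheta = theta.
// The loss and alpha = 0.1 are illustrative choices for this sketch.
public class SgdSketch {
    public static void main(String[] args) {
        double theta = 5.0;   // initial weight
        double alpha = 0.1;   // learning rate
        for (int t = 0; t < 100; t++) {
            double grad = theta;        // dL/dtheta
            theta -= alpha * grad;      // theta_{t+1} = theta_t - alpha * grad
        }
        System.out.println("theta after 100 steps: " + theta);
    }
}
```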

Momentum

To damp the zig-zagging, we use momentum. Momentum uses the direction of previous steps to influence where the updater goes next. To represent it, we introduce a new hyperparameter μ, or “mu”, together with a velocity term v_t that accumulates the previous steps.

v_t = μ · v_{t−1} − α · ∂L/∂θ_t
θ_{t+1} = θ_t + v_t

We’ll use the concept of momentum again later. (Don’t confuse it with moment, of which more below.)

[Figure: the smoother convergence path of SGD with momentum]

The image above represents SGD using momentum.
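
Here is the same toy loop with the velocity term v_t added; μ = 0.9 and α = 0.1 are illustrative values.

```java
// SGD with momentum on the toy loss L(theta) = 0.5 * theta^2.
// mu = 0.9 and alpha = 0.1 are illustrative values, not DL4J defaults.
public class MomentumSketch {
    public static void main(String[] args) {
        double theta = 5.0, alpha = 0.1, mu = 0.9;
        double v = 0.0;                  // velocity accumulated from previous steps
        for (int t = 0; t < 100; t++) {
            double grad = theta;         // dL/dtheta
            v = mu * v - alpha * grad;   // v_t = mu * v_{t-1} - alpha * grad
            theta += v;                  // theta_{t+1} = theta_t + v_t
        }
        System.out.println("theta after 100 steps: " + theta);
    }
}
```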

Adagrad

Adagrad scales alpha for each parameter according to the history of gradients (previous steps) for that parameter. Concretely, the learning rate in the update rule is divided by the square root of g_t, the running sum of squared past gradients (a small ε avoids division by zero). As a result, parameters that have accumulated large gradients get a smaller effective alpha, and vice versa.

g_t = g_{t−1} + (∂L/∂θ_t)²
θ_{t+1} = θ_t − (α / √(g_t + ε)) · ∂L/∂θ_t
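
A sketch of this rule on the same toy loss; because g_t only grows, the effective learning rate only shrinks. α = 0.5 and ε = 1e-8 are illustrative values.

```java
// Adagrad on the toy loss: alpha is divided by the square root of the
// running sum of squared gradients g_t. alpha and eps are illustrative values.
public class AdagradSketch {
    public static void main(String[] args) {
        double theta = 5.0, alpha = 0.5, eps = 1e-8;
        double g = 0.0;                  // running sum of squared past gradients
        for (int t = 0; t < 200; t++) {
            double grad = theta;         // dL/dtheta
            g += grad * grad;            // g_t = g_{t-1} + grad^2
            theta -= alpha / Math.sqrt(g + eps) * grad;
        }
        System.out.println("theta after 200 steps: " + theta);
    }
}
```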

RMSProp

The only difference between RMSProp and Adagrad is that the g_t term is computed as an exponentially decaying average of the squared gradients rather than their cumulative sum; ν is the decay rate.

g_t = ν · g_{t−1} + (1 − ν) · (∂L/∂θ_t)²
θ_{t+1} = θ_t − (α / √(g_t + ε)) · ∂L/∂θ_t

Here g_t is called the second-order moment of the gradient ∂L/∂θ. Additionally, a first-order moment m_t can also be introduced.

m_t = ν · m_{t−1} + (1 − ν) · ∂L/∂θ_t

Adding momentum, as in the first case…

v_t = μ · v_{t−1} − (α / √(g_t − m_t² + ε)) · ∂L/∂θ_t

…and finally collecting a new theta as we did in the first example.

θ_{t+1} = θ_t + v_t
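
Putting these pieces together, here is a sketch of this RMSProp variant (scaling the step by √(g_t − m_t² + ε), as in the equations above) on the same toy loss; ν, μ, α and ε are illustrative values.

```java
// RMSProp with a first-order moment and momentum on the toy loss.
// nu, mu, alpha and eps are illustrative values for this sketch.
public class RmsPropMomentumSketch {
    public static void main(String[] args) {
        double theta = 5.0, alpha = 0.01, nu = 0.95, mu = 0.9, eps = 1e-8;
        double g = 0.0;   // second-order moment: decayed average of squared gradients
        double m = 0.0;   // first-order moment: decayed average of gradients
        double v = 0.0;   // momentum (velocity) term
        for (int t = 0; t < 500; t++) {
            double grad = theta;                                     // dL/dtheta
            g = nu * g + (1 - nu) * grad * grad;                     // g_t
            m = nu * m + (1 - nu) * grad;                            // m_t
            v = mu * v - alpha * grad / Math.sqrt(g - m * m + eps);  // v_t
            theta += v;                                              // theta_{t+1}
        }
        System.out.println("theta after 500 steps: " + theta);
    }
}
```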

AdaDelta

AdaDelta also uses an exponentially decaying average for g_t, our second moment of the gradient. But instead of the alpha we typically use as a learning rate, it introduces x_t, an exponentially decaying average of the squared updates v_t (the second moment of v_t), and scales each step by the ratio of the two.

g_t = ν · g_{t−1} + (1 − ν) · (∂L/∂θ_t)²
v_t = − (√(x_{t−1} + ε) / √(g_t + ε)) · ∂L/∂θ_t
x_t = ν · x_{t−1} + (1 − ν) · v_t²
θ_{t+1} = θ_t + v_t
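
A sketch of AdaDelta on the same toy loss; note that no learning rate α appears, only the decay rate ν and ε, both illustrative values here.

```java
// AdaDelta on the toy loss: the step is scaled by the ratio of x_{t-1}
// (decayed average of squared past updates) to g_t (decayed average of
// squared gradients), so no hand-set learning rate is needed.
// nu and eps are illustrative values.
public class AdaDeltaSketch {
    public static void main(String[] args) {
        double theta = 5.0, nu = 0.95, eps = 1e-6;
        double g = 0.0;   // decayed average of squared gradients
        double x = 0.0;   // decayed average of squared updates
        for (int t = 0; t < 1000; t++) {
            double grad = theta;                                        // dL/dtheta
            g = nu * g + (1 - nu) * grad * grad;                        // g_t
            double v = -Math.sqrt(x + eps) / Math.sqrt(g + eps) * grad; // v_t
            x = nu * x + (1 - nu) * v * v;                              // x_t
            theta += v;                                                 // theta_{t+1}
        }
        System.out.println("theta after 1000 steps: " + theta);
    }
}
```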

ADAM

ADAM uses both the first-order moment m_t and the second-order moment g_t of the gradient, each with its own exponential decay rate (β₁ and β₂). After a bias correction for early steps, the step size is roughly bounded by ±α, and it shrinks as the updater approaches an error minimum.

m_t = β₁ · m_{t−1} + (1 − β₁) · ∂L/∂θ_t
g_t = β₂ · g_{t−1} + (1 − β₂) · (∂L/∂θ_t)²
m̂_t = m_t / (1 − β₁^t),   ĝ_t = g_t / (1 − β₂^t)
θ_{t+1} = θ_t − α · m̂_t / (√ĝ_t + ε)
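
A sketch of ADAM on the same toy loss, including the bias correction of m_t and g_t; β₁, β₂, α and ε are illustrative values.

```java
// ADAM on the toy loss: decayed first-order moment m_t and second-order
// moment g_t, with bias correction so early estimates are not too small.
// beta1, beta2, alpha and eps are illustrative values.
public class AdamSketch {
    public static void main(String[] args) {
        double theta = 5.0, alpha = 0.1, beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
        double m = 0.0;   // first-order moment of the gradient
        double g = 0.0;   // second-order moment of the gradient
        for (int t = 1; t <= 200; t++) {
            double grad = theta;                              // dL/dtheta
            m = beta1 * m + (1 - beta1) * grad;               // m_t
            g = beta2 * g + (1 - beta2) * grad * grad;        // g_t
            double mHat = m / (1 - Math.pow(beta1, t));       // bias-corrected m_t
            double gHat = g / (1 - Math.pow(beta2, t));       // bias-corrected g_t
            theta -= alpha * mHat / (Math.sqrt(gHat) + eps);  // step of roughly +/- alpha
        }
        System.out.println("theta after 200 steps: " + theta);
    }
}
```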
