# Deeplearning4j Updaters Explained

This page and the explanations that follow assume that readers know how Stochastic Gradient Descent works.

The main difference among the updaters described below is how they treat the learning rate.

## Stochastic Gradient Descent

`Theta` (weights) is changed according to the gradient of the loss with respect to each theta.

`alpha` is the learning rate. If alpha is very small, convergence on an error minimum will be slow. If it is very large, the model will diverge away from the error minimum, and learning will cease.
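As a rough illustration of the update rule and of how alpha affects convergence, here is a minimal sketch in plain Python. The toy loss `L(theta) = sum(theta_i ** 2)` and the helper names `sgd_step`, `toy_grad` and `run_sgd` are assumptions made for this example; they are not part of Deeplearning4j.

```python
def sgd_step(theta, grad, alpha):
    """One plain SGD update: theta <- theta - alpha * grad, per parameter."""
    return [t - alpha * g for t, g in zip(theta, grad)]


def toy_grad(theta):
    # Gradient of the toy loss L(theta) = sum(theta_i ** 2); illustration only.
    return [2.0 * t for t in theta]


def run_sgd(alpha, steps=20):
    theta = [1.0]  # start one unit away from the minimum at 0
    for _ in range(steps):
        theta = sgd_step(theta, toy_grad(theta), alpha)
    return theta[0]


slow = run_sgd(alpha=0.01)     # tiny steps: still far from 0 after 20 updates
steady = run_sgd(alpha=0.1)    # moderate steps: much closer to 0
diverged = run_sgd(alpha=1.1)  # oversized steps: every update overshoots further
print(slow, steady, diverged)
```

On this toy problem the smallest alpha barely moves, the moderate one gets close to the minimum, and the largest one ends up farther from the minimum than where it started.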

Now, the gradient of the loss (L) changes quickly after each iteration due to variance among training examples. Look at the convergence path below. The updater takes small steps, but those steps zig-zag back and forth on their way to an error minimum.

## Momentum

To stop the zig-zagging, we use *momentum*. Momentum applies its knowledge from previous steps to where the updater should go. To represent it, we use a new hyperparameter `μ`, or “mu”.

We’ll use the concept of momentum again later. (Don’t confuse it with moment, of which more below.)

The image above represents SGD using momentum.
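The idea can be sketched with the classical momentum rule, assumed here to be `v = mu * v - alpha * grad` followed by `theta = theta + v`. The velocity `v`, the toy quadratic loss with two very different curvatures, and all function names are assumptions for illustration. The gradient in this sketch is deterministic, so it only mimics the zig-zag caused by a steep direction, not the noise of real mini-batches.

```python
def momentum_step(theta, grad, v, alpha=0.03, mu=0.9):
    """One momentum update; mu = 0 reduces to plain SGD."""
    # The velocity keeps a decaying memory of earlier steps.
    new_v = [mu * vi - alpha * g for vi, g in zip(v, grad)]
    new_theta = [t + vi for t, vi in zip(theta, new_v)]
    return new_theta, new_v


# Toy loss 0.5 * (1 * x**2 + 50 * y**2): shallow in x, steep in y.
CURVATURE = [1.0, 50.0]


def toy_loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURE, theta))


def toy_grad(theta):
    return [c * t for c, t in zip(CURVATURE, theta)]


def final_loss(mu, steps=200):
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, v = momentum_step(theta, toy_grad(theta), v, mu=mu)
    return toy_loss(theta)


loss_plain = final_loss(mu=0.0)     # plain SGD crawls along the shallow direction
loss_momentum = final_loss(mu=0.9)  # momentum builds up speed along it
print(loss_plain, loss_momentum)
```

With the same alpha and the same number of steps, the momentum run should finish with a lower loss on this toy problem.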

## Adagrad

Adagrad scales alpha for each parameter according to the history of gradients (previous steps) for that parameter. That’s basically done by dividing the current gradient in the update rule by the square root of the sum of the squared previous gradients. As a result, when the gradients have been very large, the effective alpha is reduced, and vice-versa.
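A minimal sketch of this per-parameter scaling, assuming the usual Adagrad formulation in which each gradient is divided by the square root of the running sum of its squared predecessors. The name `hist` for that running sum, the small constant `eps` that avoids division by zero, and the constant toy gradients are assumptions for illustration.

```python
import math


def adagrad_step(theta, grad, hist, alpha=0.1, eps=1e-8):
    """One Adagrad update; hist holds the sum of squared gradients per parameter."""
    new_hist = [h + g * g for h, g in zip(hist, grad)]
    new_theta = [
        t - alpha * g / (math.sqrt(h) + eps)
        for t, g, h in zip(theta, grad, new_hist)
    ]
    return new_theta, new_hist


# Two parameters whose gradients differ by a factor of 100.
grad = [10.0, 0.1]
theta0, hist0 = [0.0, 0.0], [0.0, 0.0]

theta1, hist1 = adagrad_step(theta0, grad, hist0)
theta2, hist2 = adagrad_step(theta1, grad, hist1)

first_steps = [abs(a - b) for a, b in zip(theta1, theta0)]
second_steps = [abs(a - b) for a, b in zip(theta2, theta1)]
print(first_steps, second_steps)
```

Both parameters take a first step of roughly alpha despite the very different gradient magnitudes, and the second step is smaller because the accumulated history only grows.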

## RMSProp

The only difference between RMSProp and Adagrad is that the `g_t` term is calculated as an exponentially decaying average of the squared gradients, not as their sum.

Here `g_t` is called the second-order moment of `δL`. Additionally, a first-order moment `m_t` can also be introduced.

Adding momentum, as in the first case…

…and finally collecting a new `theta` as we did in the first example.
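The steps above can be sketched as follows, assuming a decay rate `gamma` for `g_t`, a small constant `eps` for numerical stability, and the same velocity-style momentum as before. The optional first-order moment `m_t` is left out to keep the sketch short, and the exact form Deeplearning4j uses may differ.

```python
import math


def rmsprop_step(theta, grad, g, v, alpha=0.01, gamma=0.9, mu=0.0, eps=1e-8):
    """One RMSProp update with optional momentum (mu = 0 disables it)."""
    # g is an exponentially decaying average of squared gradients.
    new_g = [gamma * gi + (1.0 - gamma) * d * d for gi, d in zip(g, grad)]
    # The velocity uses the gradient scaled by the root of that average.
    new_v = [
        mu * vi - alpha * d / math.sqrt(gi + eps)
        for vi, d, gi in zip(v, grad, new_g)
    ]
    new_theta = [t + vi for t, vi in zip(theta, new_v)]
    return new_theta, new_g, new_v


# Feed a constant gradient and record how large each step is.
theta, g, v = [0.0], [0.0], [0.0]
step_sizes = []
for _ in range(200):
    new_theta, g, v = rmsprop_step(theta, [1.0], g, v)
    step_sizes.append(abs(new_theta[0] - theta[0]))
    theta = new_theta

first_step, last_step = step_sizes[0], step_sizes[-1]
print(first_step, last_step)
```

Under a constant gradient the step size settles near alpha instead of shrinking toward zero, which is the practical difference from Adagrad's ever-growing sum.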

## AdaDelta

AdaDelta also uses an exponentially decaying average of `g_t`, which was our second moment of gradient. But instead of the alpha we typically use as the learning rate, it introduces `x_t`, which is the second moment of `v_t`.
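A hedged sketch of one common way to write AdaDelta, assuming `v_t` denotes the update applied to theta and that the step is the gradient scaled by `sqrt(x + eps) / sqrt(g + eps)`. The decay rate `gamma`, the constant `eps`, and the toy loss `theta ** 2` are assumptions for illustration.

```python
import math


def adadelta_step(theta, grad, g, x, gamma=0.95, eps=1e-6):
    """One AdaDelta update; note that no learning rate alpha appears."""
    # g: decaying average of squared gradients (second moment of the gradient).
    new_g = [gamma * gi + (1.0 - gamma) * d * d for gi, d in zip(g, grad)]
    # v: the update, scaled by the ratio of the two running averages.
    v = [
        -math.sqrt(xi + eps) / math.sqrt(gi + eps) * d
        for xi, gi, d in zip(x, new_g, grad)
    ]
    # x: decaying average of squared updates (second moment of v).
    new_x = [gamma * xi + (1.0 - gamma) * vi * vi for xi, vi in zip(x, v)]
    new_theta = [t + vi for t, vi in zip(theta, v)]
    return new_theta, new_g, new_x


# Minimise the toy loss theta ** 2, whose gradient is 2 * theta.
theta, g, x = [1.0], [0.0], [0.0]
for _ in range(50):
    grad = [2.0 * t for t in theta]
    theta, g, x = adadelta_step(theta, grad, g, x)
print(theta[0])
```

Because `x` starts at zero, the first updates are tiny and governed mostly by `eps`; the step size then adapts from the history of past updates rather than from a hand-tuned alpha.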

## ADAM

ADAM uses both the first-order moment `m_t` and the second-order moment `g_t`, but they both decay over time. Step size is approximately `±α`. Step size will decrease as we approach the error minimum.
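To make the `±α` claim concrete, here is a minimal sketch assuming the commonly published Adam formulation with bias correction. The decay rates `beta1` and `beta2`, the constant `eps`, and the step counter `t` are assumptions for illustration and are not taken from the text above.

```python
import math


def adam_step(theta, grad, m, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    # m: decaying average of gradients (first-order moment).
    new_m = [beta1 * mi + (1.0 - beta1) * d for mi, d in zip(m, grad)]
    # g: decaying average of squared gradients (second-order moment).
    new_g = [beta2 * gi + (1.0 - beta2) * d * d for gi, d in zip(g, grad)]
    new_theta = []
    for th, mi, gi in zip(theta, new_m, new_g):
        m_hat = mi / (1.0 - beta1 ** t)  # undo the bias from starting at zero
        g_hat = gi / (1.0 - beta2 ** t)
        new_theta.append(th - alpha * m_hat / (math.sqrt(g_hat) + eps))
    return new_theta, new_m, new_g


# Two gradients that differ hugely in magnitude and in sign.
theta, m, g = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
theta, m, g = adam_step(theta, [100.0, -0.001], m, g, t=1)
print(theta)
```

On the first step both parameters move by about alpha in the direction opposite to their gradient, regardless of how large that gradient is. As gradients shrink and change sign near a minimum, `m_hat` becomes small relative to `sqrt(g_hat)`, so the steps shrink as well.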