An adaptive threshold algorithm used to determine the encoding threshold for distributed training.
The idea: the threshold can be too high or too low for optimal training - both cases are bad.
So instead, we'll define a range of "acceptable" sparsity ratio values (default: 1e-4 to 1e-2).
The sparsity ratio is defined as numValues(encodedUpdate) / numParameters.
If the sparsity ratio falls outside of this acceptable range, we'll either increase or decrease the threshold.
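For concreteness, a minimal sketch of the ratio computation against the default acceptable range. The method name, variable names, and numbers here are illustrative, not part of any particular API:

    public class SparsityRatioExample {

        // sparsityRatio = numValues(encodedUpdate) / numParameters
        static double sparsityRatio(long numEncodedValues, long numParameters) {
            return numEncodedValues / (double) numParameters;
        }

        public static void main(String[] args) {
            // Hypothetical example: 5,000 of 10,000,000 parameter updates exceeded the threshold
            double ratio = sparsityRatio(5_000L, 10_000_000L);
            System.out.println(ratio);                          // 5.0E-4
            System.out.println(ratio >= 1e-4 && ratio <= 1e-2); // true - within the default range
        }
    }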
The threshold is changed multiplicatively using the decay rate:
To increase threshold:
newThreshold = (1.0/decayRate) * threshold
To decrease threshold:
newThreshold = decayRate * threshold
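Putting the update rule together with the acceptable range, one adjustment step could look like the following. This is a minimal sketch assuming the constant and method names shown (they are not from any particular library); note that decayRate < 1, so multiplying decreases the threshold and dividing increases it:

    public class ThresholdAdjustment {
        static final double DECAY_RATE = 0.965936; // default decay rate, < 1.0
        static final double MIN_SPARSITY_RATIO = 1e-4;
        static final double MAX_SPARSITY_RATIO = 1e-2;

        // Returns the threshold to use for the next iteration.
        static double adjustThreshold(double threshold, double sparsityRatio) {
            if (sparsityRatio < MIN_SPARSITY_RATIO) {
                // Too few values encoded: decrease the threshold so more values are sent
                return DECAY_RATE * threshold;
            }
            if (sparsityRatio > MAX_SPARSITY_RATIO) {
                // Too many values encoded: increase the threshold so fewer values are sent
                return (1.0 / DECAY_RATE) * threshold;
            }
            return threshold; // ratio is acceptable: keep the threshold unchanged
        }
    }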
The default decay rate is 0.965936, which corresponds to a maximum increase or
decrease of the threshold by a factor of:
* 2.0 in 20 iterations
* 100 in 132 iterations
* 1000 in 200 iterations
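These factors follow from the rule being applied once per iteration: the default decay rate is 2^(-1/20) ≈ 0.965936, so n consecutive adjustments in the same direction scale the threshold by decayRate^n (or its reciprocal). A quick sanity check, with the printed values noted as comments:

    public class DecayRateCheck {
        public static void main(String[] args) {
            double decayRate = 0.965936;
            System.out.println(Math.pow(1.0 / decayRate, 20));  // ~2.0
            System.out.println(Math.pow(1.0 / decayRate, 132)); // ~97, i.e. roughly 100
            System.out.println(Math.pow(1.0 / decayRate, 200)); // ~1024, i.e. roughly 1000
        }
    }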
A high threshold leads to few values being encoded and communicated - a small "sparsity ratio".
Too high a threshold (too low a sparsity ratio) gives fast network communication but slow training, as few parameter updates are communicated.
A low threshold leads to many values being encoded and communicated - a large "sparsity ratio".
Too low a threshold (too high a sparsity ratio) gives slower network communication and possibly also slow training: many parameter updates
are communicated, but they are all very small and change the network's predictions only a tiny amount.
A sparsity ratio of 1.0 means all values are present in the encoded update vector.
A sparsity ratio of 0.0 means all values are excluded from the encoded update vector.