Reputation: 464
Why does the Keras implementation of the Adam optimizer have a decay argument while TensorFlow's doesn't? And what is the idea behind this argument?
Upvotes: 0
Views: 503
Reputation: 10366
The difference might reflect the ongoing discussion about whether learning rate decay is even needed when applying Adam: Adam already adapts a per-parameter step size from its moment estimates, but that adaptation only rescales the global learning rate, so an explicit decay schedule can still make a difference.
This is why there is a debate about whether learning rate decay with Adam is necessary at all.
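If you do decide to combine decay with Adam, the two APIs expose it differently. A minimal sketch, assuming the standalone Keras and TensorFlow 1.x interfaces (the 0.001 and 1e-4 values are just illustrative):

# Keras: inverse-time decay is built into the optimizer itself.
from keras.optimizers import Adam
keras_opt = Adam(lr=0.001, decay=1e-4)

# TensorFlow: the optimizer has no decay argument; you pass a decaying
# learning-rate tensor instead. tf.train.inverse_time_decay implements the
# same lr / (1 + decay_rate * step) formula when decay_steps=1.
import tensorflow as tf
global_step = tf.Variable(0, trainable=False)
decayed_lr = tf.train.inverse_time_decay(
    learning_rate=0.001, global_step=global_step,
    decay_steps=1, decay_rate=1e-4)
tf_opt = tf.train.AdamOptimizer(learning_rate=decayed_lr)
# (remember to pass global_step to minimize() so it actually increments)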
Upvotes: 0
Reputation: 86600
The "why" is very hard to answer.
It's interesting to have a decay, though, for when your training reaches a limit: lowering the learning rate may improve your model by making finer adjustments. But machine learning is all about testing.
The idea is simply to reduce the learning rate a little with every batch update.
This is the formula Keras uses:
lr = self.lr
if self.initial_decay > 0:
    lr = lr * (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
Basically it's:
lr / (1 + decay * currentBatch)  # considering currentBatch keeps increasing, not looping
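To see what that formula does to the step size over training, here is a small standalone sketch (plain Python; base_lr and decay are just hypothetical example values):

# Inverse-time decay as Keras applies it: lr / (1 + decay * iterations)
base_lr = 0.001   # hypothetical initial learning rate
decay = 1e-4      # hypothetical decay value

for iteration in (0, 100, 1000, 10000, 100000):
    lr = base_lr / (1. + decay * iteration)
    print("iteration %6d: lr = %.6f" % (iteration, lr))

With these values the learning rate halves after 1/decay = 10,000 batches and keeps shrinking slowly from there, which is the finer-adjustment effect mentioned above.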
Upvotes: 1