Reputation: 464
Why does the Keras implementation of the Adam optimizer have a decay argument while TensorFlow's doesn't? And what is the idea behind this argument?
Upvotes: 0
Views: 503
Reputation: 10366
The difference might reflect the ongoing discussion about whether learning rate decay is even needed when applying Adam: Adam already adapts a per-parameter step size from its moment estimates, but that adaptation only rescales the global learning rate, so an explicit decay schedule can still make a difference.
This is why there is a debate about whether learning rate decay with Adam is necessary at all.
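If you do decide to combine decay with Adam, the two APIs expose it differently. A minimal sketch, assuming the standalone Keras and TensorFlow 1.x interfaces (the 0.001 and 1e-4 values are just illustrative):

# Keras: inverse-time decay is built into the optimizer itself.
from keras.optimizers import Adam
keras_opt = Adam(lr=0.001, decay=1e-4)

# TensorFlow: the optimizer has no decay argument; you pass a decaying
# learning-rate tensor instead. tf.train.inverse_time_decay implements the
# same lr / (1 + decay_rate * step) formula when decay_steps=1.
import tensorflow as tf
global_step = tf.Variable(0, trainable=False)
decayed_lr = tf.train.inverse_time_decay(
    learning_rate=0.001, global_step=global_step,
    decay_steps=1, decay_rate=1e-4)
tf_opt = tf.train.AdamOptimizer(learning_rate=decayed_lr)
# (remember to pass global_step to minimize() so it actually increments)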
Upvotes: 0
Reputation: 86600
The "why" is very hard to answer.
It's interesting to have a decay, though, for when your training reaches a limit: lowering the learning rate may improve your model by making finer adjustments. But machine learning is all about testing.
The idea is simply to reduce the learning rate a little with every batch update.
This is the formula Keras uses:
lr = self.lr
if self.initial_decay > 0:
    lr = lr * (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
Basically it's:
lr / (1 + decay * currentBatch)  # considering currentBatch keeps increasing, not looping
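To see what that formula does to the step size over training, here is a small standalone sketch (plain Python; base_lr and decay are just hypothetical example values):

# Inverse-time decay as Keras applies it: lr / (1 + decay * iterations)
base_lr = 0.001   # hypothetical initial learning rate
decay = 1e-4      # hypothetical decay value

for iteration in (0, 100, 1000, 10000, 100000):
    lr = base_lr / (1. + decay * iteration)
    print("iteration %6d: lr = %.6f" % (iteration, lr))

With these values the learning rate halves after 1/decay = 10,000 batches and keeps shrinking slowly from there, which is the finer-adjustment effect mentioned above.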
Upvotes: 1