Reputation: 953
With TensorFlow, my model size (model.ckpt.data) is 88M when the optimizer is tf.train.GradientDescentOptimizer
, but it grows to 220M when the optimizer is changed to tf.train.AdamOptimizer
.
Why is there such a huge difference?
Upvotes: 1
Views: 271
Reputation: 56347
Adam maintains two running averages per trainable parameter (one of the gradient and one of the squared gradient) as additional non-trainable variables, so the total number of stored values is roughly three times the number of trainable parameters. These non-trainable variables are also saved in the checkpoint because they are required to resume training from where it left off. That's why the model checkpoint is bigger.
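To make the 3x bookkeeping concrete, here is a minimal NumPy sketch of one Adam update (following the standard Adam formulas, not TensorFlow's internal implementation). Note that the optimizer state `m` and `v` each have the same shape as the parameters, so checkpointing them triples the stored array count:

```python
import numpy as np

def adam_update(params, grads, m, v, t,
                lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step. m and v are per-parameter running averages
    of the gradient and the squared gradient, respectively."""
    m = b1 * m + (1 - b1) * grads            # first moment estimate
    v = b2 * v + (1 - b2) * grads ** 2       # second moment estimate
    m_hat = m / (1 - b1 ** t)                # bias correction
    v_hat = v / (1 - b2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy example: a "model" with 1000 weights.
params = np.random.randn(1000).astype(np.float32)
m = np.zeros_like(params)   # same shape as params
v = np.zeros_like(params)   # same shape as params

grads = np.random.randn(1000).astype(np.float32)
params, m, v = adam_update(params, grads, m, v, t=1)

# Values the checkpoint must store to resume training:
total_stored = params.size + m.size + v.size
print(total_stored)  # 3000, i.e. 3x the trainable parameters
```

Plain gradient descent has no such state, so its checkpoint only stores the parameters themselves, which matches the 88M vs. 220M sizes you observed (the ratio is below 3x because the checkpoint also contains non-optimizer data such as the graph metadata, and Adam slots are typically created only for trainable variables).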
Upvotes: 2