Reputation: 655
When using one of the adaptive optimizers (Adam, etc.), we expect the learning rate to change across successive mini-batches during training within an epoch. But I wonder how the learning rate behaves between successive epochs: is it carried over from the previous epoch (the behavior I expect), or re-initialized to its default?
Of course, by "rate" I mean the whole set of variables a particular optimizer uses to determine the actual weight update w.r.t. the gradient.
Also, what would happen to the rate if I run training for N epochs, stop, and then continue like this:
model.fit(data1_train_x, data1_train_y,
          initial_epoch=0,
          epochs=20,
          validation_split=0.1,
          batch_size=64,
          callbacks=[tensorboard])

model.fit(data2_train_x, data2_train_y,
          initial_epoch=20,
          epochs=40,
          validation_split=0.1,
          batch_size=64,
          callbacks=[tensorboard])
I think I"ll create callback to log the rate after each epoch and plot it, but before I do it, may be someone already has the answers.
Upvotes: 1
Views: 1704
Reputation: 77857
Summary
Rate changes do not reset; they continue smoothly across epochs in both cases.
Detail
Any well-behaved learning-rate decay function depends on how long training has been running, counted from iteration 0.
Note: you can write your own decay function; you can make it as deranged as you wish. One such alteration is
alpha = iteration_number
this will diverge before you get back with your coffee.
Some functions merely depend on the current state and a modifier, such as
if iteration_number % 5000 == 0:
    alpha *= 0.9
Another is a semi-exponential decay that depends on the number of remaining iterations.
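For illustration, a schedule of that flavour might look like the following sketch (the function name and the decay constant are my own choices, not any particular library's implementation):

import math

def semi_exponential_decay(alpha0, iteration, total_iterations):
    # Shrink the rate exponentially as the remaining iteration budget runs out;
    # the constant 5.0 only controls how aggressive the decay is (illustrative)
    remaining = max(total_iterations - iteration, 0)
    return alpha0 * math.exp(-5.0 * (1.0 - remaining / float(total_iterations)))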
In any case, these do not reset at the start of every epoch. You can write one to reset, if you wish, but I don't recommend it.
Your two-stage example is no exception, because you've coded it properly: you have the second training segment start where the previous one left off. The critical clue here is the initial_epoch parameter: you're telling the fitting function where to resume the learning-rate schedule, rather than resetting it to time zero.
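If you want to see this with an explicit schedule, you could pair both fit() calls with Keras's LearningRateScheduler: the schedule function receives the absolute epoch index, so with initial_epoch=20 the second call resumes at epoch 20 instead of restarting at 0. A sketch, reusing the variables from your example (the schedule itself is illustrative):

from keras.callbacks import LearningRateScheduler

def schedule(epoch, lr=None):
    # 'epoch' is the absolute epoch index, so the second fit() below
    # enters here with epoch = 20, 21, ... rather than starting over at 0
    return 0.001 * (0.95 ** epoch)

lr_schedule = LearningRateScheduler(schedule)

model.fit(data1_train_x, data1_train_y,
          initial_epoch=0, epochs=20,
          validation_split=0.1, batch_size=64,
          callbacks=[tensorboard, lr_schedule])

model.fit(data2_train_x, data2_train_y,
          initial_epoch=20, epochs=40,
          validation_split=0.1, batch_size=64,
          callbacks=[tensorboard, lr_schedule])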
Upvotes: 1