Reputation: 13003
Currently I make checkpoints during training like this (pseudocode):
while training:
    model.train()
    if it_is_time_for_validation():
        metrics = model.validate()
        if metrics.are_good():
            saver = tf.train.Saver()
            res = saver.save(sess=session, save_path=checkpoint_file_path)
The Saver.save method blocks on disk I/O, preventing the next training iterations from running.
My model's weights are hundreds of megabytes, so writing them all out takes a while.
By my calculations, depending on checkpoint frequency, the GPU spends 5-10% of its time overall waiting for checkpoints to finish instead of doing useful computation (5-10% is equivalent to about a day of computation).
Is there a way to perform checkpoints asynchronously to reduce the waste of computational time?
Implementation sketch: first copy everything necessary from device memory to the host, then perform the disk I/O on a separate thread. Saver.save would return right after the memcpy, without waiting for the disk operations, since it is then safe to keep training the device copy without corrupting the checkpoint. Saver.save would still block on re-entry if I/O from the previous checkpoint is still pending.
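A rough user-code sketch of that idea (not an existing TensorFlow feature; the helper name snapshot_and_save is made up for illustration, and the result is a pickle rather than a standard checkpoint file):

import pickle
import threading

import tensorflow as tf

_io_thread = None  # handle to the previous background write, if any


def _write_snapshot(snapshot, path):
    # Disk I/O happens here, off the training thread.
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)


def snapshot_and_save(session, path):
    """Copy weights device -> host synchronously, write them to disk asynchronously."""
    global _io_thread
    # Still block if the previous checkpoint's disk I/O has not finished.
    if _io_thread is not None:
        _io_thread.join()
    # Synchronous part: pull the current variable values into host memory.
    variables = tf.global_variables()
    values = session.run(variables)
    snapshot = {v.name: val for v, val in zip(variables, values)}
    # Asynchronous part: only the disk write runs on the background thread.
    _io_thread = threading.Thread(target=_write_snapshot, args=(snapshot, path))
    _io_thread.start()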
I don't think this is currently implemented, so I am also interested in possible workarounds. Is this idea worth filing as a feature request on GitHub?
Upvotes: 1
Views: 918
Reputation: 126154
You can write checkpoints asynchronously by running saver.save() in a separate thread. The (internal) SVTimerCheckpointThread is an example of code that runs saver.save() periodically in the background of training. Note that tf.train.Supervisor is a utility class that helps manage such background threads (it also writes TensorBoard summary logs, etc.), so you might want to use it instead.
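A minimal sketch of the background-thread approach, assuming a TF 1.x graph and an existing tf.Session named session (the wrapper name async_checkpoint is made up for illustration):

import threading

import tensorflow as tf

saver = tf.train.Saver()  # build once, after the graph is finalized
_save_thread = None       # handle to the last background save, if any


def async_checkpoint(session, checkpoint_path, step):
    """Kick off saver.save() on a background thread so training can continue."""
    global _save_thread
    # If the previous checkpoint is still being written, wait for it first.
    if _save_thread is not None:
        _save_thread.join()
    _save_thread = threading.Thread(
        target=saver.save,
        kwargs=dict(sess=session, save_path=checkpoint_path, global_step=step))
    _save_thread.start()

One caveat: saver.save() reads the variable values at the moment its save op actually runs, so if the training loop keeps calling session.run() concurrently, the checkpoint may capture a slightly later state than the one at which the thread was started.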
Upvotes: 2