Drop

Reputation: 13003

Is there a way to write TensorFlow checkpoints asynchronously?

Currently I make checkpoints during training like this (pseudocode):

import tensorflow as tf

# model, session, it_is_time_for_validation, metrics etc. are placeholders
# for the real training objects.
saver = tf.train.Saver()  # build once, so it does not add new ops to the graph every time

while training:
    model.train()

    if it_is_time_for_validation():
        metrics = model.validate()

        if metrics.are_good():
            # Blocks until the whole checkpoint has been written to disk.
            saver.save(sess=session, save_path=checkpoint_file_path)

The Saver.save method blocks on I/O, preventing the next iterations from running. My model's weights are hundreds of megabytes, and writing them all out takes a while.

By my calculations, depending on checkpoint frequency, the GPU spends 5-10% of its time overall waiting for checkpoints to finish instead of doing useful work. (5-10% is the equivalent of a day of computation.)

Is there a way to perform checkpointing asynchronously to reduce this waste of compute time?

Implementation sketch: first copy everything necessary from device memory to the host, then perform the disk I/O on a separate thread. Saver.save would return right after the memcpy, without waiting for the disk operations, since it is then safe to keep training the device copy without corrupting the checkpoint. Saver.save would still block on re-entry if I/O from the previous call is still pending.
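The closest user-level approximation of this idea I can think of uses shadow copies of the variables instead of a host-side memcpy (the shadow variables and the snapshot_and_save helper below are hypothetical names of mine, it is untested, and keeping shadows doubles the memory held by variables):

import threading
import tensorflow as tf

# Keep a non-trainable shadow copy of every trainable variable.
# Copying into the shadows is a cheap on-device assign; writing the
# shadows to disk can then happen on a background thread while training
# keeps mutating the originals.
train_vars = tf.trainable_variables()
shadow_vars = [tf.Variable(v.initialized_value(), trainable=False,
                           name=v.op.name + "_shadow")
               for v in train_vars]
copy_op = tf.group(*[s.assign(v) for s, v in zip(shadow_vars, train_vars)])

# Save the shadows under the original variable names so the checkpoint
# restores cleanly into the real variables later.
shadow_saver = tf.train.Saver(
    var_list={v.op.name: s for v, s in zip(train_vars, shadow_vars)})

def snapshot_and_save(session, path):
    session.run(copy_op)  # fast: no disk I/O, just the variable copy
    thread = threading.Thread(
        target=shadow_saver.save,
        kwargs={"sess": session, "save_path": path})
    thread.start()
    return thread  # join() this before the next snapshot to avoid racing writes

The copy_op stands in for the memcpy above and the background thread does the disk I/O; joining the returned thread before the next snapshot reproduces the block-on-re-entry behaviour.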

Either way, I don't think asynchronous saving is currently implemented, so I am interested in possible workarounds as well. Is this idea worth filing as a feature request on GitHub?

Upvotes: 1

Views: 918

Answers (1)

mrry

Reputation: 126154

You can write checkpoints asynchronously by running saver.save() in a separate thread. The (internal) SVTimerCheckpointThread is an example of code that runs saver.save() periodically in the background of training. Note that tf.train.Supervisor is a utility class that helps manage such background threads (and also writes TensorBoard summary logs, etc.), so you might want to use it instead.
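A minimal sketch of the manual approach (the async_save helper and its bookkeeping below are just illustrative, not part of the TensorFlow API):

import threading
import tensorflow as tf

saver = tf.train.Saver()
_pending = [None]  # handle to the last background save, if any

def async_save(session, path, global_step=None):
    # Wait for the previous checkpoint to finish, so two saves never
    # write to the same files concurrently.
    if _pending[0] is not None:
        _pending[0].join()
    # tf.Session is thread-safe, so saver.save() can issue its session.run()
    # calls from a background thread while the main thread keeps training.
    thread = threading.Thread(
        target=saver.save,
        kwargs={"sess": session, "save_path": path,
                "global_step": global_step})
    thread.start()
    _pending[0] = thread

One caveat: the variables are read at whatever values they hold when the background save actually runs, so a checkpoint taken while training steps execute concurrently may mix values from adjacent steps. If step-consistent checkpoints matter, snapshot the variables first (as sketched in the question) or pause training around the save.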

Upvotes: 2
