Reputation: 13003
Currently I make checkpoints during training like this (pseudocode):
while training:
    model.train()
    if it_is_time_for_validation():
        metrics = model.validate()
        if metrics.are_good():
            saver = tf.train.Saver()
            res = saver.save(sess=session, save_path=checkpoint_file_path)
The Saver.save method blocks on disk I/O, preventing the next training iterations from running.
My model's weights are hundreds of megabytes, so writing them all out takes a while.
By my calculations, depending on checkpoint frequency, the GPU spends 5-10% of its time overall waiting for checkpoints to finish instead of doing useful computation (5-10% is equivalent to about a day of computation).
Is there a way to perform checkpoints asynchronously to reduce the waste of computational time?
Implementation sketch: first copy everything necessary from device memory to the host, then perform the disk I/O on a separate thread. Saver.save would return right after the memcpy, without waiting for the disk operations, since it is then safe to keep training the device copy without corrupting the checkpoint. Saver.save would still block on re-entry if I/O from the previous checkpoint is still pending.
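A rough user-code sketch of that idea (not an existing TensorFlow feature; the helper name snapshot_and_save is made up for illustration, and the result is a pickle rather than a standard checkpoint file):

import pickle
import threading

import tensorflow as tf

_io_thread = None  # handle to the previous background write, if any


def _write_snapshot(snapshot, path):
    # Disk I/O happens here, off the training thread.
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)


def snapshot_and_save(session, path):
    """Copy weights device -> host synchronously, write them to disk asynchronously."""
    global _io_thread
    # Still block if the previous checkpoint's disk I/O has not finished.
    if _io_thread is not None:
        _io_thread.join()
    # Synchronous part: pull the current variable values into host memory.
    variables = tf.global_variables()
    values = session.run(variables)
    snapshot = {v.name: val for v, val in zip(variables, values)}
    # Asynchronous part: only the disk write runs on the background thread.
    _io_thread = threading.Thread(target=_write_snapshot, args=(snapshot, path))
    _io_thread.start()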
I don't think this is currently implemented, so I am also interested in possible workarounds. Is this idea worth filing as a feature request on GitHub?
Upvotes: 1
Views: 918
Reputation: 126154
You can write checkpoints asynchronously by running saver.save() in a separate thread. The (internal) SVTimerCheckpointThread is an example of code that runs saver.save() periodically in the background of training. Note that tf.train.Supervisor is a utility class that helps manage such background threads (it also writes TensorBoard summary logs, etc.), so you might want to use it instead.
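A minimal sketch of the background-thread approach, assuming a TF 1.x graph and an existing tf.Session named session (the wrapper name async_checkpoint is made up for illustration):

import threading

import tensorflow as tf

saver = tf.train.Saver()  # build once, after the graph is finalized
_save_thread = None       # handle to the last background save, if any


def async_checkpoint(session, checkpoint_path, step):
    """Kick off saver.save() on a background thread so training can continue."""
    global _save_thread
    # If the previous checkpoint is still being written, wait for it first.
    if _save_thread is not None:
        _save_thread.join()
    _save_thread = threading.Thread(
        target=saver.save,
        kwargs=dict(sess=session, save_path=checkpoint_path, global_step=step))
    _save_thread.start()

One caveat: saver.save() reads the variable values at the moment its save op actually runs, so if the training loop keeps calling session.run() concurrently, the checkpoint may capture a slightly later state than the one at which the thread was started.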
Upvotes: 2