Save a tensorflow model after a fixed training time

Question

I'm training a model on a server that allows me only one hour of computation; at the end of that time, it simply kills my job. I would like TensorFlow to save the results of its training after, say, 58 minutes, whatever the current state is. I'm OK with it saving the state at the last completed epoch; I just want to have an idea of what's going on. How can I do that?

alessiosavi · Accepted Answer

You can define a callback that stops training once the time limit is reached. fit() then returns normally, and you can save the model before the job is killed.

You can have a look here for further information:
https://towardsdatascience.com/neural-network-with-tensorflow-how-to-stop-training-using-callback-5c8d575c18a9

That example creates a callback that stops training when the accuracy exceeds a threshold. You can modify it to check the elapsed time instead.

This is a working piece of code:

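import time

from tensorflow.keras.callbacks import Callback

class TimeOut(Callback):
    def __init__(self, t0, timeout):
        super().__init__()
        self.t0 = t0            # start time, as returned by time.time()
        self.timeout = timeout  # time budget in minutes

    def on_train_batch_end(self, batch, logs=None):
        # Check the clock after every batch so the stop is not delayed by a long epoch.
        elapsed = time.time() - self.t0
        if elapsed > self.timeout * 60:
            print(f"\nReached {elapsed / 60:.3f} minutes of training, stopping")
            self.model.stop_training = True

callbacks = [TimeOut(t0=time.time(), timeout=58)]  # stop after 58 minutes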
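The callback only stops training; the save itself still has to happen after fit() returns. A minimal sketch of that last step follows. The helper name train_and_save and its parameters are hypothetical (not part of Keras), and model is assumed to be an already compiled Keras model, or any object exposing fit() and save().

```python
import time


def train_and_save(model, x, y, callbacks, save_path, epochs=1000):
    """Train until the epochs run out or a callback stops training, then save.

    Hypothetical helper: `model` is assumed to expose Keras-style
    fit(x, y, epochs=..., callbacks=...) and save(path) methods.
    """
    start = time.time()
    # fit() returns once a callback has set model.stop_training = True,
    # so the save below runs before the server kills the job.
    model.fit(x, y, epochs=epochs, callbacks=callbacks)
    model.save(save_path)
    # Minutes spent in fit() and save(), useful for tuning the time margin.
    return (time.time() - start) / 60
```

It could be called as train_and_save(model, x_train, y_train, callbacks, "model.keras"), where x_train, y_train and the file name are placeholders. Saving takes some time too, so the timeout should leave enough margin before the one-hour limit; the 58 minutes above assume that two minutes are enough.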
