Yousef

Reputation: 403

Freeing the GPU temporarily in TensorFlow or PyTorch

I am using TensorFlow to train my experiments, and some of them run for a long time. In the middle of a run I would like to test new implementations, which means I need to stop the process and return to it later. Saving and loading checkpoints does not solve this. Is there any way to store the GPU state and the process and restore them later? I tried kill -STOP, but it does not free the GPU.

Upvotes: 1

Views: 85

Answers (1)

user11530462

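You can set epochs in model.fit() so that training runs for only a few epochs and then ends. Later, you can continue from where you left off with another model.fit() call by setting initial_epoch to the number of epochs already completed, len(history.epoch). Note that history.epoch[-1] is the zero-based index of the last completed epoch, so passing it directly would start one epoch too early and run one extra epoch.

For example:

Initially you train for 10 epochs:

# model, train_batches and validation_batches are defined elsewhere.
initial_epochs = 10
history = model.fit(train_batches,
                    epochs=initial_epochs,
                    validation_data=validation_batches)

Later you can come back and train for another 10 epochs:

fine_tune_epochs = 10
# epochs is the index of the final epoch, not the number of additional epochs.
total_epochs = initial_epochs + fine_tune_epochs

history_fine = model.fit(train_batches,
                         epochs=total_epochs,
                         # Resume at the first epoch that has not been run yet.
                         initial_epoch=len(history.epoch),
                         validation_data=validation_batches)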

You can find a well-written example of this here; it uses this procedure for fine-tuning.

Also, as suggested by szymon-maszke, you can use model.save to save your model. It stores the information needed to restart training (architecture, weights, and optimizer state) once you reload the model with load_model. This works fine on CPU, but users have reported issues on GitHub and Stack Overflow when saving, loading, and retraining on GPU.
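The save-and-reload approach can be sketched as below. This is a minimal illustration rather than a tested recipe for any particular setup: the toy data, the tiny model, and the save path are all hypothetical stand-ins, and the two "sessions" run in one script only so the example is self-contained. In practice the first process would exit after saving, which is what lets the GPU be used for something else in between.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Toy data and model: hypothetical stand-ins for the real experiment.
x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# First session: train for a few epochs and remember how many were completed.
history = model.fit(x, y, epochs=3, verbose=0)
epochs_done = len(history.epoch)

# Save the whole model (architecture, weights, optimizer state).
save_path = os.path.join(tempfile.mkdtemp(), "model.keras")  # hypothetical path
model.save(save_path)

# Later session (normally a new process): reload and continue for 2 more epochs.
restored = tf.keras.models.load_model(save_path)
history_resumed = restored.fit(
    x, y,
    epochs=epochs_done + 2,     # index of the final epoch
    initial_epoch=epochs_done,  # first epoch that has not been run yet
    verbose=0,
)
```

The epoch count has to be carried across sessions separately (for example in a small text file), since it is not read back from the saved model here.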

You can also have a look at train_on_batch, which runs a single training update on one batch of data. The idea behind train_on_batch is that you write the training loop yourself, so you can do other things between batches. I am not sure it will be of much help here.
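As a rough sketch of what "doing things between batches" could look like, the loop below checks for a sentinel file after every batch and stops early if it exists. The sentinel file, the toy data, and the model are assumptions made for illustration only; a real loop would also save the model before breaking out.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Toy data and model: hypothetical stand-ins for the real experiment.
x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# Hypothetical sentinel: creating this file asks the loop to stop.
stop_file = os.path.join(tempfile.mkdtemp(), "stop_training.flag")

batch_size = 16
losses = []
for start in range(0, len(x), batch_size):
    end = start + batch_size
    # One gradient update on a single batch.
    loss = model.train_on_batch(x[start:end], y[start:end])
    # The return value is a scalar here (no extra metrics); ravel makes the
    # conversion robust if a 0-d array or one-element list is returned.
    losses.append(float(np.ravel(loss)[0]))
    # Custom logic between batches: stop early if the sentinel file exists.
    if os.path.exists(stop_file):
        break
```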

Meanwhile, you can use the first approach.

Hope this answers your question. Happy learning.

Upvotes: 1
