Reputation: 803
I need to train a CNN that will take 1-2 days to train on a remotely accessed GPU server.
Will I simply need to leave my laptop on overnight for the training to be complete or is there a way to save the state of the training and resume from there the next day?
(Implementation in pytorch)
Upvotes: 0
Views: 1109
Reputation: 521
I assume you ssh into you remote server. When training the model by running your script, say, $ python train.py
, simply pre-append nohup
:
$ nohup python train.py
This tells your process to disregard the hangup signal when you exit the ssh session and shut down your laptop.
Upvotes: 2
Reputation: 286
If you need to keep training the model that you are about to save, you need to save more than just the model. You also need to save the state of the optimizer, epochs, score, etc. You would do it like this:
state = {
'epoch': epoch,
'state_dict': model.state_dict(),
'optimizer': optimizer.state_dict(),
...
}
torch.save(state, filepath)
To resume training you would do things like: state = torch.load(filepath)
, and then, to restore the state of each individual object, something like this:
model.load_state_dict(state['state_dict'])
optimizer.load_state_dict(stata['optimizer'])
Since you are resuming training, DO NOT call model.eval() once you restore the states when loading.
To read more about this or see actual examples: https://www.programcreek.com/python/example/101175/torch.save
Upvotes: 2