GoingMyWay

Reputation: 17468

Question on restoring training after loading model

After training for 24 hours, the training process saved the model files via torch.save. Then a power failure (or some other issue) caused the process to exit. Normally, we can load the model and continue training from the last step.

Should we also load the states of the optimizers (Adam, etc.)? Is it necessary?

Upvotes: 1

Views: 613

Answers (2)

Sagnik Mukherjee

Reputation: 106

Yes, you can load the model from the last step and resume training from that very step.

If you want to use the model only for inference, it is enough to save the model's state_dict:

torch.save(model.state_dict(), PATH)

And load it as

model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()

However, to resume training you need to save the optimizer's state dict as well. For that purpose, save a checkpoint like this:

torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            ...
            }, PATH)

and load the model for further training as:

model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

model.eval()   # for inference
# - or -
model.train()  # to resume training

It is necessary to save the optimizer state dict, because optimizers such as Adam keep per-parameter buffers (for Adam, running averages of the gradients and their squares) that are updated as the model trains. If you drop them, the optimizer restarts from scratch even though the model weights are restored.
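To see what this state actually contains, here is a minimal runnable sketch; the tiny Linear model is just a placeholder:

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One update so Adam populates its per-parameter buffers.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

state = optimizer.state_dict()
print(state['param_groups'][0]['lr'])  # hyperparameters, e.g. the learning rate
first = next(iter(state['state'].values()))
print(sorted(first.keys()))            # ['exp_avg', 'exp_avg_sq', 'step']

Those exp_avg and exp_avg_sq buffers are exactly what would be lost if you saved only the model's state_dict.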

Upvotes: 2

Bedir Yilmaz

Reputation: 4083

It is necessary to load the state of the optimizer in some cases, for example when a learning rate scheduler is being used.

In that case, restoring the saved state brings the learning rate of the optimizer back to where it was at the checkpoint, instead of restarting the schedule from the beginning.
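For example, a minimal sketch of checkpointing a scheduler alongside the model and optimizer (StepLR, the file name, and the checkpoint keys are just illustrative choices):

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Save the scheduler state next to the model and optimizer states.
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
}, 'checkpoint.pt')

# On restart, restore all three so the learning-rate schedule
# picks up where it left off instead of starting over.
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])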

Upvotes: 1
