chaohuang

Reputation: 4115

Resume training with multi_gpu_model in Keras

I'm training a modified InceptionV3 model with the multi_gpu_model in Keras, and I use model.save to save the whole model.

Then I closed and restarted the IDE and used load_model to reinstantiate the model.

The problem is that I am not able to resume the training exactly where I left off.

Here is the code:

parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
history = parallel_model.fit_generator(generate_batches(path),
                                       steps_per_epoch=num_images / batch_size,
                                       epochs=num_epochs)
model.save('my_model.h5')

Before I closed the IDE, the loss was around 0.8.

After restarting the IDE, reloading the model, and re-running the code above, the loss jumped to about 1.5.

But according to the Keras FAQ, model.save should save the whole model (architecture + weights + optimizer state), and load_model should return a compiled model identical to the previous one.

So I don't understand why the loss becomes larger after resuming the training.

EDIT: If I don't use the multi_gpu_model and just use the ordinary model, I'm able to resume exactly where I left off.
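A minimal sketch of that single-GPU round trip, using a tiny stand-in model rather than my actual InceptionV3 setup (the model, data, and file name here are placeholders): model.save stores architecture, weights, and optimizer state, and load_model gives back a compiled model with identical weights.

```python
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

# Tiny stand-in model; the real case uses a modified InceptionV3.
model = Sequential([Dense(4, activation='relu', input_shape=(8,)),
                    Dense(2, activation='softmax')])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# Dummy data just to give the optimizer some state.
x = np.random.rand(16, 8)
y = np.eye(2)[np.random.randint(0, 2, 16)]
model.fit(x, y, epochs=1, verbose=0)

model.save('tiny_model.h5')             # architecture + weights + optimizer state
restored = load_model('tiny_model.h5')  # compiled, ready to resume training

# Weights match exactly after the round trip.
for w_old, w_new in zip(model.get_weights(), restored.get_weights()):
    assert np.array_equal(w_old, w_new)
```

Resuming training on `restored` continues from the saved loss, which is exactly what fails for me once multi_gpu_model is involved.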

Upvotes: 3

Views: 1089

Answers (2)

Chuong Nguyen

Reputation: 39

@saul19am When you compile it, only the weights and the model structure are loaded; the optimizer state is still lost. I think this can help.

Upvotes: 0

saul19am

Reputation: 11

When you call multi_gpu_model(...), Keras automatically resets the weights of your model to some default values (at least in version 2.2.0, which I am currently using). That's why you were not able to resume training from the point where you saved it.

I just solved the issue by replacing the weights of the parallel model with the weights from the original single-GPU model:

parallel_model = multi_gpu_model(model, gpus=2)
# The original model is nested as a layer inside the parallel wrapper;
# you can check its index with parallel_model.summary().
parallel_model.layers[-2].set_weights(model.get_weights())
parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
history = parallel_model.fit_generator(generate_batches(path),
                                       steps_per_epoch=num_images / batch_size,
                                       epochs=num_epochs)

I hope this will help you.

Upvotes: 1
