Reputation: 2507
I trained a text classification model consisting of an RNN in TensorFlow 2.0 with the Keras API. I trained this model on multiple GPUs (2) using tf.distribute.MirroredStrategy()
from here. I saved a checkpoint of the model after every epoch using tf.keras.callbacks.ModelCheckpoint('file_name.h5').
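For reference, the setup looks roughly like this (the Embedding/LSTM/Dense layers and train_dataset below are simplified placeholders, not my real model or data):

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # Placeholder RNN classifier; the real model is larger
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(10000, 64),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

# Save the full model to HDF5 after every epoch
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint('file_name.h5')
model.fit(train_dataset, epochs=10, callbacks=[checkpoint_cb])  # train_dataset: my tf.data pipeline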
Now I want to continue training where I left off, on the same number of GPUs, from the last checkpoint I saved. I load the checkpoint inside the tf.distribute.MirroredStrategy() scope like this:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.models.load_model('file_name.h5')
but it throws the following error:
File "model_with_tfsplit.py", line 94, in <module>
model =tf.keras.models.load_model('TF_model_onfull_2_03.h5') # Loading for retraining
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/save.py", line 138, in load_model
return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 187, in load_model_from_hdf5
model._make_train_function()
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 2015, in _make_train_function
params=self._collected_trainable_weights, loss=self.total_loss)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 500, in get_updates
grads = self.get_gradients(loss, params)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 391, in get_gradients
grads = gradients.gradients(loss, params)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
unconnected_gradients)
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 541, in _GradientsHelper
for x in xs
File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/distribute/values.py", line 716, in handle
raise ValueError("`handle` is not available outside the replica context"
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call
Now I am not sure where the problem is. Also, if I do not use the mirrored strategy (i.e. train on a single GPU), training starts from the beginning, but after a few steps it reaches the same accuracy and loss values as before the model was saved. I am not sure whether this behaviour is normal or not.
Thank you! Rishabh Sahrawat
Upvotes: 5
Views: 2910
Reputation: 3160
I solved it similarly to @Srihari Humbarwadi, but with the difference of moving the strategy scope inside the get_model function. A similar approach is described in TF's documentation:
def get_model(strategy):
    with strategy.scope():
        ...
    return model
and call it before training with:
strategy = tf.distribute.MirroredStrategy()
model = get_model(strategy)
model.load_weights('file_name.h5')
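Filled in with a placeholder model (the layer sizes here are assumptions, not from the original question), the whole flow looks roughly like this:

import tensorflow as tf

def get_model(strategy):
    # Build and compile inside the scope so the variables are created
    # as mirrored (per-replica) variables
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(10000, 64),
            tf.keras.layers.LSTM(64),
            tf.keras.layers.Dense(1, activation='sigmoid'),
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy',
                      metrics=['accuracy'])
    return model

strategy = tf.distribute.MirroredStrategy()
model = get_model(strategy)
model.load_weights('file_name.h5')  # restores only the weights, not the optimizer state
model.fit(train_dataset, epochs=10)  # train_dataset is assumed to exist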
Unfortunately, just calling
model = tf.keras.models.load_model('file_name.h5')
does not enable multi-GPU training. My guess is that it is related to the .h5
model format; maybe it works with TensorFlow's native SavedModel (.pb)
format.
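If someone wants to try that, the SavedModel round trip would look roughly like this ('saved_model_dir' is a hypothetical path, and I have not verified that this fixes the distributed loading issue):

# Save in the native SavedModel format instead of HDF5
model.save('saved_model_dir', save_format='tf')

# Later, load it back inside the strategy scope
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.models.load_model('saved_model_dir')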
Upvotes: 1
Reputation: 2642
Create the model under the distribution strategy's scope and then use the load_weights
method. In this example, get_model
returns an instance of tf.keras.Model:
def get_model():
    ...
    return model

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = get_model()
    model.load_weights('file_name.h5')
    model.compile(...)
    model.fit(...)
Upvotes: 1