Reputation: 356
Is there a way to undo the last training step, for example when the loss value is NaN?
...
for step in range(num_epoch):
    _, loss_value = sess.run([train_op, loss])
    if np.isnan(loss_value):
        # something like: sess.undo_last()
        break
...
If there is such a method, does it also work for multi-GPU training?
Upvotes: 2
Views: 343
Reputation: 11968
There's no such thing, but you can get close. In your model, add:
    loss = tf.check_numerics(loss, "loss is NaN or Inf")
This raises a tf.errors.InvalidArgumentError if your loss becomes NaN or Inf. Since the check runs before any backpropagation is computed, no weights are modified (provided train_op is built from the checked loss).
Your example code would look like:
for step in range(num_epoch):
    try:
        sess.run(train_op)
    except tf.errors.InvalidArgumentError:
        # check_numerics found NaN or Inf in the loss
        break
On its own this is unlikely to help much, though: a NaN or Inf loss usually means the model was already in a bad state before that step. Try different activation functions or a simpler model so that training doesn't get there in the first place.
Alternatively, save checkpoints (for example, after every X steps) and, when the error occurs, restore a checkpoint from before it.
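To make the checkpoint idea concrete, here is a minimal, framework-agnostic sketch in plain Python. It is not TensorFlow code: train_with_rollback, toy_step, and the params dict are hypothetical names made up for illustration. In a TF 1.x session, the snapshot and restore steps would roughly correspond to tf.train.Saver's save and restore calls.

```python
import copy
import math


def train_with_rollback(params, step_fn, num_steps, checkpoint_every=10):
    """Run step_fn repeatedly; on a NaN/Inf loss, restore the last snapshot.

    step_fn is assumed to update `params` in place and return the loss.
    Returns (number of steps whose effect was kept, params).
    """
    snapshot = copy.deepcopy(params)  # last known-good parameters
    snapshot_step = 0
    for step in range(num_steps):
        loss = step_fn(params)
        if math.isnan(loss) or math.isinf(loss):
            # Roll back to the last snapshot and stop training.
            params.clear()
            params.update(copy.deepcopy(snapshot))
            return snapshot_step, params
        if (step + 1) % checkpoint_every == 0:
            snapshot = copy.deepcopy(params)
            snapshot_step = step + 1
    return num_steps, params


def toy_step(p):
    """Hypothetical training step: grows w, then 'diverges' once w > 25."""
    p["w"] += 1.0
    return float("nan") if p["w"] > 25 else p["w"]


params = {"w": 0.0}
kept_steps, params = train_with_rollback(params, toy_step, num_steps=100)
print(kept_steps, params)
```

Note that this rolls back to the last checkpoint, not to the step just before the failure, so everything learned since that checkpoint is lost. Snapshotting every step would give a true "undo last step", at the cost of copying all parameters each time.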
Upvotes: 1