Reputation: 919
I am observing some strange behavior from Keras. I am training a small model whose training loss becomes NaN only at the end of the first epoch.
For example, if I have 100 batches and I stop training at batch 99, then resume for another 99 batches, it trains fine. But whenever training reaches the end of an epoch, the loss always turns to NaN.
I am using a custom loss function:
from keras import backend as K

def corr(x, y):
    # Pearson-style correlation between the two tensors
    xc = x - K.mean(x)
    yc = y - K.mean(y)
    r_num = K.mean(xc * yc)
    r_den = K.std(x) * K.std(y)
    return r_num / r_den
And I have tried all of the standard tricks: dropping my learning rate, clipping the norm and value of my gradients, and increasing the batch size. Only by increasing the batch size to something unrealistic like 100,000 (I have 1 million data points) does training actually continue past an epoch, but I would like to understand what happens at the end of the epoch that causes this behavior. I have also tried different optimizers (currently Adam), and I ran this on different systems to make sure it wasn't a problem with my one computer.
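In Keras, that clipping and the lowered learning rate are set on the optimizer itself; a minimal sketch, with placeholder values rather than the exact ones I used:

from keras.optimizers import Adam

# Placeholder values: lr lowers the learning rate, clipnorm/clipvalue cap the gradients
opt = Adam(lr=1e-4, clipnorm=1.0, clipvalue=0.5)
model.compile(optimizer=opt, loss=corr)  # model is the network summarized below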
My input and output is one dimensional and my model is summarized below.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_7 (InputLayer)         (None, 1)                 0
_________________________________________________________________
dense_7 (Dense)              (None, 100)               200
_________________________________________________________________
dense_8 (Dense)              (None, 100)               10100
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 101
=================================================================
Total params: 10,401
Trainable params: 10,401
Non-trainable params: 0
_________________________________________________________________
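(For reference, a functional-API sketch that reproduces these parameter counts; the relu activations are an assumption, since the summary does not show them:)

from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(1,))                 # (None, 1), 0 params
h = Dense(100, activation='relu')(inputs)  # 1*100 + 100 = 200 params
h = Dense(100, activation='relu')(h)       # 100*100 + 100 = 10,100 params
outputs = Dense(1)(h)                      # 100*1 + 1 = 101 params
model = Model(inputs=inputs, outputs=outputs)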
Does Keras do something special at the end of an epoch? I couldn't find anything other than the standard logger callback. I also wrote a custom callback that evaluates my model after each batch and stores the output, and when I plot that over time it does not appear to blow up or do anything strange. It just looks like it is slowly improving, and then the training dies.
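A sketch of that per-batch monitoring callback, where x_val and y_val are placeholder names for the held-out data it evaluates against:

from keras.callbacks import Callback

class BatchMonitor(Callback):
    # Evaluates the model on held-out data after every batch and stores the loss
    def __init__(self, x_val, y_val):
        super(BatchMonitor, self).__init__()
        self.x_val = x_val
        self.y_val = y_val
        self.losses = []

    def on_batch_end(self, batch, logs=None):
        # self.model is attached by Keras before training starts
        self.losses.append(self.model.evaluate(self.x_val, self.y_val, verbose=0))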
Upvotes: 1
Views: 1027
Reputation: 33460
It is probably caused by a division by zero in the loss function. Make sure the denominator is always positive by adding a small constant to it. You can use K.epsilon() for this purpose:
return r_num / (r_den + K.epsilon())
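Put together, a guarded version of the loss might look like this (the only change from the original is the epsilon in the denominator):

from keras import backend as K

def corr(x, y):
    xc = x - K.mean(x)
    yc = y - K.mean(y)
    r_num = K.mean(xc * yc)
    r_den = K.std(x) * K.std(y)
    # K.epsilon() (1e-7 by default) keeps the division finite when a batch
    # has (near-)zero variance in x or y
    return r_num / (r_den + K.epsilon())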
Upvotes: 2