Reputation: 919
I am observing some strange behavior from Keras. I am training a small model whose training loss becomes NaN only at the end of the first epoch.
For example, if I have 100 batches and I stop training at batch 99, then resume for another 99 batches, it trains fine. But whenever training reaches the end of an epoch, the loss always turns to NaN.
I am using a custom loss function:
from keras import backend as K

def corr(x, y):
    # Pearson-style correlation between the two tensors
    xc = x - K.mean(x)
    yc = y - K.mean(y)
    r_num = K.mean(xc * yc)
    r_den = K.std(x) * K.std(y)
    return r_num / r_den
And I have tried all of the standard tricks: dropping my learning rate, clipping the norm and value of my gradients, and increasing the batch size. Only by increasing the batch size to something unrealistic like 100,000 (I have 1 million data points) does training actually continue past an epoch, but I would like to understand what happens at the end of the epoch that causes this behavior. I have also tried different optimizers (currently Adam), and I ran this on different systems to make sure it wasn't a problem with my one computer.
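In Keras, that clipping and the lowered learning rate are set on the optimizer itself; a minimal sketch, with placeholder values rather than the exact ones I used:

from keras.optimizers import Adam

# Placeholder values: lr lowers the learning rate, clipnorm/clipvalue cap the gradients
opt = Adam(lr=1e-4, clipnorm=1.0, clipvalue=0.5)
model.compile(optimizer=opt, loss=corr)  # model is the network summarized below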
My input and output is one dimensional and my model is summarized below.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_7 (InputLayer)         (None, 1)                 0
_________________________________________________________________
dense_7 (Dense)              (None, 100)               200
_________________________________________________________________
dense_8 (Dense)              (None, 100)               10100
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 101
=================================================================
Total params: 10,401
Trainable params: 10,401
Non-trainable params: 0
_________________________________________________________________
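(For reference, a functional-API sketch that reproduces these parameter counts; the relu activations are an assumption, since the summary does not show them:)

from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(1,))                 # (None, 1), 0 params
h = Dense(100, activation='relu')(inputs)  # 1*100 + 100 = 200 params
h = Dense(100, activation='relu')(h)       # 100*100 + 100 = 10,100 params
outputs = Dense(1)(h)                      # 100*1 + 1 = 101 params
model = Model(inputs=inputs, outputs=outputs)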
Does Keras do something special at the end of an epoch? I couldn't find anything other than the standard logger callback. I also wrote a custom callback that evaluates my model after each batch and stores the output, and when I plot that over time it does not appear to blow up or do anything strange. It just looks like it is slowly improving, and then the training dies.
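A sketch of that per-batch monitoring callback, where x_val and y_val are placeholder names for the held-out data it evaluates against:

from keras.callbacks import Callback

class BatchMonitor(Callback):
    # Evaluates the model on held-out data after every batch and stores the loss
    def __init__(self, x_val, y_val):
        super(BatchMonitor, self).__init__()
        self.x_val = x_val
        self.y_val = y_val
        self.losses = []

    def on_batch_end(self, batch, logs=None):
        # self.model is attached by Keras before training starts
        self.losses.append(self.model.evaluate(self.x_val, self.y_val, verbose=0))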
Upvotes: 1
Views: 1027
Reputation: 33460
It is probably caused by a division by zero in the loss function. Make sure the denominator is always positive by adding a small constant to it. You can use K.epsilon() for this purpose:
return r_num / (r_den + K.epsilon())
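Put together, a guarded version of the loss might look like this (the only change from the original is the epsilon in the denominator):

from keras import backend as K

def corr(x, y):
    xc = x - K.mean(x)
    yc = y - K.mean(y)
    r_num = K.mean(xc * yc)
    r_den = K.std(x) * K.std(y)
    # K.epsilon() (1e-7 by default) keeps the division finite when a batch
    # has (near-)zero variance in x or y
    return r_num / (r_den + K.epsilon())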
Upvotes: 2