Swind D.C. Xu

Reputation: 225

How to solve nan loss?

Problem

I'm training a deep neural network on MNIST, where the loss is defined as follows:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, label))

The program seems to run correctly until I get a NaN loss somewhere after the 10,000th minibatch. Sometimes the program runs correctly until it finishes. I think tf.nn.softmax_cross_entropy_with_logits is giving me this error. This is strange, because the code just contains mul and add operations.

Possible Solution

Maybe I can use:

if cost == "nan":
  optimizer = an empty optimizer 
else:
  ...
  optimizer = real optimizer

But I cannot find the type of NaN. How can I check whether a variable is NaN or not?

How else can I solve this problem?

Upvotes: 9

Views: 38208

Answers (4)

ForrestZhang

Reputation: 634

I found a similar problem here: TensorFlow cross_entropy NaN problem

Thanks to the author user1111929:

tf.nn.softmax_cross_entropy_with_logits => -tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. In some samples, certain classes can be excluded with certainty after a while, resulting in y_conv = 0 for that sample. That's normally not a problem, since you're not interested in those, but with cross_entropy written that way it yields 0*log(0) for that particular sample/class. Hence the NaN.
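
A minimal NumPy sketch of that failure mode, with made-up values for y_ and y_conv:

import numpy as np

y_ = np.array([0.0, 1.0], dtype=np.float32)      # one-hot label (illustrative)
y_conv = np.array([0.0, 1.0], dtype=np.float32)  # predicted class probabilities

with np.errstate(divide='ignore', invalid='ignore'):
    cross_entropy = -np.sum(y_ * np.log(y_conv))  # 0 * log(0) = 0 * (-inf) = nan

print(cross_entropy)  # nan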

Replacing it with

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))

Or

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved the NaN problem.

Upvotes: 9

Greg K

Reputation: 81

The reason you are getting NaNs is most likely that somewhere in your cost function or softmax you are trying to take the log of zero, which is not a number. To answer your specific question about detecting NaN, Python has built-in support for this in the math module. For example:

import math

val = float('nan')
if math.isnan(val):
    print('Detected NaN')
    import pdb; pdb.set_trace()  # Break into the debugger to look around
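
In this question's setup, the value you test would come from evaluating the cost tensor (for example, the result of a sess.run call); that result can be a Python float or a NumPy array, and np.isnan covers the array case. A small sketch with placeholder values standing in for the evaluated loss:

import math
import numpy as np

scalar_loss = float('nan')                         # e.g. a Python float returned by evaluating the cost
batch_losses = np.array([0.3, float('nan'), 0.1])  # e.g. a per-example loss vector

if math.isnan(scalar_loss):
    print('Scalar loss is NaN')

if np.isnan(batch_losses).any():                   # np.isnan works element-wise on arrays
    print('At least one per-example loss is NaN')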

Upvotes: 8

Ilyakom

Reputation: 190

Check your learning rate. The bigger your network, the more parameters there are to learn, which means you also need to decrease the learning rate.
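
A minimal sketch of what that adjustment looks like, assuming the TF 1.x-style optimizer API the question appears to use; the variable and loss below are just stand-ins:

import tensorflow as tf  # assumes the TF 1.x API used elsewhere in this question

w = tf.Variable(1.0)       # stand-in for the network's parameters
cost = tf.square(w - 3.0)  # stand-in for the softmax cross-entropy loss

# If the loss blows up to inf/NaN, drop the step size by an order of magnitude
# (e.g. 1e-2 -> 1e-3 or 1e-4) before trying anything fancier.
learning_rate = 1e-4
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)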

Upvotes: 7

Fematich

Reputation: 1618

I don't have your code or data, but tf.nn.softmax_cross_entropy_with_logits should be stable with a valid probability distribution (more info here). I assume your data does not meet this requirement; an analogous problem was also discussed here. That leads you to either:

  1. Implement your own softmax_cross_entropy_with_logits function, e.g. try (source):

    # `logits`, `labels`, and `shape` here come from your own model and input pipeline
    epsilon = tf.constant(value=0.00001, shape=shape)
    logits = logits + epsilon
    softmax = tf.nn.softmax(logits)
    cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])
    
  2. Update your data so that it does have a valid probability distribution (a quick sanity check is sketched below)
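
For option 2, a quick NumPy sanity check on hypothetical label data, verifying that each label row is non-negative and sums to 1:

import numpy as np

# Hypothetical labels as fed to softmax_cross_entropy_with_logits: each row
# should be a valid probability distribution (non-negative, summing to 1).
labels = np.array([[0., 1., 0.],
                   [1., 0., 0.]], dtype=np.float32)

assert np.all(labels >= 0), 'labels contain negative entries'
assert np.allclose(labels.sum(axis=1), 1.0), 'label rows do not sum to 1'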

Upvotes: 2
