Aidan Gomez

Reputation: 8627

Common causes of nans during training of neural networks

I've noticed that a frequent occurrence during training is NaNs being introduced.

Oftentimes they seem to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.

Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?

The overarching question here is simply: what is the most common reason for NaNs to occur during training? And secondly, what are some methods for combating this (and why do they work)?

Upvotes: 132

Views: 88710

Answers (8)

so2

Reputation: 322

I have seen NaNs arise in PyTorch, presumably due to a divide-by-zero when gradients are computed through the computation graph. For example, if the computed function is sqrt(x^2), the ideal simplification would be |x| with gradient sign(x), but autograd appears to evaluate the chain rule literally as 2*x / (2*sqrt(x^2)), which works out to 0/0 = NaN at x = 0 (though I couldn't confirm this with further debugging).
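
A minimal PyTorch sketch that reproduces this 0/0 behaviour at x = 0 (the variable names are illustrative):

import torch

# sqrt(x^2) is mathematically |x|, but autograd applies the chain rule:
# d/dx sqrt(u) = 0.5 / sqrt(u) evaluates to inf at u = 0, and multiplying
# by du/dx = 2x = 0 gives inf * 0 = nan.
x = torch.zeros(1, requires_grad=True)
y = torch.sqrt(x ** 2)
y.backward()
print(x.grad)  # tensor([nan])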

Upvotes: 0

Prajwal_Surendra

Reputation: 41

Okay, so this is a very common problem with neural networks, specifically with RNNs.

If your model is a CNN or any other NN, please continue reading the wonderful and detailed explanations by @shai and @desertnaut on this page.

However, if the error is related to LSTM specifically, continue here.

  1. First things first, check whether your input has nan or np.inf values. Usually this is not the case, but make sure these are not present.

  2. The most probable reason is that you are using relu as an activation function. LSTMs, or RNNs in general for that matter, don't work well with relu. Read here for further explanation: Problems of relu.

  3. So the solution for the exploding gradient problem you are encountering can be switching from relu to tanh, which is the default activation in LSTMs.

  4. You can also try reducing the learning rate, which mitigates exploding gradients.

  5. One last thing you can do if you are bent on using relu is gradient clipping, so that the gradients do not explode during backpropagation (reduce the value of clipvalue in your case). You can also try using dropout, but it will not be very helpful compared to the other suggestions. A sketch combining these fixes follows this list.
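
Below is a minimal sketch of points 3-5 using the tf.keras API (the layer sizes, learning rate, and clip value are illustrative assumptions, not values from the answer):

import tensorflow as tf

model = tf.keras.Sequential([
    # tanh is the default LSTM activation; stated explicitly here for clarity
    tf.keras.layers.LSTM(64, activation="tanh", input_shape=(None, 10)),
    tf.keras.layers.Dense(1),
])

# Reduced learning rate plus gradient clipping via clipvalue
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipvalue=0.5)
model.compile(optimizer=optimizer, loss="mse")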

Hope this helps and best of luck :)

Upvotes: 1

Steven H

Reputation: 127

One more solution for anyone stuck like I just was:

I was receiving nan or inf losses on a network I set up with float16 dtype across the layers and input data. After all else failed, it occurred to me to switch back to float32, and the nan losses were solved!

So, bottom line: if you switched the dtype to float16, change it back to float32.
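
A minimal sketch of this fix, assuming a tf.keras setup (the dummy data is illustrative):

import numpy as np
import tensorflow as tf

# Make new layers default to float32 instead of float16
tf.keras.backend.set_floatx("float32")

# Cast the input data back to float32 as well (dummy data here)
x_train = np.random.rand(32, 10).astype(np.float32)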

Upvotes: 6

izady

Reputation: 61

In my case, not setting the bias in the convolution/deconvolution layers was the cause.

Solution: add the following to the convolution layer parameters.

bias_filler {
  type: "constant"
  value: 0
}

Upvotes: 6

Shai

Reputation: 114786

I came across this phenomenon several times. Here are my observations:


Gradient blow up

Reason: large gradients throw the learning process off-track.

What you should expect: Looking at the runtime log, watch the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually it becomes too large to be represented by a floating point variable and turns into nan.

What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
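
A tiny numpy illustration (not from the answer) of the overflow path described above: the loss grows past the float32 range, becomes inf, and a subsequent indeterminate operation such as inf - inf yields nan.

import numpy as np

loss = np.float32(1e30)
for _ in range(4):
    loss = loss * np.float32(1e3)  # loss keeps growing each "iteration"
    print(loss)                    # 1e+33, 1e+36, inf, inf (overflows past ~3.4e38)

print(loss - loss)                 # inf - inf = nan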


Bad learning rate policy and params

Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead; this invalid rate multiplies all updates, thus invalidating all parameters.

What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:

... sgd_solver.cpp:106] Iteration 0, lr = -nan

What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and you forget to define the max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.


Faulty Loss function

Reason: Sometimes the computation of the loss in the loss layers causes nans to appear, for example, feeding the InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.

What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: See if you can reproduce the error, add printouts to the loss layer, and debug the error.

For example: once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced nans. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
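
A toy numpy sketch (not the author's actual loss layer) of how normalizing by the per-batch label frequency produces 0/0 when a label is absent:

import numpy as np

num_labels = 3
labels = np.array([0, 0, 1, 1])                       # label 2 never appears in this batch
per_label_penalty = np.array([1.2, 0.7, 0.0])
counts = np.bincount(labels, minlength=num_labels).astype(float)

normalized = per_label_penalty / counts               # 0.0 / 0.0 for the absent label
print(normalized)                                     # [0.6  0.35  nan]
print(normalized.sum())                               # nan propagates to the total loss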


Faulty input

Reason: you have an input with nan in it!

What you should expect: once the learning process "hits" this faulty input, the output becomes nan. Looking at the runtime log you probably won't notice anything unusual: the loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan. A sketch of a direct scan over an HDF5 dataset follows.
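
A minimal sketch of such a direct scan over an HDF5 input file (the file name and dataset key are illustrative assumptions):

import h5py
import numpy as np

with h5py.File("train_data.h5", "r") as f:     # hypothetical file name
    data = f["data"][...]                      # hypothetical dataset key
    bad = ~np.isfinite(data)
    if bad.any():
        print("found %d non-finite values" % bad.sum())
    else:
        print("all values are finite")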


stride larger than kernel size in "Pooling" layer

For some reason, choosing stride > kernel_size for pooling may result in nans. For example:

layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}

results in nans in y.


Instabilities in "BatchNorm"

It was reported that under some settings the "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.


Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print more debug information (including gradient magnitudes and activation values) to the log during training. This information can help in spotting gradient blowups and other problems in the training process.

Upvotes: 183

The learning rate was too high and should be decreased. The accuracy in my RNN code was nan; selecting a lower value for the learning rate fixed it.

Upvotes: 2

Shai

Reputation: 114786

This answer is not about a cause of nans, but rather proposes a way to help debug them. You can use this python layer:

import caffe
import numpy as np


class checkFiniteLayer(caffe.Layer):
  def setup(self, bottom, top):
    self.prefix = self.param_str  # printed with every error message

  def reshape(self, bottom, top):
    pass

  def forward(self, bottom, top):
    # Check the forward activations (bottom blobs) for non-finite values
    for i in range(len(bottom)):
      isbad = np.sum(1 - np.isfinite(bottom[i].data[...]))
      if isbad > 0:
        raise Exception("checkFiniteLayer: %s forward pass bottom %d has %.2f%% non-finite elements" %
                        (self.prefix, i, 100 * float(isbad) / bottom[i].count))

  def backward(self, top, propagate_down, bottom):
    # Check the gradients (top diffs) for non-finite values
    for i in range(len(top)):
      if not propagate_down[i]:
        continue
      isf = np.sum(1 - np.isfinite(top[i].diff[...]))
      if isf > 0:
        raise Exception("checkFiniteLayer: %s backward pass top %d has %.2f%% non-finite elements" %
                        (self.prefix, i, 100 * float(isf) / top[i].count))

Add this layer to your train_val.prototxt at points you suspect may cause trouble:

layer {
  type: "Python"
  name: "check_loss"
  bottom: "fc2"
  top: "fc2"  # "in-place" layer
  python_param {
    module: "/path/to/python/file/check_finite_layer.py" # must be in $PYTHONPATH
    layer: "checkFiniteLayer"
    param_str: "prefix-check_loss" # string for printouts
  }
}

Upvotes: 4

LKB

Reputation: 177

I was trying to build a sparse autoencoder and had several layers in it to induce sparsity. While running my net, I encountered NaNs. On removing some of the layers (in my case, I actually had to remove just one), I found that the NaNs disappeared. So I guess too much sparsity may lead to NaNs as well (perhaps some 0/0 computations were invoked!?)

Upvotes: 0
