user13153466

Pytorch loss is nan

I'm trying to write my first neural network with PyTorch. Unfortunately, I run into a problem when I compute the loss. I get the following error message:

RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.

So I tried debugging and found something strange. The input has no NaNs or infs, as I verify with the following:

print(torch.any(torch.isnan(inputs)))

But if I print the intermediate result x after each step inside the model, I see that inf values appear at some point.
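
A NaN check alone does not rule out infs or extremely large finite values (the raw input printed below contains finite values around 1e+227), so the same tensor can also be checked with:

print(torch.any(torch.isinf(inputs)))  # True if any +/-inf is present
print(inputs.abs().max())              # largest magnitude in the input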

training

inputs, labels = data
print(torch.any(torch.isnan(inputs)))
optimizer.zero_grad()
outputs = model(inputs)
print(outputs)
loss = criterion(outputs, labels)
print(f"epoch: {epoch + 1} loss: {loss.item()}")
loss.backward()

optimizer.step()
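
For reference, the error message quoted at the top is the kind autograd's anomaly detection produces; a minimal sketch of enabling it so that backward() points at the operation that first returned NaN:

# enable once, before the training loop
torch.autograd.set_detect_anomaly(True)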

model

import torch
from torch.nn import Conv1d, Linear, MaxPool1d, Module, ReLU

class Net(Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = Conv1d(in_channels=1, out_channels=5, kernel_size=5, stride=2, dtype=torch.float64)
        self.act1 = ReLU()
        self.pool1 = MaxPool1d(2)
        self.layer2 = Conv1d(in_channels=5, out_channels=1, kernel_size=2, dtype=torch.float64)
        self.fcl1 = Linear(1350, 16, dtype=torch.float64)

    def forward(self, x):
        print("raw", x)
        x = self.layer1(x)
        print("conv1d 1", x)
        x = self.act1(x)
        print("relu", x)
        x = self.layer2(x)
        print("conv1d 2", x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        print("pools", x)
        x = self.fcl1(x)
        print("linear", x)
        return x

output

tensor(False)
raw tensor([[9.0616e+227, 2.4353e-152,  1.0294e-71,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00]], dtype=torch.float64)
conv1d 1 tensor([[   -inf,    -inf,    -inf,  ..., -0.2516, -0.2516, -0.2516],
        [    inf,     inf,     inf,  ...,  0.3377,  0.3377,  0.3377],
        [   -inf,    -inf,    -inf,  ...,  0.4285,  0.4285,  0.4285],
        [   -inf,    -inf,    -inf,  ..., -0.1230, -0.1230, -0.1230],
        [    inf,     inf,     inf,  ...,  0.3793,  0.3793,  0.3793]],
       dtype=torch.float64, grad_fn=<SqueezeBackward1>)
relu tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [   inf,    inf,    inf,  ..., 0.3377, 0.3377, 0.3377],
        [0.0000, 0.0000, 0.0000,  ..., 0.4285, 0.4285, 0.4285],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [   inf,    inf,    inf,  ..., 0.3793, 0.3793, 0.3793]],
       dtype=torch.float64, grad_fn=<ReluBackward0>)
conv1d 2 tensor([[        -inf,         -inf,         -inf,  ..., -5.4167e+265,
         -5.4167e+265, -5.4167e+265]], dtype=torch.float64,
       grad_fn=<SqueezeBackward1>)
pools tensor([[        -inf, -5.4167e+265, -5.4167e+265,  ..., -5.4167e+265,
         -5.4167e+265, -5.4167e+265]], dtype=torch.float64,
       grad_fn=<SqueezeBackward1>)
linear tensor([[inf, inf, -inf, -inf, -inf, inf, inf, inf, inf, inf, inf, -inf, inf, inf, -inf, -inf]],
       dtype=torch.float64, grad_fn=<AddmmBackward0>)
tensor([[inf, inf, -inf, -inf, -inf, inf, inf, inf, inf, inf, inf, -inf, inf, inf, -inf, -inf]],
       dtype=torch.float64, grad_fn=<AddmmBackward0>)
epoch: 1 loss: nan

Thanks for helping.

Upvotes: 7

Views: 15380

Answers (3)

Brian

Reputation: 31272

Common causes for NaN loss

  1. the learning rate is too high
  2. faulty input (see the checks below)
  3. a faulty loss function (not likely when using built-in functions)

import numpy

# check if the input contains zeros
not numpy.all(x)  # True if there are zeros, otherwise False

# check if the input contains NaNs
numpy.any(numpy.isnan(x))  # True if there are NaNs, otherwise False
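
Since the question works with PyTorch tensors, a tensor-side sketch of the same checks (assuming x is a torch.Tensor):

import torch

torch.any(x == 0)          # True if the input contains zeros
torch.any(torch.isnan(x))  # True if the input contains NaNs
torch.any(torch.isinf(x))  # True if the input contains infs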

Upvotes: 2

Nitin Kishore

Reputation: 41

If it is an error in the 0th output, your weights are not yet fully updated, so some values in your first predictions are NaNs. So it's not your inputs but your model's predictions that are NaN, which could be an overflow or underflow error. This will make any loss function return tensor(nan). What you can do is put in a check for when the loss is NaN and let the weights adjust themselves:

criterion = SomeLossFunc()
eps = 1e-6
loss = criterion(preds, targets)
if loss.isnan():
    # replace a NaN loss with a small constant tensor so backward() does not propagate NaNs
    loss = torch.full_like(loss, eps)
loss = loss + L1_loss + ...
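
A common alternative (a sketch, not part of the answer above): skip the backward/step for any batch whose loss is NaN instead of substituting a constant:

import torch

loss = criterion(preds, targets)
if not torch.isnan(loss):
    loss.backward()
    optimizer.step()
# otherwise: skip this batch entirely and move on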

Upvotes: 2

ki-ljl

Reputation: 509

Sorry, my reputation is not enough for me to comment directly. This may be caused by exploding gradients due to an excessive learning rate. It is recommended that you reduce the learning rate or use weight_decay.
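
A minimal sketch of both suggestions; the optimizer type and the concrete values here are assumptions, since the question does not show how the optimizer is constructed:

import torch
from torch.optim import SGD

# smaller learning rate, plus L2 regularization via weight_decay
optimizer = SGD(model.parameters(), lr=1e-4, weight_decay=1e-5)

# not mentioned in the answer, but a common companion remedy: clip gradients
# between loss.backward() and optimizer.step() to limit exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)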

Upvotes: 6
