Nikhil Panse

Reputation: 21

GPU program failed to execute : cublas runtime error

I am trying to train a network with PyTorch on a CUDA-enabled GeForce GTX 1070 GPU. I don't understand the error below, and I haven't found a similar problem reported anywhere. I don't know whether it's a CUDA issue or something in my code.

Traceback (most recent call last):
  File "main.py", line 497, in <module>
    main()
  File "main.py", line 167, in main
    train(train_loader, model, criterion, optimizer, epoch, normalizer)
  File "main.py", line 244, in train
    output = model(*input_var)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\1546544\Desktop\ML\model.py", line 147, in forward
    atom_fea = conv_func(atom_fea, nbr_fea, nbr_fea_idx)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\1546544\Desktop\ML\model.py", line 66, in forward
    total_gated_fea = self.fc_full(total_nbr_fea)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\linear.py", line 55, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py", line 837, in linear
    output = input.matmul(weight.t())
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\autograd\variable.py", line 386, in matmul
    return torch.matmul(self, other)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\functional.py", line 192, in matmul
    output = torch.mm(tensor1, tensor2)
RuntimeError: cublas runtime error : the GPU program failed to execute at C:/Anaconda2/conda-bld/pytorch_1519496000060/work/torch/lib/THC/THCBlas.cu:247

Upvotes: 2

Views: 2866

Answers (1)

Uzzal Podder

Reputation: 3205

I faced the same problem.

I fixed it by correcting the labels in my dataset. The training labels were incorrect, which is why training failed during the backward() pass.

So it may help to check that the labels are what you expect after loading them from disk or a database.
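As a rough illustration of that check, here is a minimal sketch. It assumes a classification setup where each label should be an integer class index in the range 0 to num_classes - 1. The function name `find_invalid_labels` and the sample values are made up for this example and are not from my actual code.

```python
def find_invalid_labels(labels, num_classes):
    """Return (position, label) pairs for labels outside [0, num_classes - 1].

    Hypothetical helper: assumes integer class-index labels.
    """
    invalid = []
    for position, label in enumerate(labels):
        if not 0 <= label < num_classes:
            invalid.append((position, label))
    return invalid


# Made-up example data: the labels 3 and -1 are out of range for 3 classes.
labels = [0, 2, 1, 3, -1, 2]
invalid = find_invalid_labels(labels, num_classes=3)
print(invalid)  # [(3, 3), (4, -1)]
```

If the labels are already in a PyTorch tensor, you could pass `labels.tolist()` to the same function. Running a check like this on the CPU right after loading the data tends to give a clearer message than the error raised on the GPU.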

Upvotes: 1
