Reputation: 6841
After training a PyTorch model on a GPU for several hours, the program fails with the error
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Training Conditions
The model is an nn.LSTM followed by an nn.Linear output. The state passed into forward() has the shape (32, 20, 15), where 32 is the batch size.
My code also has the following values set before the training began:
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(0)
How can we troubleshoot this problem? Since this occurred 8 hours into the training, an educated guess would be very helpful here!
Thanks!
Update:
Commenting out the 2 torch.backends.cudnn... lines did not work. CUDNN_STATUS_INTERNAL_ERROR still occurs, but much earlier, at around Episode 300 (585,000 steps).
torch.manual_seed(0)
#torch.backends.cudnn.deterministic = True
#torch.backends.cudnn.benchmark = False
np.random.seed(0)
Error Traceback
RuntimeError Traceback (most recent call last)
<ipython-input-18-f5bbb4fdfda5> in <module>
57
58 while not done:
---> 59 action = agent.choose_action(state)
60 state_, reward, done, info = env.step(action)
61 score += reward
<ipython-input-11-5ad4dd57b5ad> in choose_action(self, state)
58 if np.random.random() > self.epsilon:
59 state = T.tensor([state], dtype=T.float).to(self.q_eval.device)
---> 60 actions = self.q_eval.forward(state)
61 action = T.argmax(actions).item()
62 else:
<ipython-input-10-94271a92f66e> in forward(self, state)
20
21 def forward(self, state):
---> 22 lstm, hidden = self.lstm(state)
23 actions = self.fc1(lstm[:,-1:].squeeze(1))
24 return actions
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
575 result = self._slow_forward(*input, **kwargs)
576 else:
--> 577 result = self.forward(*input, **kwargs)
578 for hook in self._forward_hooks.values():
579 hook_result = hook(self, input, result)
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\nn\modules\rnn.py in forward(self, input, hx)
571 self.check_forward_args(input, hx, batch_sizes)
572 if batch_sizes is None:
--> 573 result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
574 self.dropout, self.training, self.bidirectional, self.batch_first)
575 else:
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Update: Tried try... except around the code where this error occurs, and in addition to RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR, we also get a second traceback for the error RuntimeError: CUDA error: unspecified launch failure
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-4-e8f15cc8cf4f> in <module>
61
62 while not done:
---> 63 action = agent.choose_action(state)
64 state_, reward, done, info = env.step(action)
65 score += reward
<ipython-input-3-1aae79080e99> in choose_action(self, state)
58 if np.random.random() > self.epsilon:
59 state = T.tensor([state], dtype=T.float).to(self.q_eval.device)
---> 60 actions = self.q_eval.forward(state)
61 action = T.argmax(actions).item()
62 else:
<ipython-input-2-6d22bb632c4c> in forward(self, state)
25 except Exception as e:
26 print('error in forward() with state:', state.shape, 'exception:', e)
---> 27 print('state:', state)
28 actions = self.fc1(lstm[:,-1:].squeeze(1))
29 return actions
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\tensor.py in __repr__(self)
152 def __repr__(self):
153 # All strings are unicode in Python 3.
--> 154 return torch._tensor_str._str(self)
155
156 def backward(self, gradient=None, retain_graph=None, create_graph=False):
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in _str(self)
331 tensor_str = _tensor_str(self.to_dense(), indent)
332 else:
--> 333 tensor_str = _tensor_str(self, indent)
334
335 if self.layout != torch.strided:
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in _tensor_str(self, indent)
227 if self.dtype is torch.float16 or self.dtype is torch.bfloat16:
228 self = self.float()
--> 229 formatter = _Formatter(get_summarized_data(self) if summarize else self)
230 return _tensor_str_with_formatter(self, indent, formatter, summarize)
231
~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in __init__(self, tensor)
99
100 else:
--> 101 nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
102
103 if nonzero_finite_vals.numel() == 0:
RuntimeError: CUDA error: unspecified launch failure
Upvotes: 20
Views: 74227
Reputation: 955
I was able to resolve this issue by upgrading CUDA and cuDNN. In my case, I had been running my program on an outdated Nvidia Docker image with CUDA 11.8 and cuDNN 8:
nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
The issue disappeared when I upgraded to a more recent Nvidia Docker image (as of Oct '24), which has CUDA 12.6 and the cuDNN devel variant:
nvidia/cuda:12.6.2-cudnn-devel-ubuntu22.04
A full list of Nvidia Docker images is available here. For those who aren't using Docker, you can try upgrading your CUDA and cuDNN versions manually!
Upvotes: 0
Reputation: 301
Personally, I believe this problem is related to GPU memory allocation. My workaround is somewhat unconventional, but it works for me most of the time. I'm working on an Ubuntu 22.04 LTS workstation, and from time to time either this problem occurs or "torch.cuda.OutOfMemoryError: CUDA out of memory" appears. Either problem persists no matter whether my program calls "gc.collect()" or "torch.cuda.empty_cache()".
However, whenever I have these problems, I only have to use TeamViewer (currently version 15.45.3, installed on my Ubuntu workstation) to connect to another machine, operate it for a few seconds, then disconnect and quit TeamViewer. Magically, my program works again and usually keeps working for a long time. Sometimes merely quitting my browser (which usually utilizes the GPU) also works.
The logic behind it, I believe, is that TeamViewer itself causes GPU memory allocation and deallocation, which changes the GPU memory conditions and bypasses the unknown root problem that is difficult to debug, as Michael's answer indicated. My guess is that other programs that use the GPU might also help, but I didn't spend time checking, so if anyone has found other workarounds, please kindly share them. Thanks.
Upvotes: 1
Reputation: 756
For me it was because two processes from the previous run somehow weren't killed properly and were still occupying two GPUs, causing the same cuDNN error.
The error disappeared after killing these two processes.
Upvotes: 1
Reputation: 21
This might not work for everyone, as there could be other factors involved, like workers, the installed CUDA version, and more.
For me, a system restart fixed it on my Windows 11 machine with an NVIDIA GeForce RTX 3070 with 8 GB of memory. My machine had been on for days, with many programs moving in and out of the GPU.
Upvotes: 0
Reputation: 1171
Anyone coming across this error, as well as other cuDNN/GPU-related errors, should try moving the model and inputs to the CPU; the CPU runtime generally has much better error reporting and will enable you to debug the issue.
In my experience, the majority of the time the error comes from an invalid index passed to an embedding.
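A minimal sketch of that debugging approach; the embedding model and indices below are hypothetical stand-ins, not taken from the question:
import torch
import torch.nn as nn

# Hypothetical model and input, only to illustrate the technique:
# an embedding fed an out-of-range index, which on the GPU often
# surfaces as an opaque cuDNN/CUDA error.
model = nn.Embedding(num_embeddings=10, embedding_dim=4)
bad_indices = torch.tensor([3, 7, 12])  # 12 is out of range (valid: 0..9)

# On the CPU, the kernels report a clear index-out-of-range error that
# points at the real bug; with a GPU model you would call model.cpu()
# and input.cpu() first.
try:
    out = model(bad_indices)
except (IndexError, RuntimeError) as e:
    print("CPU run reported:", e)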
Upvotes: 10
Reputation: 21
I ran into the same problem and resolved it by downgrading cudatoolkit to version 10.1, so try reinstalling PyTorch with cudatoolkit 10.1:
conda install pytorch torchvision cudatoolkit=10.1
Upvotes: 2
Reputation: 32972
The error RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR is notoriously difficult to debug, but surprisingly often it's an out-of-memory problem. Usually you would get the out-of-memory error, but depending on where it occurs, PyTorch cannot intercept the error and therefore cannot provide a meaningful error message.
A memory issue seems likely in your case, because you are using a while loop that runs until the agent is done, which might take long enough that you run out of memory; it's just a matter of time. That can also happen rather late, when the model's parameters in combination with a certain input make the episode take too long to finish.
You can avoid that scenario by limiting the number of allowed actions instead of hoping that the actor will be done in a reasonable time.
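A minimal sketch of such a cap, mirroring the loop shown in the traceback; DummyEnv, DummyAgent and max_steps below are illustrative stand-ins, not the question's actual classes:
import random

# Tiny stand-ins so the sketch runs on its own; replace with the real env/agent.
class DummyEnv:
    def reset(self):
        return [0.0]
    def step(self, action):
        # returns (next_state, reward, done, info)
        return [0.0], 1.0, random.random() < 0.01, {}

class DummyAgent:
    def choose_action(self, state):
        return 0

env, agent = DummyEnv(), DummyAgent()

max_steps = 10_000  # hard cap, so one episode cannot run (and allocate) forever
state = env.reset()
score, done = 0.0, False

for step in range(max_steps):
    action = agent.choose_action(state)
    state_, reward, done, info = env.step(action)
    score += reward
    state = state_
    if done:
        break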
What you also need to be careful about is that you don't occupy unnecessary memory. A common mistake is to keep computing gradients of the past states in future iterations. The state from the last iteration should be considered constant, since the current action should not affect past actions, therefore no gradients are required. This is usually achieved by detaching the state from the computational graph for the next iteration, e.g. state = state_.detach(). Maybe you are already doing that, but without the code it's impossible to tell.
Similarly, if you keep a history of the states, you should detach them and, even more importantly, put them on the CPU, i.e. history.append(state.detach().cpu()).
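A minimal sketch of both points; the random tensors below merely stand in for the real states produced by the environment and model (only the state/state_/history names echo the answer):
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

history = []
state = torch.zeros(32, 20, 15, device=device)  # shape taken from the question

for step in range(100):
    # ... in the real code, state_ would come from the environment/model;
    # a random tensor with gradients enabled stands in for it here ...
    state_ = torch.randn(32, 20, 15, device=device, requires_grad=True)

    # Keep the history detached and on the CPU, so it holds on to neither
    # the computational graph nor GPU memory.
    history.append(state_.detach().cpu())

    # Treat the previous iteration's state as a constant for the next step:
    # detach it so no gradients flow back through past iterations.
    state = state_.detach()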
Upvotes: 34