3nomis

Reputation: 1613

Pytorch fails with CUDA error: device-side assert triggered on Colab

I am trying to initialize a tensor on Google Colab with GPU enabled.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

t = torch.tensor([1,2], device=device)

But I am getting this strange error.

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1

Even setting that environment variable to 1 does not seem to show any further details.
Anyone ever had this issue?
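For reference, a sketch of how the variable is usually set from inside a notebook: it generally only takes effect if it is set before the first CUDA call, so it has to go at the top of the notebook (after a runtime restart), not after the error has already appeared.

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before CUDA is initialized

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
t = torch.tensor([1, 2], device=device)
print(t)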

Upvotes: 90

Views: 339249

Answers (18)

Matt

Reputation: 96

I had this issue in Colab trying to use torch's load_inline function. Switching from Colab to another GPU provider fixed it for me, in case anyone else is stuck trying to do the same thing.

Upvotes: 0

Vasundhara Acharya

Reputation: 11

Change the runtime to CPU and run the same code. It will clearly mention what the issue is. Then you can fix the issue and change the runtime to GPU again. It will work.

Upvotes: 0

hmadinei

Reputation: 41

I switched to the CPU to find the error, which was an index mismatch. After fixing it, I switched back to the GPU.

Upvotes: 2

lam vu Nguyen

Reputation: 631

In my case, it was caused by passing a parameter value higher than the maximum that is accepted.

https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2#transformers.SeamlessM4Tv2Model.generate.speaker_id

I ran generate with a speaker_id equal to config.vocoder_num_spkrs (200), which raised RuntimeError: CUDA error: device-side assert triggered. I then tried a speaker_id that had been accepted before, but it still raised the same RuntimeError: CUDA error.

I closed the process (Ctrl+C) and reopened it; after that it ran fine again.

Upvotes: 0

Wilson Westbrook

Reputation: 1

I am coming from the VQGAN+CLIP "AI art" community. I get this error when I already have a session running in another tab. Killing all sessions from the session manager clears it up and lets you connect with the new tab, which is nice if you have fiddled with a lot of settings you don't want to lose.

Upvotes: 0

SarthakJain

Reputation: 1686

While I tried your code and it did not give me an error, I can say that the usual best practice for debugging CUDA runtime errors like your device-side assert is to switch Colab to CPU and recreate the error. That will give you a more useful traceback.

Most of the time, CUDA runtime errors are caused by some index mismatch, for example trying to train a network with 10 output nodes on a dataset with 15 labels. And the thing with this CUDA error is that once you get it, you will receive it for every operation you do with torch tensors. This forces you to restart your notebook.

I suggest you restart your notebook, get a more accurate traceback by moving to CPU, and check the rest of your code, especially if you train a model on a set of targets somewhere.

To gain a clearer insight into the typical utilization of GPUs in PyTorch applications, I recommend exploring deep learning projects on GitHub. Websites such as repo-rift.com can be particularly useful for this purpose. They allow you to perform text searches with queries like "How does this paper use GPU", which can help you pinpoint the exact usage of CUDA in specific lines of code within extensive repositories.
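A minimal sketch (not the asker's code) of the index-mismatch case described above: a 10-class head given a label that only makes sense with 15 classes. On the CPU the loss call fails with a readable IndexError naming the bad target; on the GPU the same line surfaces as the opaque device-side assert.

import torch
import torch.nn as nn

device = torch.device('cpu')            # switch to 'cuda' to see the device-side assert instead
model = nn.Linear(32, 10).to(device)    # network with 10 output nodes
inputs = torch.randn(4, 32, device=device)
targets = torch.tensor([1, 5, 9, 14], device=device)   # label 14 would need 15 classes

loss = nn.CrossEntropyLoss()(model(inputs), targets)   # CPU: "IndexError: Target 14 is out of bounds"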

Upvotes: 119

questionto42

Reputation: 9532

A full restart with fresh memory fixed it in Jupyter Notebook on my own server; perhaps it also helps if you are on Colab (untested).

I guess this amounts to the same thing as the other answer about restarting the runtime, just done by hand.

Upvotes: -1

JuanK

Reputation: 26

I had the same issue while fine-tuning a tiny autoregressive model. The problem was caused in the dataloader, where I was writing -100 into the last position of the label tensor.

labels[:, -1] = -100  # Typically, -100 is used to ignore the loss calculation at specific positions

I removed this line and the problem was solved.
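For context, a hedged sketch (not JuanK's training code) of why a -100 label can be harmless or fatal depending on what consumes it: nn.CrossEntropyLoss skips it by default (ignore_index=-100), but anything that uses the labels as indices cannot handle a negative value.

import torch
import torch.nn as nn

logits = torch.randn(3, 5)
labels = torch.tensor([1, 4, -100])

# Fine: -100 positions are skipped by the loss (ignore_index defaults to -100).
print(nn.CrossEntropyLoss()(logits, labels))

# Fatal: using the same labels as indices, e.g. in an embedding lookup, is out
# of range. On the CPU this is a plain IndexError; on the GPU it becomes
# "CUDA error: device-side assert triggered".
# nn.Embedding(5, 8)(labels)   # uncomment to reproduce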

Upvotes: 0

AMAN SWARAJ

Reputation: 15

I also ran into a similar error, and the problem was a label mismatch: my train set and test set had different label counts, which is what caused this error.

Upvotes: 0

Bumped into the same issue when using the Transformers Trainer. In my case, it was caused by a mismatch between the model's embedding size and the tokenizer length. Here's what solved the issue for me:

model.resize_token_embeddings(len(tokenizer))

The mismatch was caused by adding a pad token:

tokenizer.add_special_tokens({'pad_token': '<pad>'})
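Putting the two lines together, a minimal sketch of the pattern (the checkpoint name "gpt2" is just a placeholder): adding the pad token grows the tokenizer, so the embedding matrix has to be resized to match, otherwise the new token id indexes past the embedding table and triggers the assert.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_special_tokens({'pad_token': '<pad>'})   # vocabulary grows by one
model.resize_token_embeddings(len(tokenizer))          # keep the model's embedding table in sync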

Upvotes: 17

Naveen Mathew

Reputation: 372

This is an open-ended question for most people who land on this page because the underlying issue is different in each case. In my case, the error appeared on Colab Pro when I tried to run this notebook: https://colab.research.google.com/drive/1SRclU2pcgzCkVXpmhKppVbGW4UcCs5xT?usp=sharing at the supervised_finetuning_trainer.train() step.

If there's someone like me who could not move the computation to the CPU instead of the GPU (mostly because the error stack trace led to a different package like transformers, ..., leading all the way back to pytorch), here's the approach to get a more accurate stack trace:

https://github.com/huggingface/transformers/blob/ad78d9597b224443e9fe65a94acc8c0bc48cd039/docs/source/en/troubleshooting.md?plain=1#L110

Credits: sgugger on GitHub.

Upvotes: 2

Mahmood Hussain

Reputation: 501

In my case, I first tried to run my computations on the CPU to find the actual issue. It turned out that my image transforms were wrong: I was applying an unnecessary transformation to my mask image.

Upvotes: 0

yuchen2727

Reputation: 41

I had the same problem on Colab as well. If your code runs normally on device("cpu"), try deleting the current Colab runtime and restarting it. This worked for me.

Upvotes: 4

Hoyeol Kim

Reputation: 219

Double-check the GPU index. Normally it should be gpu=0 unless you have more than one GPU.
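A small sketch, assuming the standard single-GPU Colab runtime: the only valid CUDA device index is 0.

import torch

print(torch.cuda.device_count())     # 1 on a standard Colab GPU runtime
device = torch.device('cuda:0')      # valid
# device = torch.device('cuda:1')    # fails with one GPU: invalid device ordinal
t = torch.tensor([1, 2], device=device)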

Upvotes: 3

yun li

Reputation: 11

I also encountered this problem and found the reason: the vocabulary size was 8000, but the embedding size in my model was set to 5000.
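A hedged reconstruction of this failure mode: an embedding table sized for 5000 tokens fed ids from an 8000-token vocabulary.

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=5000, embedding_dim=128)   # model only accepts ids < 5000
token_ids = torch.tensor([12, 4999, 7500])                   # 7500 comes from the 8000-token vocabulary
emb(token_ids)   # CPU: "IndexError: index out of range in self"; GPU: device-side assert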

Upvotes: 1

Tiffany Zhao

Reputation: 93

Maybe, I mean in some cases,

it is because you forgot to add a sigmoid activation before sending the logits to BCELoss.

Hope it can help :P
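A minimal sketch of that case: nn.BCELoss expects probabilities in [0, 1], so raw logits trip the "input >= 0 && input <= 1" assert on the GPU. Either apply a sigmoid first or use nn.BCEWithLogitsLoss.

import torch
import torch.nn as nn

logits = torch.randn(4, 1) * 5                         # raw logits, not constrained to [0, 1]
targets = torch.randint(0, 2, (4, 1)).float()

# nn.BCELoss()(logits, targets)                        # fails; on the GPU this is the device-side assert
print(nn.BCELoss()(torch.sigmoid(logits), targets))    # ok
print(nn.BCEWithLogitsLoss()(logits, targets))         # usually preferred: more numerically stable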

Upvotes: 3

Shaida Muhammad

Reputation: 1650

1st time:

Got the same error while using the simpletransformers library to fine-tune a transformer-based model for a multi-class classification problem. simpletransformers is a library written on top of the transformers library.

I changed my labels from string representations to numbers and it worked.

2nd time:

Faced the same error again while training another transformer-based model with the transformers library, for text classification. I had 4 labels in the dataset, named 0, 1, 2, and 3. But in the last layer (linear layer) of my model class, I had two neurons, nn.Linear(*, 2), which I had to replace with nn.Linear(*, 4) because I had four labels in total.
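A hedged sketch of the second case (the 768 hidden size is just a placeholder): four labels (0-3) need four output neurons, otherwise labels 2 and 3 are out-of-range class indices for the loss.

import torch
import torch.nn as nn

num_labels = 4
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(),
                      nn.Linear(256, num_labels))      # was nn.Linear(256, 2)
logits = model(torch.randn(8, 768))
labels = torch.randint(0, num_labels, (8,))
loss = nn.CrossEntropyLoss()(logits, labels)           # fine once the sizes match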

Upvotes: 5

tschomacker

Reputation: 804

As the other respondents indicated: running it on the CPU reveals the error. My target labels were {1, 2}; I changed them to {0, 1}. This solved it for me.
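A small sketch of the remapping, assuming binary targets stored as {1, 2}: with a two-class output, CrossEntropyLoss only accepts class indices 0 and 1.

import torch

labels = torch.tensor([1, 2, 2, 1])
labels = labels - 1           # {1, 2} -> {0, 1}
print(labels)                 # tensor([0, 1, 1, 0])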

Upvotes: 10
