maque J

Reputation: 1079

CUDA initialization: Unexpected error from cudaGetDeviceCount()

I was running a deep learning program on my Linux server and I suddenly got this error.

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)

Earlier, when I had just created this conda environment, torch.cuda.is_available() returned True and I could use CUDA and the GPU. But all of a sudden torch.cuda.is_available() returns False and I can no longer use CUDA. What should I do?

P.S. I am using a GeForce RTX 3080 with CUDA 11.0 and PyTorch 1.7.0. It worked before, but now it doesn't.
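
For reference, a quick way to re-run the check from a shell (a minimal sketch, assuming the same conda environment is active):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvidia-smi

If nvidia-smi itself fails, the issue is at the driver level rather than inside PyTorch.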

Upvotes: 55

Views: 117470

Answers (8)

Juan

Reputation: 816

Check the guide here -> https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

  1. Check the status of the fabric manager with:

nvidia-smi -q -i 0 | grep -i -A 2 Fabric

If you see "In Progress":

Fabric
State : In Progress
Status : N/A

then reset the fabric manager service and the GPUs:

  2. Stop the service:

systemctl stop nvidia-fabricmanager.service

  3. Reset the GPUs:

nvidia-smi -r

  4. Start the service again:

systemctl start nvidia-fabricmanager.service
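
To confirm the reset took effect, re-run the status query from step 1; the fabric state should no longer read "In Progress":

nvidia-smi -q -i 0 | grep -i -A 2 Fabric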

Upvotes: 0

404pio

Reputation: 1030

Let me just reboot the k8s node...

In my case the problem was a little more complex: it was not a PC but a server running Kubernetes with the nvidia-container-toolkit, which manages some of the NVIDIA/CUDA libraries inside the container. The key check was to run ls -al /usr/lib/x86_64-linux-gnu/ | grep libcuda in a working and in a non-working container: the CUDA library version that libcuda.so.1 points to should match the driver version on the host:

Working

lrwxrwxrwx  1 root root       12 Mar  8 15:15 libcuda.so -> libcuda.so.1
lrwxrwxrwx  1 root root       21 Mar  8 15:15 libcuda.so.1 -> libcuda.so.525.147.05
-rw-r--r--  1 root root 29867944 Oct 25 20:37 libcuda.so.525.147.05
lrwxrwxrwx  1 root root       29 Mar  8 15:15 libcudadebugger.so.1 -> libcudadebugger.so.525.147.05
-rw-r--r--  1 root root 10490248 Oct 25 20:18 libcudadebugger.so.525.147.05

Not working

lrwxrwxrwx 1 root root        12 Mar  8 15:17 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root        20 Mar  8 15:17 libcuda.so.1 -> libcuda.so.530.30.02
-rw-r--r-- 1 root root  29867944 Oct 25 20:37 libcuda.so.525.147.05
-rw-r--r-- 1 root root  29900840 Feb 22  2023 libcuda.so.530.30.02
lrwxrwxrwx 1 root root        28 Mar  8 15:17 libcudadebugger.so.1 -> libcudadebugger.so.530.30.02
-rw-r--r-- 1 root root  10490248 Oct 25 20:18 libcudadebugger.so.525.147.05
-rw-r--r-- 1 root root  10488936 Feb 16  2023 libcudadebugger.so.530.30.02

Host:

/usr/lib/x86_64-linux-gnu/libcuda.so.525.147.05

To fix the problem, in my case I simply changed the libcuda.so.1 link target from libcuda.so.530.30.02 to libcuda.so.525.147.05. In your situation the driver versions may differ.
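
A sketch of that fix with the versions shown above (adjust the filenames to whatever driver version your host reports):

cd /usr/lib/x86_64-linux-gnu
ln -sf libcuda.so.525.147.05 libcuda.so.1                    # point libcuda.so.1 back at the host's driver version
ln -sf libcudadebugger.so.525.147.05 libcudadebugger.so.1
ldconfig                                                     # refresh the loader cache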

Upvotes: 3

Kunyu Shi

Reputation: 326

For people who are having this issue after updating their driver: you could try

sudo apt-get install nvidia-fabricmanager-535

to bring the Fabric Manager package up to the matching version. Replace 535 with your driver's major version.
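
If you are not sure which driver version is installed, you can query it first (a small sketch; the exact package name depends on your distribution):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # prints e.g. 535.154.05
cat /proc/driver/nvidia/version                               # alternative: reports the loaded kernel module's version
sudo apt-get install nvidia-fabricmanager-535                 # replace 535 with the major version printed above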

Upvotes: 3

K V

Reputation: 59

It can happen. Try reinstalling the NVIDIA driver, then reboot your computer (this is important) and check whether nvidia-smi works.
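
A sketch of that sequence on an Ubuntu-style system (package and tool names are an assumption; adapt to your distribution):

sudo ubuntu-drivers devices                          # list the recommended driver packages
sudo apt-get install --reinstall nvidia-driver-535   # example package name; yours may differ
sudo reboot
nvidia-smi                                           # run after the machine comes back up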

Upvotes: 0

Peter Chiang

Reputation: 130

First check the nvidia-fabricmanager service status:

systemctl status nvidia-fabricmanager

If the nvidia-fabricmanager service is in the active (running) state, it is running properly; otherwise, start it:

systemctl start nvidia-fabricmanager

This worked for me!
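
If the service turns out to be stopped after every reboot, enabling it so that it starts at boot may help (a small addition, assuming a systemd-based system):

sudo systemctl enable --now nvidia-fabricmanager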

Upvotes: 8

ashkan

Reputation: 476

This is my experience:

  • I had PyTorch 1.12, an NVIDIA GeForce RTX 2080, cuda/11.3.1, and cudnn/8.2.4.15-11.4 on my system, and I got the CUDA initialization error.

  • The error was solved just by changing the cuDNN version: I switched to cudnn/8.2.0.53-11.3 and the error was gone (see the sketch after this list).
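
The version strings above look like environment-module names, so on a cluster using Environment Modules or Lmod the switch might look like this (an assumption; adapt to however cuDNN is provided on your system):

module unload cudnn/8.2.4.15-11.4
module load cudnn/8.2.0.53-11.3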

Upvotes: 0

Satya Prakash Dash

Reputation: 1216

Try running nvidia-smi in a different terminal. If you get an error like NVML: Driver/library version mismatch, then follow these steps so that you won't have to reboot:

  1. In a terminal, run lsmod | grep nvidia.
  2. Unload the modules that depend on the nvidia driver:
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
  3. Finally, unload the nvidia module itself: sudo rmmod nvidia.
  4. Now lsmod | grep nvidia should produce no output.
  5. Run nvidia-smi to check that you get the expected output, and you are good to go!
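
The unload steps can also be done in one shot, since rmmod accepts several modules at once (if it refuses because a process still holds the GPU, sudo lsof /dev/nvidia* shows the culprit):

sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
nvidia-smi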

Upvotes: 13

maque J

Reputation: 1079

I just tried rebooting, and the problem was solved. It turned out to be caused by an NVIDIA NVML driver/library version mismatch.

Upvotes: 52
