Reputation: 1079
I was running a deep learning program on my Linux server and I suddenly got this error.
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)
Earlier, right after I created this conda environment, torch.cuda.is_available() returned True and I could use CUDA and the GPU. But all of a sudden I could no longer use CUDA, and torch.cuda.is_available() returned False. What should I do?
P.S. I use a GeForce RTX 3080 with CUDA 11.0 + PyTorch 1.7.0. It worked before, but now it doesn't.
Upvotes: 55
Views: 117470
Reputation: 816
Check the guide here -> https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
Check the fabric state:
nvidia-smi -q -i 0 | grep -i -A 2 Fabric
If you see "In Progress":
Fabric
    State  : In Progress
    Status : N/A
Reset the fabricmanager service and the GPUs. Stop the service:
systemctl stop nvidia-fabricmanager.service
Reset the GPUs:
nvidia-smi -r
Start the service again:
systemctl start nvidia-fabricmanager.service
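After the restart you can re-run the same query; on a healthy node the fabric State is expected to read "Completed" rather than "In Progress" (see the states described in the guide above):
# Re-check the fabric state after restarting the service
nvidia-smi -q -i 0 | grep -i -A 2 Fabric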
Upvotes: 0
Reputation: 1030
"Let me reboot the k8s node..." In my case the problem was a little more complex, because it was not a PC but a server running Kubernetes (k8s) with the nvidia-container-toolkit. The toolkit manages some of the NVIDIA/CUDA libraries inside the containers. The key check was to run
ls -al /usr/lib/x86_64-linux-gnu/ | grep libcuda
in a working and in a non-working container; the libcuda version the container links against should match the driver version on the host:
Working
lrwxrwxrwx 1 root root 12 Mar 8 15:15 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 21 Mar 8 15:15 libcuda.so.1 -> libcuda.so.525.147.05
-rw-r--r-- 1 root root 29867944 Oct 25 20:37 libcuda.so.525.147.05
lrwxrwxrwx 1 root root 29 Mar 8 15:15 libcudadebugger.so.1 -> libcudadebugger.so.525.147.05
-rw-r--r-- 1 root root 10490248 Oct 25 20:18 libcudadebugger.so.525.147.05
Not working
lrwxrwxrwx 1 root root 12 Mar 8 15:17 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 20 Mar 8 15:17 libcuda.so.1 -> libcuda.so.530.30.02
-rw-r--r-- 1 root root 29867944 Oct 25 20:37 libcuda.so.525.147.05
-rw-r--r-- 1 root root 29900840 Feb 22 2023 libcuda.so.530.30.02
lrwxrwxrwx 1 root root 28 Mar 8 15:17 libcudadebugger.so.1 -> libcudadebugger.so.530.30.02
-rw-r--r-- 1 root root 10490248 Oct 25 20:18 libcudadebugger.so.525.147.05
-rw-r--r-- 1 root root 10488936 Feb 16 2023 libcudadebugger.so.530.30.02
Host:
/usr/lib/x86_64-linux-gnu/libcuda.so.525.147.05
To fix the problem, in my situation I simply changed the libcuda.so.1 symlink target from libcuda.so.530.30.02 to libcuda.so.525.147.05. In your situation the driver versions may be different.
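A minimal sketch of that fix, assuming the library path from the listings above and a host driver of 525.147.05 (substitute your own versions):
# Inside the broken container: point libcuda.so.1 at the build that matches the host driver
ln -sf libcuda.so.525.147.05 /usr/lib/x86_64-linux-gnu/libcuda.so.1
# Refresh the dynamic linker cache so the new target is picked up
ldconfig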
Upvotes: 3
Reputation: 326
For people who are having this issue after updating your driver: you can try
sudo apt-get install nvidia-fabricmanager-535
to bring the fabric manager library to the matching version. Replace 535 with your driver version.
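If you are not sure which driver branch you are on (and therefore which fabricmanager package matches), a quick check is:
# Prints the installed driver version, e.g. 535.129.03 -> nvidia-fabricmanager-535
nvidia-smi --query-gpu=driver_version --format=csv,noheader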
Upvotes: 3
Reputation: 59
It can happen. Try reinstalling the NVIDIA driver, then reboot your computer (it's important), and check whether nvidia-smi works.
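A sketch on Ubuntu, assuming the packaged 535-series driver (use whatever package matches your setup):
# Reinstall the driver package, reboot, then confirm the driver loads
sudo apt-get install --reinstall nvidia-driver-535
sudo reboot
nvidia-smi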
Upvotes: 0
Reputation: 130
First check the nvidia-fabricmanager service status:
systemctl status nvidia-fabricmanager
If you see that the nvidia-fabricmanager service is in the active (running) state, it is running properly; otherwise start it:
systemctl start nvidia-fabricmanager
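Once the service is running, it is worth re-running the checks from the question (a sketch, assuming the same conda environment):
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"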
This worked for me!
Upvotes: 8
Reputation: 476
This is my experience: I had PyTorch 1.12, an NVIDIA GeForce RTX 2080, cuda/11.3.1, and cudnn/8.2.4.15-11.4 on my system, and I got the CUDA initialization error. The error was solved simply by changing the cudnn version: I used cudnn/8.2.0.53-11.3 instead, and the error was gone.
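The cuda/... and cudnn/... names look like environment modules, so on such a system the change might look like this (a sketch; module names taken from the answer above, your site's names may differ):
# Swap the cuDNN module for the build that matches CUDA 11.3
module unload cudnn/8.2.4.15-11.4
module load cudnn/8.2.0.53-11.3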
Upvotes: 0
Reputation: 1216
Try to run nvidia-smi in a different terminal, and if you get an error like NVML: Driver/library version mismatch, then basically you have to follow these steps so you won't have to reboot:
1. lsmod | grep nvidia
2. sudo rmmod nvidia_drm
3. sudo rmmod nvidia_modeset
4. sudo rmmod nvidia_uvm
5. sudo rmmod nvidia
6. Run lsmod | grep nvidia again; you should get nothing in the terminal output.
7. Run nvidia-smi to check that you get the desired output, and you are good to go!
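If one of the rmmod commands complains that the module is still in use, something is holding the driver open; a quick way to find it (a sketch, assuming the standard /dev/nvidia* device nodes):
# List the processes that keep the NVIDIA device nodes open
sudo fuser -v /dev/nvidia*
Stop or kill those processes (or the services that own them), then retry the rmmod steps.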
Upvotes: 13
Reputation: 1079
I just tried rebooting. Problem solved. It turned out to be caused by an NVIDIA NVML driver/library version mismatch.
Upvotes: 52