Faraz H

Reputation: 159

Unable to install NVIDIA driver on various GCP Ubuntu VMs with Tesla K80 GPU

I have followed this GCP guide with Ubuntu 18 and 20 (I have also tried Ubuntu Lite, Debian and CentOS 7) but, unfortunately, after completing the lengthy install I get this:

me@gpu:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

I have tried installing via the script and via the direct downloads from the NVIDIA site for CUDA 10. Ready to pull my hair out if that helps! I don't understand how a company that builds a bazillion GPUs can't make the installation process robust?

I have also tried these recommendations with no luck.

Upvotes: 0

Views: 1604

Answers (2)

If you've installed the driver several times and nvidia-smi still fails to communicate with it, take a look at prime-select.

  1. Run prime-select query to see which profile is currently selected; it should print nvidia or intel.

  2. If it is not nvidia, select it with sudo prime-select nvidia.

  3. If nvidia is already selected, switch to a different profile first, e.g. sudo prime-select intel, and then switch back with sudo prime-select nvidia.

  4. Reboot and check nvidia-smi.
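The steps above can be sketched as a small dry-run helper. This is only an illustration: the function name prime_toggle_plan is made up here, and it assumes prime-select is installed and that intel is an available alternate profile, as in the steps above. It prints the commands for review rather than running them.

```shell
# Print the prime-select commands to run, given the current profile.
# Intended use on the VM:  prime_toggle_plan "$(prime-select query)"
prime_toggle_plan() {
  current="$1"
  if [ "$current" = "nvidia" ]; then
    # Already on nvidia: switch away first so that switching back re-applies it.
    echo "sudo prime-select intel"
  fi
  echo "sudo prime-select nvidia"
  echo "sudo reboot"
}
```

After the reboot, run nvidia-smi again to check whether the driver responds.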

It may also help to run this again:

sudo apt install nvidia-cuda-toolkit

When it finishes, reboot the machine, and nvidia-smi should work then.

In other cases, following these instructions for installing cuDNN and CUDA on VMs works: cuda_11.2_installation_on_Ubuntu_20.04.

Finally, in some cases the problem is caused by unattended-upgrades. Check its settings and adjust them if it is causing unexpected results. The UnattendedUpgrades page documents it for Debian, which I see you have already tested.
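One way to check whether unattended upgrades are switched on is to look for the APT::Periodic::Unattended-Upgrade setting in the apt configuration. The sketch below is a rough check, not a full parser: the function name is hypothetical, and it only treats a non-zero numeric value as "enabled".

```shell
# Succeeds if the apt configuration text on stdin enables periodic
# unattended upgrades (a non-zero numeric interval).
unattended_upgrades_enabled() {
  grep -Eq '^APT::Periodic::Unattended-Upgrade "[1-9][0-9]*";'
}

# Intended use on the VM:
#   apt-config dump | unattended_upgrades_enabled && echo "unattended upgrades are enabled"
```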

Upvotes: 1

Faraz H

Reputation: 159

I was able to get it working. The mistake I was making was not doing the pre-installation steps before running the cuda_10.1.243_418.87.00_linux.run script. I was under the impression the *.run file would do everything for me. It would help if users were told they MUST do the pre-installation steps. Specifically I had to do this for Ubuntu 18:

sudo nano /etc/modprobe.d/blacklist-nouveau.conf
# add these two lines to the file, then save it:
blacklist nouveau
options nouveau modeset=0
# rebuild the initramfs and reboot:
sudo update-initramfs -u
sudo reboot
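For anyone scripting this instead of editing the file by hand, here is a minimal sketch of the same step. The function names are my own, and writing to the default path under /etc needs a root shell. The second function is just a check that nouveau is really gone after the reboot.

```shell
# Write the two nouveau blacklist lines to the given file
# (defaults to the path used above; needs root for /etc).
write_nouveau_blacklist() {
  conf="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
  printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$conf"
}

# Succeeds if lsmod-style text on stdin lists the nouveau module.
nouveau_loaded() {
  grep -q '^nouveau '
}

# Intended use, as root:
#   write_nouveau_blacklist && update-initramfs -u && reboot
# After the reboot:
#   lsmod | nouveau_loaded && echo "nouveau is still loaded"
```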

This seems like a bit of a “hack”, so not sure why nvidia can’t make the installation process more robust? They make a bazillion of these cards. It’s not like some homemade product with a niche user base…

Upvotes: 2
