Reputation: 398
After rebooting an instance on Tuesday, I first ran into the problem of losing GPU support on an AWS p2.xlarge machine with the Ubuntu Deep Learning AMI.
I have now tested it three times over two days, and a colleague had the same problem, so I guess it is an AWS bug. Still, maybe someone has an idea how to debug it better.
Basically, after shutdown and reboot, the instance no longer has the nvidia module loaded in the kernel. Furthermore, according to dmesg, a different kernel is loaded. All of this happens without me actively causing it.
Here are the steps to reproduce the problem using a fresh instance and no custom code. I am working in Ireland (eu-west-1); the instance was launched in the Availability Zone eu-west-1a:
ubuntu@...:~$ lsmod | grep nvidia
nvidia 16592896 0
ipmi_msghandler 49152 1 nvidia
ubuntu@...:~$ dmesg | less
...
[ 0.000000] Linux version 4.4.0-1075-aws (buildd@lgw01-amd64-035) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #85-Ubuntu SMP Thu Jan 17 17:15:12 UTC 2019 (Ubuntu 4.4.0-1075.85-aws 4.4.167)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1075-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
Tue Mar 19 16:41:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 42C P8 32W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
ubuntu@...:~$ sudo shutdown now
ubuntu@...:~$ lsmod | grep nvidia
(no output)
ubuntu@...:~$ dmesg | less
...
[ 0.000000] Linux version 4.4.0-1077-aws (buildd@lcy01-amd64-021) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 (Ubuntu 4.4.0-1077.87-aws 4.4.170)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1077-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
How can I force the instance to boot with kernel 4.4.0-1075-aws? Since it uses HVM virtualization, I cannot choose a kernel directly in the dialog.
Upvotes: 13
Views: 5087
Reputation: 403
I experienced the same issue on an AWS G5 instance running Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD
with the latest NVIDIA driver (which as of the time of this post is version 535.129.03).
This doesn't happen after every restart, but rather sporadically. This is probably the 4th time it happened to me in the last 3-4 months.
What worked was to simply reinstall the driver (this assumes that all prerequisites for the driver have been met and will likely not work if you are doing a fresh install on a new instance):
# Make sure the instance has the AmazonS3ReadOnlyAccess policy
$ aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ . \
&& chmod +x NVIDIA-Linux-x86_64*.run \
&& sudo CC=/usr/bin/gcc10-cc ./NVIDIA-Linux-x86_64*.run --silent
Then verify that the installation succeeded:
$ nvidia-smi
Thu Nov 2 10:04:01 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
...
Finally, reboot:
sudo reboot
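Since the failure only shows up sporadically, it may help to check whether the driver still responds before reinstalling. The following is a minimal sketch, not part of the original fix: the function name gpu_driver_ok and its optional probe argument are made up for illustration, and it only assumes that nvidia-smi exits with a non-zero status when it cannot talk to the driver.

```shell
#!/usr/bin/env bash
# Sketch: decide whether the NVIDIA driver needs to be reinstalled.
# gpu_driver_ok is a hypothetical helper; the probe command defaults to
# nvidia-smi and can be overridden (mainly useful for testing).
gpu_driver_ok() {
    local probe="${1:-nvidia-smi}"
    # Succeeds only if the probe command exists AND exits with status 0.
    command -v "$probe" >/dev/null 2>&1 && "$probe" >/dev/null 2>&1
}

if gpu_driver_ok; then
    echo "NVIDIA driver responds; no reinstall needed"
else
    echo "NVIDIA driver does not respond; reinstall the driver as shown above"
fi
```

Run after each boot (for example from a login script), this only reports the state; the reinstall itself is still the command from this answer.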
Upvotes: 0
Reputation: 414
On an Amazon EC2 p3 instance built from an Ubuntu 22.04.3 LTS image, I had to run:
sudo apt-get install nvidia-cuda-toolkit
sudo apt-get install ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
Regards, Markus
Upvotes: 0
Reputation: 314
I experienced the same issue, and running the following helped:
sudo apt-get install nvidia-cuda-toolkit
sudo reboot
Good luck!
Upvotes: 4
Reputation: 7740
There seems to be a problem with building older NVIDIA drivers on 4.4.0-107x-aws kernels. You can install newer NVIDIA drivers, which should work fine with the current kernel:
wget http://us.download.nvidia.com/tesla/410.104/NVIDIA-Linux-x86_64-410.104.run
sudo sh ./NVIDIA-Linux-x86_64-410.104.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
According to an AWS representative, the drivers were updated in the Deep Learning AMI on 21/03/2019 [AWS forums].
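To confirm that the missing module is tied to the new kernel, one can check whether an nvidia module was built for the running kernel at all. This is a rough sketch under assumptions: the helper name nvidia_module_present and its optional modules-root argument are illustrative, and it assumes the module file is named nvidia.ko (possibly with a compression suffix) somewhere under /lib/modules/<kernel>, which is where DKMS and the .run installer normally put it.

```shell
#!/usr/bin/env bash
# Sketch: report whether an nvidia kernel module exists for a given kernel.
# nvidia_module_present is a hypothetical helper, not an NVIDIA or AWS tool.
nvidia_module_present() {
    local kernel="$1"
    local root="${2:-/lib/modules}"   # second argument only overrides the search root
    [ -d "$root/$kernel" ] || return 1
    # Matches nvidia.ko as well as compressed variants such as nvidia.ko.xz.
    find "$root/$kernel" -name 'nvidia.ko*' 2>/dev/null | grep -q .
}

running="$(uname -r)"
if nvidia_module_present "$running"; then
    echo "nvidia module found for running kernel $running"
else
    echo "no nvidia module for running kernel $running; the driver has to be rebuilt or reinstalled"
fi
```

On the instance from the question, this would be expected to fail for 4.4.0-1077-aws if the module was only built for 4.4.0-1075-aws. If the driver was installed with --dkms, `dkms status` lists the kernels it was built for.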
Upvotes: 10