Reputation: 407
I am using AWS to train a CNN on a custom dataset. I launched a p2.xlarge instance, uploaded my (Python) scripts to the virtual machine, and I am running my code via the CLI.
I activated a virtual environment for TensorFlow(+Keras2) with Python3 (CUDA 10.0 and Intel MKL-DNN), which was a default option via AWS.
I am now running my code to train the network, but it feels like the GPU is not 'activated'. The training goes just as fast (slow) as when I run it locally with a CPU.
This is the script that I am running:
https://github.com/AntonMu/TrainYourOwnYOLO/blob/master/2_Training/Train_YOLO.py
I also tried to alter it by putting with tf.device('/device:GPU:0'): after the parser (line 142) and indenting everything underneath it. However, this doesn't seem to have changed anything.
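Concretely, the change I tried looks roughly like this (a sketch of the pattern, not the exact script):
import tensorflow as tf

# ... argument parsing as in Train_YOLO.py (around line 142) ...

with tf.device('/device:GPU:0'):
    # model construction and training indented under this block,
    # e.g. building the Keras model and calling model.fit(...)
    ...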
Any tips on how to activate the GPU (or check if the GPU is activated)?
Upvotes: 5
Views: 15180
Reputation: 1
Option 1) pre-installed drivers e.g. "AWS Deep Learning Base GPU AMI (Ubuntu 20.04)"
This AMI is documented at: https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/ and can be found in the AWS EC2 web UI "Launch instance" page by searching for "gpu" under the "Quickstart AMIs" section (their search is terrible btw). I believe it is maintained by Amazon.
I have tested it on a g5.xlarge, documented at: https://aws.amazon.com/ec2/instance-types/g5/ which I believe is currently the most powerful single-Nvidia-GPU machine available (Nvidia A10G) as of December 2023. Make sure to use a US region as they are cheaper there; us-east-1 (North Virginia) was one of the cheapest when I checked, at 1.006 USD / hour, so a negligible cost for most people in a developed country. Just make sure to shut down the VM each time so you don't keep paying!!!
Another working alternative is g4dn.xlarge, which is the cheapest GPU machine at 0.526 USD / hour on us-east-1 and runs an Nvidia T4, but I don't think there's much point in it: it is just half the price of the most powerful GPU choice, so why not go for the most powerful one, which might save you some of your precious time by making such interactive experiments faster? This one should only be a consideration when optimizing deployment costs.
Also, to get access to g5.xlarge, you first have to request your vCPU limit to be increased to 4, as per: You have requested more vCPU capacity than your current vCPU limit of 0, since the GPU machines all seem to require at least 4 vCPUs. It is supremely annoying.
Once you finally get the instance and the image, running:
nvidia-smi
just works and returns:
Tue Dec 19 18:43:59 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 18C P8 9W / 300W | 4MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
This means the drivers are working, and from then on I managed to run several pieces of software that use the GPU and watch nvidia-smi show the GPU usage go up.
The documentation page also links to: https://docs.aws.amazon.com/dlami/latest/devguide/gs.html, a guide on the so-called "AWS Deep Learning AMI" (DLAMI), which appears to be a selection of deep learning AMI variants by AWS, though unfortunately many of the ones documented there use Amazon Linux (RPM-based) rather than Ubuntu.
A sample AWS CLI command that launches it is:
aws ec2 run-instances --image-id ami-095ff65813edaa529 --count 1 --instance-type g5.xlarge \
--key-name <yourkey> --security-group-ids sg-<yourgroup>
Option 2) install the drivers yourself on the base Ubuntu image "Ubuntu Server 22.04 LTS (HVM)"
This option adds extra time to the installation, but it has the advantage of giving you a newer Ubuntu and greater understanding of what the image contains. Driver installation on Ubuntu 22.04 was super easy, so this is definitely a viable option.
Just pick the first Ubuntu AMI Amazon suggests when launching an instance and run:
sudo apt update
sudo apt install nvidia-driver-510 nvidia-utils-510
sudo reboot
and from there on nvidia-smi and everything else just works on g5.xlarge.
Related question: https://askubuntu.com/questions/1397934/how-to-install-nvidia-cuda-driver-on-aws-ec2-instance
Upvotes: 0
Reputation: 407
In the end it had to do with my tensorflow package! I had to uninstall tensorflow and install tensorflow-gpu. After that the GPU was automatically activated.
For documentation see: https://www.tensorflow.org/install/gpu
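As a quick sanity check after switching packages, something along these lines (assuming the TensorFlow 1.x-style setup from the question, where the GPU build ships as the separate tensorflow-gpu package) confirms that TensorFlow can actually see the GPU:
import tensorflow as tf

# Both helpers exist in TF 1.x and remain (deprecated) in early TF 2.x.
print(tf.test.is_gpu_available())   # True if a CUDA-capable GPU is usable
print(tf.test.gpu_device_name())    # e.g. '/device:GPU:0', or '' if no GPU is found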
Upvotes: 2
Reputation: 573
Check out this answer for listing available GPUs.
from tensorflow.python.client import device_lib

def get_available_gpus():
    # List every device TensorFlow can see and keep only the GPUs.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']
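For example (the exact device name is an assumption; it depends on the machine):
print(get_available_gpus())
# e.g. ['/device:GPU:0'] with one visible GPU, or [] if TensorFlow only sees the CPU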
You can also use PyTorch's CUDA utilities to check whether CUDA is available, list the current device and, if necessary, set the device.
import torch
print(torch.cuda.is_available())
print(torch.cuda.current_device())
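Setting the device explicitly might look like this (a minimal sketch; GPU index 0 is an assumption):
import torch

if torch.cuda.is_available():
    torch.cuda.set_device(0)          # make GPU 0 the default CUDA device
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')

# Models and tensors are then moved explicitly, e.g. model.to(device)
print(device)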
Upvotes: 3