Reputation: 663
I created a Google VM instance using this available image:
c1-deeplearning-common-cu100-20191226
Description
Google, Deep Learning Image: Base, m39 (with CUDA 10.0), A Debian based image with CUDA 10.0
I then installed Anaconda on this VM and installed PyTorch with the following command, as recommended by the PyTorch website:
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
(this corresponds to Linux, Python 3.7, CUDA 10.1)
From Python, I ran this code to check the GPU detection:
import torch
torch.cuda.is_available()
False
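For reference, the CUDA toolkit that the installed PyTorch build targets can be printed from the same environment and compared with the CUDA version reported by nvidia-smi below (a quick diagnostic one-liner, nothing environment-specific assumed):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"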
From the nvidia-smi tool, this is the result even while the main training code is running:
(base) redexces.bf@tensorflow-1x-2x:~$ nvidia-smi
Thu Jan 2 01:33:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    22W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Clearly, there are no running processes nor any memory allocated.
This problem appears to be specific to PyTorch: the same VM also has tensorflow-gpu installed in a separate conda environment, and it recognizes and uses the GPU as I would expect.
Am I missing any pieces? Again, the same CUDA driver and image work fine for TensorFlow.
Upvotes: 3
Views: 1464
Reputation: 663
I was able to resolve the issue. Not being a computer science guy, I figured it could be a CUDA compatibility issue: the PyTorch build I installed targets the CUDA 10.1 toolkit, while the deep learning image ships CUDA 10.0 and its matching driver. I created another VM instance, but this time, instead of using the public image noted earlier, I used the gcloud command line to request a deep learning image with CUDA 10.1. This made it all work as expected.
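For reference, a minimal sketch of such a gcloud invocation, assuming the common-cu101 image family in the deeplearning-platform-release project; the instance name, zone, accelerator type, and disk size are placeholders to adjust:
gcloud compute instances create pytorch-cu101-vm \
    --zone=us-west1-b \
    --image-family=common-cu101 \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --accelerator="type=nvidia-tesla-p4,count=1" \
    --metadata="install-nvidia-driver=True" \
    --boot-disk-size=200GB
With a CUDA 10.1 image, the cudatoolkit=10.1 build installed by the conda command above matches the driver, and torch.cuda.is_available() returns True.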
Upvotes: 3