Christian

Reputation: 3393

Why does PyTorch not find my NVDIA drivers for CUDA support?

I've added a GeForce GTX 1080 Ti to my machine (running Ubuntu 18.04 and Anaconda with Python 3.7) to utilize the GPU with PyTorch. Both cards are correctly identified:

$ lspci | grep VGA
03:00.0 VGA compatible controller: NVIDIA Corporation GF119 [NVS 310] (reva1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

The NVS 310 handles my 2-monitor setup; I only want to utilize the 1080 Ti for PyTorch. I also installed the latest NVIDIA drivers currently in the repository, and that seems to be fine:

$ nvidia-smi 
Sat Jan 19 12:42:18 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVS 310             Off  | 00000000:03:00.0 N/A |                  N/A |
| 30%   60C    P0    N/A /  N/A |    461MiB /   963MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   41C    P8    10W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

Driver version 390.xx allows running CUDA 9.1 (9.1.85) according to the NVIDIA docs. Since this is also the version in the Ubuntu repositories, I simply installed the CUDA Toolkit with:

$ sudo apt-get install nvidia-cuda-toolkit

And again, this seems to be alright:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

and

$ apt-cache policy nvidia-cuda-toolkit
nvidia-cuda-toolkit:
  Installed: 9.1.85-3ubuntu1
  Candidate: 9.1.85-3ubuntu1
  Version table:
 *** 9.1.85-3ubuntu1 500
        500 http://sg.archive.ubuntu.com/ubuntu bionic/multiverse amd64 Packages
        100 /var/lib/dpkg/status

Lastly, I've installed PyTorch from scratch with conda:

conda install pytorch torchvision -c pytorch

Also no errors as far as I can tell:

$ conda list
...
pytorch                   1.0.0           py3.7_cuda9.0.176_cudnn7.4.1_1    pytorch
...

However, PyTorch doesn't seem to find CUDA:

$ python -c 'import torch; print(torch.cuda.is_available())'
False

In more detail, if I force PyTorch to convert a tensor x to CUDA with x.cuda() I get the error:

Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://...

What am I missing here? I'm new to this, but I think I've already searched the Web quite a bit for caveats such as mismatched NVIDIA driver and CUDA toolkit versions.

EDIT: Some more outputs from PyTorch:

print(torch.cuda.device_count())   # --> 0
print(torch.cuda.is_available())   # --> False
print(torch.version.cuda)          # --> 9.0.176

Upvotes: 18

Views: 56228

Answers (5)

yu yang Jian

Reputation: 7165

In my case, Ubuntu on WSL was used; the WSL version has an influence, see https://github.com/pytorch/pytorch/issues/73487

Upvotes: 0

As mentioned before, you will need to set CUDA_VISIBLE_DEVICES.

If you want to use a single GPU, e.g. device 1, it would be:

CUDA_VISIBLE_DEVICES=1

If you want a more complex setup, you can find more details in the following link: How do I select which GPU to run a job on?
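
For example, a minimal sketch of setting the variable from within Python (it must be set before torch initializes CUDA, so do it before the import; the index "0" is just an illustration):

import os

# CUDA_VISIBLE_DEVICES must be set before CUDA is initialized,
# hence before importing torch. "0" is an example index; use the
# ID of the GPU you want PyTorch to see.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.is_available())   # True if the driver/toolkit setup works
print(torch.cuda.device_count())   # number of visible GPUs (here: 1)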

Upvotes: 0

shivasanjeeva

Reputation: 1

You can load both the data and the model onto a GPU. You can create dataloaders and run them on your local system if it has GPU support, or, for example, on a Kaggle or Colab server. You can change batch_size, num_workers, etc. depending on your system if running it locally.

import torch
from torch.utils.data import DataLoader

def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')

def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl:
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)
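
A quick usage sketch (train_dataset and the batch settings below are placeholders, not part of the original code):

train_dl = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)
device = get_default_device()
train_dl = DeviceDataLoader(train_dl, device)

for batch in train_dl:
    ...  # each batch arrives already moved to `device`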

Upvotes: 0

Ahmed Ktob

Reputation: 3104

I have had the same issue when trying to use PyTorch to train on our server (which has 4 GPUs), so I didn't have the option of just removing the GPUs.

However, I am using Docker and docker-compose to run my training, so I found this PyTorch image from NVIDIA that comes with all the necessary setup. Before you pull the image, make sure to check this page to determine which image tag is compatible with your NVIDIA driver version (if you pull the wrong one, it won't work).

Then, in your docker-compose file, you can specify which GPUs to use as follows:

version: '3.5'

services:
  training:
    build:
      context: ""
      dockerfile: Dockerfile
    container_name: training
    environment:
      - CUDA_VISIBLE_DEVICES=0,2
    ipc: "host"

Make sure to set ipc to "host", which will allow your Docker container to use the host's shared memory instead of the (insufficient) amount allocated to Docker by default.
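
Once the container is running, a quick check like this (a sketch, run inside the container) confirms that the GPUs selected via CUDA_VISIBLE_DEVICES are actually visible to PyTorch:

import torch

# With CUDA_VISIBLE_DEVICES=0,2 the container should report two devices,
# re-indexed as cuda:0 and cuda:1.
print(torch.cuda.is_available())
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))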

Upvotes: 1

prosti

Reputation: 46291

Since you have two graphics cards, selecting a card ID with CUDA_VISIBLE_DEVICES=GPU_ID should fix the problem, as per this explanation.
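
For example, following the question's own one-liner style (1 is the 1080 Ti's index as reported by nvidia-smi above; note that CUDA's own device ordering can differ from nvidia-smi's, so the index may need adjusting):

$ CUDA_VISIBLE_DEVICES=1 python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'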

Upvotes: 1
