infinitely_improbable

Reputation: 499

How to get Docker to recognize NVIDIA drivers?

I have a container that loads a Pytorch model. Every time I try to start it up, I get this error:

Traceback (most recent call last):
  File "server/start.py", line 166, in <module>
    start()
  File "server/start.py", line 94, in start
    app.register_blueprint(create_api(), url_prefix="/api/1")
  File "/usr/local/src/skiff/app/server/server/api.py", line 30, in create_api
    atomic_demo_model = DemoModel(model_filepath, comet_dir)
  File "/usr/local/src/comet/comet/comet/interactive/atomic_demo.py", line 69, in __init__
    model = interactive.make_model(opt, n_vocab, n_ctx, state_dict)
  File "/usr/local/src/comet/comet/comet/interactive/functions.py", line 98, in make_model
    model.to(cfg.device)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

I know that nvidia-docker2 is working.

$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Tue Jul 16 22:09:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
|  0%   44C    P0    72W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
|  0%   44C    P0    66W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
|  0%   44C    P0    48W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:3E:00.0 Off |                  N/A |
|  0%   41C    P0    54W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:3F:00.0 Off |                  N/A |
|  0%   42C    P0    48W / 260W |      0MiB / 10989MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
|  0%   42C    P0     1W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, I keep getting the error above.

I've tried the following:

  1. Setting "default-runtime": "nvidia" in /etc/docker/daemon.json (see the sketch after this list)

  2. Using docker run --runtime=nvidia <IMAGE_ID>

  3. Adding the variables below to my Dockerfile:

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
LABEL com.nvidia.volumes.needed="nvidia_driver"
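
For point 1, the standard nvidia-docker2 layout for /etc/docker/daemon.json looks roughly like this (a sketch; the exact runtime path may differ per install):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}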

I expect this container to run; we have a working version in production without these issues. And I know that Docker can find the drivers, as the output above shows. Any ideas?

Upvotes: 30

Views: 39935

Answers (6)

Nopileos

Reputation: 2117

For Docker to use the host's GPU drivers and GPUs, a few steps are necessary.

  • Make sure an NVIDIA driver is installed on the host system
  • Follow the steps here to set up the NVIDIA Container Toolkit
  • Make sure CUDA and cuDNN are installed in the image
  • Run the container with the --gpus flag (as explained in the link above; a quick check follows this list)
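
A quick way to verify these points together, reusing the CUDA base image from your question (a sketch):

docker run --rm --gpus all nvidia/cuda:9.0-base nvidia-smi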

I guess you have done the first three points, since nvidia-docker2 is working. So the missing --gpus flag in your run command could be the issue.

I usually run my containers with the following command

docker run --name <container_name> --gpus all -it <image_name>

The -it flags just make the container interactive and start a bash shell.
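
Once the container is running with the GPUs attached, you can confirm from inside it that PyTorch actually sees them; a minimal check, assuming torch is installed in the image:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"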

Upvotes: 16

5Ke

Reputation: 1289

I had a similar issue, but when checking whether nvidia-smi worked inside the Docker container, I noticed that the CUDA version was missing from the output.

So docker run --gpus all image_name nvidia-smi returned:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|

This was easily fixed using part of chirag's answer:

docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility image_name nvidia-smi

Similarly, this flag allowed me to run the script I had originally been trying to run:

docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility image_name python my_script.py

Upvotes: 1

Rishabh Gupta

Reputation: 21

If you are running your solution on a GPU-powered AWS EC2 machine and are using an EKS-optimized accelerated AMI, as was the case with us, then you are not required to set the runtime to nvidia yourself, since that is already the default runtime of the accelerated AMIs. You can verify this by checking /etc/systemd/system/docker.service.d/nvidia-docker-dropin.conf:

  • ssh into the AWS machine
  • cat /etc/systemd/system/docker.service.d/nvidia-docker-dropin.conf

All that was required was to set these two environment variables, as suggested by Chirag in the answer above and in the NVIDIA Container Toolkit user guide:

  • -e NVIDIA_DRIVER_CAPABILITIES=compute,utility or -e NVIDIA_DRIVER_CAPABILITIES=all
  • -e NVIDIA_VISIBLE_DEVICES=all
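
Since the runtime is already nvidia by default on these AMIs, the run command only needs the two variables; a sketch, with the image name as a placeholder:

docker run -e NVIDIA_DRIVER_CAPABILITIES=all -e NVIDIA_VISIBLE_DEVICES=all <image_name>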

Before reaching the final solution, I also tried setting the runtime in daemon.json. To start with, the AMIs we were using did not have a daemon.json file; they instead contain a key.json file. I tried setting the runtime in both files, but restarting Docker always overwrote the changes in key.json or simply deleted the daemon.json file.

Upvotes: 2

wwcc

Reputation: 31

just use"docker run --gpus all",add "--gpus all" or "--gpus 0" !

Upvotes: 2

Jacob Stern

Reputation: 4587

In my case, I was building from a vanilla Ubuntu base image, i.e.

FROM ubuntu

Changing to an NVIDIA-provided CUDA base image solved the issue for me:

FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04
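
For context, a minimal Dockerfile along these lines (the Python/PyTorch install and the entrypoint are placeholders; adapt them to your own app):

# The CUDA base image provides the user-space CUDA libraries; the driver still comes from the host
FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04

# Placeholder application setup
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install torch

COPY . /app
WORKDIR /app
CMD ["python3", "server/start.py"]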

Upvotes: 3

chirag

Reputation: 583

I got the same error. After trying a number of solutions, I found that the command below worked:

docker run -ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name>
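
On Docker 19.03 and newer, the same thing can be written with the --gpus flag instead of --runtime=nvidia (image name as a placeholder):

docker run -ti --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name>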

Upvotes: 14
