Reputation: 499
I have a container that loads a PyTorch model. Every time I try to start it up, I get this error:
Traceback (most recent call last):
File "server/start.py", line 166, in <module>
start()
File "server/start.py", line 94, in start
app.register_blueprint(create_api(), url_prefix="/api/1")
File "/usr/local/src/skiff/app/server/server/api.py", line 30, in create_api
atomic_demo_model = DemoModel(model_filepath, comet_dir)
File "/usr/local/src/comet/comet/comet/interactive/atomic_demo.py", line 69, in __init__
model = interactive.make_model(opt, n_vocab, n_ctx, state_dict)
File "/usr/local/src/comet/comet/comet/interactive/functions.py", line 98, in make_model
model.to(cfg.device)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
return self._apply(convert)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
param.data = fn(param.data)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
_check_driver()
File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
I know that nvidia-docker2 is working:
$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Tue Jul 16 22:09:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:1A:00.0 Off | N/A |
| 0% 44C P0 72W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:1B:00.0 Off | N/A |
| 0% 44C P0 66W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:1E:00.0 Off | N/A |
| 0% 44C P0 48W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:3E:00.0 Off | N/A |
| 0% 41C P0 54W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... Off | 00000000:3F:00.0 Off | N/A |
| 0% 42C P0 48W / 260W | 0MiB / 10989MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... Off | 00000000:41:00.0 Off | N/A |
| 0% 42C P0 1W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
However, I keep getting the error above.
I've tried the following:
- Setting "default-runtime": "nvidia" in /etc/docker/daemon.json (a sketch of that file is below this list)
- Using docker run --runtime=nvidia <IMAGE_ID>
- Adding the variables below to my Dockerfile:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
LABEL com.nvidia.volumes.needed="nvidia_driver"
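For reference, a daemon.json for this setup typically looks roughly like the following (a sketch; the runtime path assumes a standard nvidia-container-runtime install):
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}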
I expect this container to run - we have a working version in production without these issues. And I know that Docker can find the drivers, as the output above shows. Any ideas?
Upvotes: 30
Views: 39935
Reputation: 2117
In order for Docker to use the host GPU drivers and GPUs, some steps are necessary:
1. Install the NVIDIA driver on the host
2. Install nvidia-docker2 / the NVIDIA container toolkit
3. Restart the Docker daemon
4. Run the container with the --gpus flag
I guess you have done the first 3 points because nvidia-docker2 is working. So since you don't have a --gpus flag in your run command, this could be the issue.
I usually run my containers with the following command:
docker run --name <container_name> --gpus all -it <image_name>
-it just makes the container interactive and starts a bash shell.
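To double-check that PyTorch can actually see the GPUs once the container is started this way, a quick one-off check along these lines can help (assuming torch and python are available inside the image):
docker run --gpus all --rm <image_name> python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"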
Upvotes: 16
Reputation: 1289
I had a similar issue, but when trying to check whether nvidia-smi worked within the docker container, I discovered that the CUDA version was missing from the output.
So docker run --gpus all image_name nvidia-smi
returned:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02 Driver Version: 470.223.02 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
This was easily fixed using part of Chirag's answer:
docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility image_name nvidia-smi
Similarly, this flag allowed me to run the script I had originally been trying to run:
docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility image_name python my_script.py
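If you'd rather not pass the flag on every run, the same setting can presumably be baked into the image instead, mirroring the ENV lines from the question:
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_VISIBLE_DEVICES all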
Upvotes: 1
Reputation: 21
If you are running your solution on a GPU-powered AWS EC2 machine and are using an EKS-optimized accelerated AMI, as was the case with us, then you are not required to set the runtime to nvidia yourself, as that is the default runtime of the accelerated AMIs. This can be verified by checking the /etc/systemd/system/docker.service.d/nvidia-docker-dropin.conf file.
All that was required was to set these 2 environment variables, as suggested by Chirag in the answer above and in the NVIDIA container toolkit user guide:
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility
or -e NVIDIA_DRIVER_CAPABILITIES=all
-e NVIDIA_VISIBLE_DEVICES=all
Before reaching the final solution, I also tried setting the runtime in daemon.json. To start with, the AMIs we were using did not have a daemon.json file; they instead contain a key.json file. I tried setting the runtime in both files, but restarting Docker always overwrote the changes in key.json or simply deleted the daemon.json file.
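For containers launched through Kubernetes on EKS rather than plain docker run, the same two variables would go into the pod spec's env section; a minimal sketch (container name and image are placeholders):
spec:
  containers:
    - name: my-app
      image: <image_name>
      env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"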
Upvotes: 2
Reputation: 4587
For me, I was running from a vanilla ubuntu base Docker image, i.e.
FROM ubuntu
Changing to an Nvidia-provided Docker base image solved the issue for me:
FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04
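A rough sketch of what the top of such a Dockerfile might look like (the Python/PyTorch install lines are illustrative, not taken from the original project):
FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04
# Illustrative only: install Python and PyTorch on top of the CUDA runtime image
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install torch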
Upvotes: 3
Reputation: 583
I got the same error. After trying a number of solutions, I found that the command below worked:
docker run -ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name>
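If you're on Docker 19.03+ with the NVIDIA container toolkit installed, the equivalent with the newer --gpus flag should presumably be:
docker run -ti --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name>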
Upvotes: 14