Reputation: 581
I am having an interesting and weird issue.
When I start a Docker container with GPUs, it works fine and I see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.
When I run nvidia-smi
inside the container, I see this message:
"Failed to initialize NVML: Unknown Error"
However, on the host machine nvidia-smi shows all the GPUs. Also, when I restart the container, it works fine again and shows all the GPUs.
My inference container should stay up all the time and run inference depending on server requests. Does anyone have the same issue, or a solution for this problem?
Upvotes: 58
Views: 54144
Reputation: 24
In my case, when Docker can no longer run nvidia-smi, I restart the container and that helps.
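For reference, the restart itself is a single command (substitute your own container name; my_gpu_container here is only a placeholder):
# Restart the affected container; GPU access is usually restored afterwards
docker restart my_gpu_container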
Upvotes: -1
Reputation: 879
I selected Method 1 from my findings below, which is:
sudo vim /etc/nvidia-container-runtime/config.toml
Then set no-cgroups = false and save the file (see the excerpt after these steps).
Restart the Docker daemon: sudo systemctl restart docker
Then you can test it by running sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
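For orientation, the relevant part of /etc/nvidia-container-runtime/config.toml should look roughly like this (a minimal excerpt; surrounding keys are omitted and may differ between versions):
[nvidia-container-cli]
# false = let the NVIDIA container CLI set up device cgroups for GPU access;
# this is the value the step above changes.
no-cgroups = false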
Based on
Upvotes: 77
Reputation: 81
I faced the same error without any changes to my container, just after starting it anew. Simply restarting the container again solved the problem.
Moral: before going deeper, try the simplest solution first.
Upvotes: 3
Reputation: 406
There is a workaround that I tried and found to work. Please check this link if you need the full details: https://github.com/NVIDIA/nvidia-docker/issues/1730
I summarize the cause of the problem and elaborate on a solution here for your convenience.
Cause:
The host performs a daemon-reload (or a similar activity). If the container uses systemd to manage cgroups, the daemon-reload "triggers reloading any Unit files that have references to NVIDIA GPUs." Your container then loses access to the reloaded GPU references.
How to check if your problem is caused by the issue:
While your container still has GPU access, open a terminal on the host and run
sudo systemctl daemon-reload
Then go back to your container. If nvidia-smi in the container fails right away, your problem is caused by this issue and you can use the workarounds below.
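A compact way to run this check entirely from the host (my_gpu_container below is just a placeholder for your container's name):
# On the host: trigger the cgroup reload that provokes the symptom
sudo systemctl daemon-reload
# Then check whether the running container just lost GPU access
docker exec my_gpu_container nvidia-smi
# If this now prints "Failed to initialize NVML: Unknown Error",
# the workarounds below apply to your setup.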
Workarounds:
Although I saw in one discussion that NVIDIA planned to release a formal fix in mid-June, as of July 8, 2023, I have not seen it yet. So this should still be useful for you, especially if you just can't update your container stack.
The easiest way is to switch Docker's cgroup driver from systemd to cgroupfs through daemon.json, so containers no longer rely on systemd-managed cgroups. If that change does not hurt your setup, here are the steps. All of them are done on the host system.
sudo nano /etc/docker/daemon.json
Then, within the file, add this parameter setting.
"exec-opts": ["native.cgroupdriver=cgroupfs"]
Do not forget to add a comma before this parameter setting if it follows an existing entry. That is standard JSON syntax, but some may not be familiar with it. Here is an example of the edited file from my machine.
{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
As for the last step, restart the docker service in the host.
sudo service docker restart
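After the restart you can confirm that the daemon picked up the new cgroup driver (a quick check with the standard Docker CLI; it should print cgroupfs if the setting above took effect):
# Print the cgroup driver Docker is currently using
docker info --format '{{.CgroupDriver}}'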
Note: if your container runs its own NVIDIA driver, the steps above will not work, but the reference link has more detail on handling that case. I elaborate only on the simple solution that I expect many people will find useful.
Upvotes: 24
Reputation: 585
Slightly different, but for other people who might stumble upon this.
For me the GPUs were already unavailable right after starting the Docker container with nvidia-docker; nvidia-smi only showed Failed to initialize NVML: Unknown Error.
After some hours of looking for a solution I stumbled upon the similar error Failed to initialize NVML: Driver/library version mismatch, and one suggestion was to simply reboot the host machine. I did that and it now works.
This happened after I upgraded both Ubuntu 20->22 and Docker 19->20, along with the NVIDIA drivers to 525.116.04.
Upvotes: 0
Reputation: 11
I had the same weird issue. According to your description, it is most likely related to this issue on the official nvidia-docker repo:
https://github.com/NVIDIA/nvidia-docker/issues/1618
I plan to try the solution mentioned in the related thread, which suggests upgrading the cgroup version on the host machine from v1 to v2.
PS: We have verified this solution in our production environment and it really works! Unfortunately, it needs at least Linux kernel 4.5. If upgrading the kernel is not possible, the method mentioned by sih4sing5hog5 could also serve as a workaround.
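For reference, here is a minimal sketch of how to check which cgroup version the host is on and how to switch a systemd-based distro to cgroup v2 (the GRUB path and update-grub command assume a Debian/Ubuntu-style setup):
# Check the current cgroup version: "cgroup2fs" means v2, "tmpfs" means v1
stat -fc %T /sys/fs/cgroup/
# To switch to cgroup v2, add the kernel parameter
#   systemd.unified_cgroup_hierarchy=1
# to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate GRUB and reboot:
sudo update-grub
sudo reboot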
Upvotes: 1
Reputation: 320
I had the same error. As a temporary solution I used Docker's health check: when nvidia-smi fails, the container is marked unhealthy and restarted by willfarrell/autoheal.
Docker-compose Version:
services:
  gpu_container:
    ...
    healthcheck:
      test: ["CMD-SHELL", "test -s `which nvidia-smi` && nvidia-smi || exit 1"]
      start_period: 1s
      interval: 20s
      timeout: 5s
      retries: 2
    labels:
      - autoheal=true
      - autoheal.stop.timeout=1
    restart: always

  autoheal:
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always
Dockerfile Version:
# Labels for autoheal go in LABEL instructions (HEALTHCHECK has no --label option)
LABEL autoheal=true \
      autoheal.stop.timeout=1
HEALTHCHECK --start-period=60s \
            --interval=20s \
            --timeout=10s \
            --retries=2 \
            CMD nvidia-smi || exit 1
with autoheal daemon:
docker run -d \
--name autoheal \
--restart=always \
-e AUTOHEAL_CONTAINER_LABEL=all \
-v /var/run/docker.sock:/var/run/docker.sock \
willfarrell/autoheal
Upvotes: 6
Reputation: 1
I had the same issue. I just ran screen watch -n 1 nvidia-smi
in the container, and now it works continuously.
Upvotes: -3