Justin Song

Reputation: 581

Failed to initialize NVML: Unknown Error in Docker after a few hours

I am having an interesting and weird issue.

When I start a Docker container with GPU support, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.

When I run nvidia-smi inside the container, I see this message:

"Failed to initialize NVML: Unknown Error"

However, on the host machine, I can see all the GPUs with nvidia-smi. Also, when I restart the container, it works fine again and shows all the GPUs.

My inference container should stay up all the time and run inference depending on server requests. Does anyone have the same issue, or a solution for this problem?

Upvotes: 58

Views: 54144

Answers (8)

wangzijian

Reputation: 24

In my experience, when the container can no longer run nvidia-smi, restarting the container helps.
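For reference, a restart is just the following (the container name here is hypothetical; use your own container's name or ID):

docker restart my_gpu_container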

Upvotes: -1

user2256593

Reputation: 879

I went with Method 1 from my findings below, which is:

  1. Open the config with sudo vim /etc/nvidia-container-runtime/config.toml, change the setting to no-cgroups = false, and save (see the sketch after this list).

  2. Restart the Docker daemon with sudo systemctl restart docker, then test by running sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi.
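For reference, after the edit the relevant part of the config should look roughly like this (a sketch only; the other keys in the file depend on your nvidia-container-toolkit version):

[nvidia-container-cli]
# ... other keys left as shipped ...
no-cgroups = false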

Based on

  1. https://bobcares.com/blog/docker-failed-to-initialize-nvml-unknown-error/
  2. https://bbs.archlinux.org/viewtopic.php?id=266915

Upvotes: 77

Kirill Zaitsev

Reputation: 81

I faced the same error without any changes to my container, just after starting it anew. Simply restarting the container again solved the problem.

Moral: before going deeper, try the simplest solution first.

Upvotes: 3

pinyotae

Reputation: 406

There is a workaround that I tried and found to work. Please check this link in case you need the full details: https://github.com/NVIDIA/nvidia-docker/issues/1730

I summarize the cause of the problem and elaborate on a solution here for your convenience.

Cause:
The host performs daemon-reload (or a similar activity). If the container uses systemd to manage cgroups, daemon-reload "triggers reloading any Unit files that have references to NVIDIA GPUs." Then, your container loses access to the reloaded GPU references.

How to check if your problem is caused by the issue:
When your container still has GPU access, open a "host" terminal and run

sudo systemctl daemon-reload

Then, go back to your container. If nvidia-smi in the container shows the error right away, you can proceed with the workaround below.

Workarounds:
Although I saw in one discussion that NVIDIA planned to release a formal fix in mid-June, as of July 8, 2023, I have not seen it yet. So, this should still be useful for you, especially if you just can't update your container stack.

The easiest way is to disable cgroups for your containers through Docker's daemon.json. If disabling cgroups does not hurt your setup, here are the steps. Everything is done on the host system.

sudo nano /etc/docker/daemon.json 

Then, within the file, add this parameter setting.

"exec-opts": ["native.cgroupdriver=cgroupfs"] 

Do not forget to add a comma before this parameter setting, since JSON entries must be comma-separated. This is an example of the edited file from my machine.

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

As the last step, restart the Docker service on the host.

sudo service docker restart
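If you want to confirm that the new cgroup driver is in effect after the restart, a quick check (assuming a reasonably recent Docker CLI) is:

docker info --format '{{.CgroupDriver}}'
# expected output: cgroupfs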

Note: if your container runs its own NVIDIA driver, the above steps will not work, but the reference link has more detail on dealing with that case. I elaborate only on a simple solution that I expect many people will find useful.

Upvotes: 24

phi

Reputation: 585

Slightly different, but for other people that might stumble upon this.

For me, the GPUs were not available right after starting the Docker container with nvidia-docker; nvidia-smi only showed Failed to initialize NVML: Unknown Error.

After some hours of looking for a solution, I stumbled upon the similar error Failed to initialize NVML: Driver/library version mismatch, and one suggestion there was to simply reboot the host machine. I did that, and it now works.

This happened after I upgraded both Ubuntu 20 -> 22 and Docker 19 -> 20, along with the NVIDIA drivers (525.116.04).

Upvotes: 0

nalsas

Reputation: 11

I had the same weird issue. According to your description, it is most likely related to this issue on the official nvidia-docker repo:

https://github.com/NVIDIA/nvidia-docker/issues/1618

I plan to try the solution mentioned in the related thread, which suggests upgrading the cgroup version on the host machine from v1 to v2.

PS: We have verified this solution in our production environment, and it really works! Unfortunately, it requires at least Linux kernel 4.5. If upgrading the kernel is not possible, the method mentioned by sih4sing5hog5 could also serve as a workaround.
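A rough sketch of how to check the current cgroup version and switch to v2 (assuming a systemd-based distro that boots with GRUB; the exact steps may differ on your system):

# cgroup2fs means the host is already on cgroup v2; tmpfs means v1
stat -fc %T /sys/fs/cgroup/

# To switch, add the kernel parameter systemd.unified_cgroup_hierarchy=1
# to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate and reboot:
sudo update-grub
sudo reboot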

Upvotes: 1

sih4sing5hog5

Reputation: 320

I had the same error. I used Docker's health check as a temporary workaround: when nvidia-smi fails, the container is marked unhealthy and restarted by willfarrell/autoheal.

Docker-compose Version:

services:
  gpu_container:
    ...
    healthcheck:
      test: ["CMD-SHELL", "test -s `which nvidia-smi` && nvidia-smi || exit 1"]
      start_period: 1s
      interval: 20s
      timeout: 5s
      retries: 2
    labels:
      - autoheal=true
      - autoheal.stop.timeout=1
    restart: always
  autoheal:
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

Dockerfile Version:

# labels read by the autoheal watcher
LABEL autoheal=true
LABEL autoheal.stop.timeout=1

HEALTHCHECK \
    --start-period=60s \
    --interval=20s \
    --timeout=10s \
    --retries=2 \
    CMD nvidia-smi || exit 1

with autoheal daemon:

docker run -d \
    --name autoheal \
    --restart=always \
    -e AUTOHEAL_CONTAINER_LABEL=all \
    -v /var/run/docker.sock:/var/run/docker.sock \
    willfarrell/autoheal

Upvotes: 6

Sandro

Reputation: 1

I had the same issue. I just ran screen watch -n 1 nvidia-smi in the container, and now it works continuously.

Upvotes: -3
