J J

Reputation: 478

Unable to Run NVIDIA GPU-Enabled Docker Containers Inside an LXC Container

Question:

I am facing an issue when trying to run Docker containers that require GPU access within an LXC container. Standard Docker containers run fine, but when I try to use the NVIDIA GPU by adding --gpus=all or --runtime=nvidia, the container fails to start.

The error message I receive is:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.

Environment:

LXC Config:

# Allow cgroup access
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 235:* rwm
lxc.cgroup2.devices.allow: c 511:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 239:* rwm
lxc.cgroup2.devices.allow: c 243:* rwm

# Pass through device files
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
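
For context, the major numbers in the devices.allow rules above correspond to the device nodes on the host (195 = nvidia, 226 = dri; the nvidia-uvm majors are not fixed and can change across reboots). They can be cross-checked like this:

$ # On the Proxmox host: the two numbers before the date are major, minor
$ ls -l /dev/nvidia* /dev/dri/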

What I've Tried:

I am looking for any guidance on how to debug this issue and successfully run GPU-enabled Docker containers within an LXC container.

Upvotes: 3

Views: 1727

Answers (1)

datu-puti

Reputation: 1363

This is the process I used to get nvidia-smi working in Docker inside an LXC container on Proxmox:

  1. (Baseline) Does nvidia-smi work on the host and in the LXC container? If not, I wrote up the process a few years ago (it's an older post, but it should still get you there); you can read it here. A quick sanity check is sketched below.
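
Run the same check in both places before touching Docker at all (the exact output will vary with your GPU and driver version):

$ # On the Proxmox host, then again inside the LXC container:
$ nvidia-smi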

Within an Ubuntu 22.04 LXC container (not tested on other distros). I'll copy the commands here, but consider going back to the source documentation to troubleshoot.

  1. Install Docker Engine per the Docker documentation.
$ for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
$ # Add Docker's official GPG key:
$ sudo apt-get update
$ sudo apt-get install ca-certificates curl
$ sudo install -m 0755 -d /etc/apt/keyrings
$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
$ sudo chmod a+r /etc/apt/keyrings/docker.asc

$ # Add the repository to Apt sources:
$ echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
$ sudo docker run hello-world
  2. Install the NVIDIA Container Toolkit per the NVIDIA documentation.
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
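For reference, nvidia-ctk runtime configure --runtime=docker registers the runtime in /etc/docker/daemon.json. Afterwards the file should contain an nvidia runtime entry roughly like this (a sketch; key order and any extra settings may differ):

$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}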
  3. Modify the NVIDIA config file, per this comment (the exact edit is sketched after the quote).

Not sure if this helps, but I got stuck forever trying to get NVIDIA Docker to run inside an unprivileged LXC; the fix for me was to set no-cgroups = true in the NVIDIA container runtime config file, /etc/nvidia-container-runtime/config.toml.
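Concretely, the setting lives under the [nvidia-container-cli] section of that file. Assuming the stock file still ships the commented-out default #no-cgroups = false (check yours first before running the sed), the edit plus the required Docker restart looks like:

$ sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
$ sudo systemctl restart docker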

  4. Test with this command:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If everything is wired up correctly, you should see the usual nvidia-smi table from inside the nested container. Hope this works for you; this setup can be quite brittle, though it has gotten better over time.

Upvotes: 1
