Reputation: 11
I'm facing an issue when trying to run an NVIDIA GPU-supported Docker container on my system. Despite successful detection of the NVIDIA drivers and GPUs via nvidia-smi, attempting to run a Docker container with the command docker run --rm --gpus all ubuntu:18.04 nvidia-smi results in the following error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
Here's the output of nvidia-smi, showing that the NVIDIA drivers and GPUs are correctly detected and operational:
$ nvidia-smi
Thu Feb 22 02:39:45 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:18:00.0 Off |                    0 |
| 30%   37C    P8    14W / 230W | 11671MiB / 23028MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:86:00.0 Off |                    0 |
| 55%   80C    P2   211W / 230W | 13119MiB / 23028MiB  |     79%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
To troubleshoot, I ran nvidia-container-cli -k -d /dev/tty info, which confirmed that the NVIDIA libraries, including libnvidia-ml.so.525.85.12, are detected. However, the Docker error persists, suggesting an issue with locating libnvidia-ml.so.1.
So far, I've attempted:
- Reinstalling the NVIDIA drivers and CUDA Toolkit.
- Reinstalling the NVIDIA Container Toolkit.
- Ensuring Docker and the NVIDIA Container Toolkit are correctly configured.
- Setting LD_LIBRARY_PATH to include the path to the NVIDIA libraries.
Despite these efforts, the problem remains unresolved. I'm operating on a Linux system with NVIDIA driver version 525.85.12.
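In case it helps narrow things down, here is a minimal sketch of how I checked whether the dynamic linker can even see the library named in the error (standard commands, nothing specific to my setup):
# Check whether ldconfig can resolve the library named in the error
ldconfig -p | grep libnvidia-ml
# Locate the actual driver library files on disk
find /usr -name 'libnvidia-ml.so*' 2>/dev/null
# Refresh the linker cache in case the driver libraries were (re)installed recently
sudo ldconfig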
Has anyone experienced a similar issue or can offer insights into what might be causing this error and how to resolve it? I would greatly appreciate any suggestions or guidance.
What I expected:
- Successful container initialization: the container starting with NVIDIA GPU support, so GPU resources are usable inside it.
- Resolution of the library detection issue: the steps above resolving the problem locating libnvidia-ml.so.1, so that Docker and the NVIDIA Container Toolkit can access the necessary NVIDIA libraries.
- Operational GPU support in Docker: GPU-accelerated applications running inside containers as intended.
The discrepancy between the expected and actual results, namely the persistent error about libnvidia-ml.so.1 despite confirmed detection of the NVIDIA drivers and libraries, suggests an underlying issue with the Docker and NVIDIA integration setup, the library paths, or the specific versions of the tools and drivers involved.
Upvotes: 0
Views: 6564
Reputation: 61
What distribution is your host using? The NVIDIA driver installed by Ubuntu's ubuntu-drivers install tool can also cause this problem. To resolve it, you may need to reinstall the driver. First, uninstall the existing driver (these commands are for Ubuntu; for other distributions, check here):
sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
"*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
sudo apt-get autoremove
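Optionally, you can double-check that the purge actually removed everything by querying the package database (a quick sanity check, not part of NVIDIA's official steps):
# Any remaining NVIDIA-related packages should show up here
dpkg -l | grep -i nvidia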
After that, it is highly recommended to reinstall the driver with the apt package manager instead. Below are the instructions (still for Ubuntu 22.04; check here for other platforms):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# To install the legacy kernel module flavor
sudo apt-get install -y cuda-drivers
# To install the open kernel module flavor of specific version
# sudo apt-get install -y nvidia-driver-550-open
Note that the NVIDIA Container Toolkit has also been uninstalled by the apt-get --purge commands above. You can follow these steps to reinstall it; a rough sketch is included below.
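For reference, this is roughly what that reinstall looks like on apt-based systems per NVIDIA's instructions at the time of writing; follow the linked steps for the current commands:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the toolkit as a Docker runtime and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker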
It is also better to switch to the HWE kernel for your server:
sudo apt-get install --install-recommends linux-generic-hwe-22.04
The driver will also install X11 components for you by default. If a desktop is not needed, you can install the headless version of the driver instead:
sudo apt-get install nvidia-headless-550
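After the reinstall, re-running the original test should confirm that the container can load libnvidia-ml.so.1 again (a newer base image is used here since ubuntu:18.04 is past its standard support window; any image works for this check):
docker run --rm --gpus all ubuntu:22.04 nvidia-smi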
Upvotes: 2