wl2776

Reputation: 4327

Multistage docker build: stat reports that NVIDIA file does not exist while it does

I'm trying to merge two docker images.

Here is my Dockerfile:

FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop

COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda

COPY --from=cuda10 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libcuda.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/

The build fails:

$ docker build . -t nvidia-ros:osrf
Step 5/7 : COPY --from=cuda10 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 /usr/lib/x86_64-linux-gnu/libcuda.so.410.129 /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 /usr/lib/x86_64-linux-gnu/
COPY failed: stat usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03: file does not exist

However, these files do exist when I start a container from the image:

$ docker run -it --rm --gpus all nvidia/cuda:10.0-devel-ubuntu18.04
root@fc9c1d8ccdc2:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root       37 Jan 30 14:13 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.460.32.03
-rw-r--r-- 1 root root 12129448 Aug 20  2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rw-r--r-- 1 root root 10516984 Dec 27 18:55 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03

Upvotes: 2

Views: 1090

Answers (1)

anemyte

Reputation: 20306

TL;DR: This file is mounted into the container by the NVIDIA runtime (see the NVIDIA container runtime docs), so it is not present at build time. For the runtime to mount the driver libraries, a few environment variables must be set, either in the image or when the container starts. The Dockerfile at the end shows an example.

To investigate this I ran this command first:

docker run --rm --entrypoint="" -it nvidia/cuda:10.0-devel-ubuntu18.04 \
stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03

And got the same error:

stat: cannot stat '/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03': No such file or directory

So I went into the directory and looked with ls:

root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libnvidia-ptxjitcompiler.so
ls: cannot access 'libnvidia-ptxjitcompiler.so': No such file or directory

root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libn
libnccl.so         libnccl_static.a   libnpth.so.0       libnsl.so          libnss_files.so    libnss_nisplus.so  
libnccl.so.2       libnettle.so.6     libnpth.so.0.1.1   libnss_compat.so   libnss_hesiod.so   
libnccl.so.2.6.4   libnettle.so.6.4   libnsl.a           libnss_dns.so      libnss_nis.so      

The file was missing.

Then I started the container the way you did, with the NVIDIA runtime enabled:

docker run -it --rm --runtime nvidia nvidia/cuda:10.0-devel-ubuntu18.04

root@4a1602f3d5c0:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root       34 Jan 30 14:48 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.450.66
-rw-r--r-- 1 root root 12129448 Aug 20  2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rwxr-xr-x 1 root root  9947144 Sep 28 10:57 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66

The files were there, but the version was different and it matched my NVIDIA driver version:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
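The version match suggests the library comes from the host rather than from an image layer. One rough way to check this from inside the container is to look the path up in the mount table. The sketch below assumes the runtime's bind mounts are listed in /proc/mounts, which may not hold for every runtime version; `is_mounted_file` is a name made up for this example.

```shell
# Succeeds (exit 0) if the given path is listed as a mount point.
# The second argument is optional and defaults to /proc/mounts;
# passing another file makes the function easy to try out by hand.
is_mounted_file() {
  path=$1
  mounts=${2:-/proc/mounts}
  # Field 2 of each mount table line is the mount point.
  awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$mounts"
}

# Example (inside a container started with the NVIDIA runtime):
#   is_mounted_file /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66 \
#     && echo "mounted at runtime" || echo "not a mount point"
```

A file that is part of the image, such as libnvidia-ptxjitcompiler.so.410.129 in the listing above, would not be expected to show up as a mount point.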

So it appeared to me that this file exists only when the container is started with the NVIDIA runtime. A search confirmed it: the documentation states that the container must run with several environment variables set for the driver libraries to be mounted. So I ran the env command in an official NVIDIA container and copied every variable with the NVIDIA_ prefix into the Dockerfile:

FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop

COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda

ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
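# The value contains spaces, so it must be quoted; unquoted, ENV would parse each brand=... as a separate variable.
ENV NVIDIA_REQUIRE_CUDA="cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411"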
ENV NVIDIA_VISIBLE_DEVICES=all
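Copying the variables by hand is easy to get wrong, so here is a small sketch that turns env output into ENV lines. It wraps every value in double quotes because a value such as the one for NVIDIA_REQUIRE_CUDA contains spaces. `nvidia_env_to_dockerfile` is a hypothetical helper name, and the sketch assumes the values contain no double quotes or dollar signs.

```shell
# Read `env` output on stdin and print one Dockerfile ENV line
# for every variable whose name starts with NVIDIA_.
nvidia_env_to_dockerfile() {
  grep '^NVIDIA_' | sort | while IFS= read -r line; do
    name=${line%%=*}   # text before the first '='
    value=${line#*=}   # text after the first '=' (may itself contain '=')
    printf 'ENV %s="%s"\n' "$name" "$value"
  done
}

# Example:
#   docker run --rm --entrypoint="" nvidia/cuda:10.0-devel-ubuntu18.04 env \
#     | nvidia_env_to_dockerfile
```

The output can be pasted into the Dockerfile after reviewing which variables are actually needed.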

Running the new image with the NVIDIA runtime, I found the files mounted:

docker run --runtime nvidia --rm -it afae756457a9

root@7ebdef701231:/# stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
  File: /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
  Size: 9947144         Blocks: 19432      IO Block: 4096   regular file
Device: 801h/2049d      Inode: 131438      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-30 14:48:05.765015216 +0000
Modify: 2020-09-28 10:57:18.067125173 +0000
Change: 2020-09-28 10:57:18.067125173 +0000
 Birth: -
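As the TL;DR notes, the variables can also be supplied when the container starts instead of being baked into the image. A minimal sketch of that variant follows, assuming the same two variables are enough for your setup; `run_with_nvidia` and the `DOCKER` override are names introduced only for this example.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to preview
# the command line without starting a container.
DOCKER=${DOCKER:-docker}

# Start a container with the NVIDIA runtime and pass the variables
# with -e. All arguments (image name, then an optional command)
# are forwarded unchanged.
run_with_nvidia() {
  "$DOCKER" run --rm --runtime nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    "$@"
}

# Example:
#   run_with_nvidia osrf/ros:foxy-desktop ls /usr/lib/x86_64-linux-gnu
```

Setting the variables in the image is more convenient when every container built from it needs the GPU; passing them with -e keeps the image itself unchanged.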

Upvotes: 1
