How to conda install CUDA enabled PyTorch in a Docker container?

I am trying to build a Docker container on a server, inside which a conda environment is created. Everything else installs fine except CUDA-enabled PyTorch (PyTorch without CUDA works with no problem). How do I make sure PyTorch is using CUDA?

This is the Dockerfile:

# Use nvidia/cuda image
FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04

# set bash as current shell
RUN chsh -s /bin/bash

# install anaconda
RUN apt-get update
RUN apt-get install -y wget bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 git mercurial subversion && \
        apt-get clean
RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -O ~/anaconda.sh && \
        /bin/bash ~/anaconda.sh -b -p /opt/conda && \
        rm ~/anaconda.sh && \
        ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
        echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
        find /opt/conda/ -follow -type f -name '*.a' -delete && \
        find /opt/conda/ -follow -type f -name '*.js.map' -delete && \
        /opt/conda/bin/conda clean -afy

# set path to conda
ENV PATH /opt/conda/bin:$PATH

# setup conda virtual environment
COPY ./requirements.yaml /tmp/requirements.yaml
RUN conda update conda \
    && conda env create --name camera-seg -f /tmp/requirements.yaml \
    && conda install -y -c conda-forge -n camera-seg flake8

# From the pythonspeed tutorial; Make RUN commands use the new environment
SHELL ["conda", "run", "-n", "camera-seg", "/bin/bash", "-c"]

# PyTorch with CUDA 10.2
RUN conda activate camera-seg && conda install pytorch torchvision cudatoolkit=10.2 -c pytorch

RUN echo "conda activate camera-seg" > ~/.bashrc
ENV PATH /opt/conda/envs/camera-seg/bin:$PATH

This gives me the following error when I try to build the container (docker build -t camera-seg .):


Step 10/12 : RUN conda activate camera-seg && conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
 ---> Running in e0dd3e648f7b
ERROR conda.cli.main_run:execute(34): Subprocess for 'conda run ['/bin/bash', '-c', 'conda activate camera-seg && conda install pytorch torchvision cudatoolkit=10.2 -c pytorch']' command failed.  (See above for error)

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

The command 'conda run -n camera-seg /bin/bash -c conda activate camera-seg && conda install pytorch torchvision cudatoolkit=10.2 -c pytorch' returned a non-zero code: 1

This is the requirements.yaml:

name: camera-seg
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.6
  - numpy
  - pillow
  - yaml
  - pyyaml
  - matplotlib
  - jupyter
  - notebook
  - tensorboardx
  - tensorboard
  - protobuf
  - tqdm

When I put pytorch, torchvision and cudatoolkit=10.2 in the requirements.yaml instead, PyTorch installs successfully but cannot detect CUDA (torch.cuda.is_available() returns False).
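A False from torch.cuda.is_available() usually has one of two causes: the installed build is CPU-only, or the build has CUDA but no GPU is visible (for example, a container started without GPU access). A minimal diagnostic sketch that separates the two; the helper name `cuda_report` is made up for illustration, and torch is imported lazily so the script still runs where PyTorch is absent:

```python
import importlib.util


def cuda_report() -> dict:
    """Summarise what the installed PyTorch build says about CUDA.

    Returns a dict instead of raising, so it also works when torch is missing.
    """
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch  # imported here so the check runs even without torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None for a CPU-only build, otherwise the CUDA version it was built with
        "built_with_cuda": torch.version.cuda,
        # False if the build is CPU-only OR no GPU is visible to this process
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is None, the package itself is the problem; if it is set but `cuda_available` is False, look at the driver and at how the container is started.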

I have tried various solutions suggested in other posts, and several combinations of them, but all to no avail.

Any help is much appreciated. Thanks.

Answers (3)

Check that the NVIDIA drivers are installed on the host:

nvidia-smi -a
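The same check can be scripted, e.g. as a pre-flight step. A minimal sketch, assuming `nvidia-smi` is on PATH when the driver is installed; `list_gpus` is a hypothetical helper, and it uses the real `--query-gpu` / `--format=csv,noheader` options of nvidia-smi:

```python
import shutil
import subprocess
from typing import List


def list_gpus() -> List[str]:
    """Return one 'name, driver_version' line per GPU, or [] if none found.

    An empty list means nvidia-smi is missing or failed, i.e. the NVIDIA
    driver is probably not installed (or not visible inside this container).
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    print(gpus if gpus else "No NVIDIA GPU/driver detected")
```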

If that does not work, install the driver:

sudo apt-get install linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-drivers

Note: reboot the machine after this installation.

Install the NVIDIA container runtime on your machine:

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
# this repo expects the dotted release name (e.g. ubuntu22.04), unlike the CUDA repo above
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# writing to /etc/docker needs root, so go through sudo tee rather than a plain > redirect
echo -e '{\n    "default-runtime": "nvidia",\n    "runtimes": {\n        "nvidia": {\n            "path": "/usr/bin/nvidia-container-runtime",\n            "runtimeArgs": []\n        }\n    }\n}' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
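Writing /etc/docker/daemon.json from scratch, as the echo command above does, discards any settings already in that file. A minimal sketch of merging the same NVIDIA runtime entry into an existing config instead; the helper names are hypothetical, and the runtime path is the one used above:

```python
import json
from pathlib import Path


def with_nvidia_runtime(config: dict) -> dict:
    """Return a copy of a Docker daemon config with the NVIDIA runtime added.

    Existing keys (and other runtimes) are kept; the input is not modified.
    """
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["nvidia"] = {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": [],
    }
    merged["runtimes"] = runtimes
    merged["default-runtime"] = "nvidia"
    return merged


def update_daemon_json(path: Path = Path("/etc/docker/daemon.json")) -> str:
    """Merge the NVIDIA runtime into `path` and return the new JSON text.

    Needs write access to `path` (run as root); restart Docker afterwards.
    """
    current = json.loads(path.read_text()) if path.exists() else {}
    text = json.dumps(with_nvidia_runtime(current), indent=4) + "\n"
    path.write_text(text)
    return text
```

Run `update_daemon_json()` as root, then `sudo systemctl restart docker` as above.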

Verify that Docker now lists the nvidia runtime:

sudo docker info | grep Runtime
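The grep above matches two lines of `docker info`: the list of runtimes and the default runtime. A minimal sketch that parses those lines, assuming the usual text layout (`Runtimes: ...` and `Default Runtime: ...`); `parse_runtimes` is a hypothetical helper:

```python
import shutil
import subprocess
from typing import Dict, List


def parse_runtimes(docker_info: str) -> Dict[str, object]:
    """Extract the runtime list and default runtime from `docker info` text."""
    runtimes: List[str] = []
    default = None
    for line in docker_info.splitlines():
        line = line.strip()
        if line.startswith("Runtimes:"):
            runtimes = line.split(":", 1)[1].split()
        elif line.startswith("Default Runtime:"):
            default = line.split(":", 1)[1].strip()
    return {"runtimes": runtimes, "default": default, "nvidia_ok": "nvidia" in runtimes}


if __name__ == "__main__":
    # Only query Docker if the CLI exists; may need sudo or docker-group access
    if shutil.which("docker"):
        out = subprocess.run(["docker", "info"], capture_output=True, text=True).stdout
        print(parse_runtimes(out))
```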

The Dockerfile:

FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
RUN apt-get update
RUN apt-get install -y locales
RUN localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
# ENV persists for later layers and at runtime; a RUN export would not
ENV LANG=en_US.utf8

#           install dependencies                 #
RUN apt-get update && apt-get install -y wget git build-essential nano unzip curl 
RUN apt-get update && apt-get install -y g++

#           install conda                        #
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
RUN bash miniconda.sh -b -p /miniconda
RUN rm miniconda.sh
ENV PATH="/miniconda/bin:${PATH}"
RUN echo 'export PATH="/miniconda/bin:${PATH}"' >> ~/.bashrc
RUN /miniconda/bin/conda config --set auto_activate_base true
RUN /miniconda/bin/conda init

#           install PYTORCH                      #
RUN /miniconda/bin/conda  install -y pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

#           install code-server                  #
RUN curl -fsSL https://code-server.dev/install.sh | sh

#                  CMD                           #
# --bind-addr takes a host:port value; publish it with docker run -p 8080:8080
CMD code-server --bind-addr 0.0.0.0:8080 --auth none /home

I managed to set it up with the following Dockerfile:

FROM nvidia/cuda:11.3.1-devel-ubuntu20.04
ENV TZ=Europe/Brussels

RUN apt-get update --fix-missing && DEBIAN_FRONTEND=noninteractive apt-get install --assume-yes --no-install-recommends \
   build-essential \
   python3 \
   python3-dev \
   python3-pip

RUN python3 -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

I made sure the CUDA version matches the one installed on the machine where the container will run.
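With pip wheels such as the `cu116` ones, the CUDA runtime libraries ship inside the wheel, so what the host must provide is a driver new enough for that CUDA version. `nvidia-smi` prints the highest CUDA version the driver supports in its header (`CUDA Version: X.Y`). A minimal sketch comparing the two; the helper names are hypothetical, and the check is deliberately conservative, since CUDA's minor-version compatibility can let a wheel run on a somewhat older driver:

```python
import re
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '11.6' into (11, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_cuda_version(nvidia_smi_output: str) -> Optional[str]:
    """Pull the 'CUDA Version: X.Y' field from the nvidia-smi header, if any."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", nvidia_smi_output)
    return match.group(1) if match else None


def driver_supports(wheel_cuda: str, driver_cuda: str) -> bool:
    """Conservative check: the driver's CUDA version must be >= the wheel's."""
    return parse_version(driver_cuda) >= parse_version(wheel_cuda)
```

For example, `driver_supports("11.6", driver_cuda_version(header))` with `header` being the first lines of `nvidia-smi` output on the host.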

Then I built and ran the container as follows:

$ docker build . -t docker-example:latest
$ docker run --gpus all --interactive --tty docker-example:latest

Inside the container, in a Python shell, torch.cuda.is_available() then returns True.
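is_available() returning True only says a CUDA device is visible; a slightly stronger check is to actually run a kernel. A minimal sketch (the helper name `gpu_smoke_test` is made up for illustration) that multiplies two small tensors on the first GPU and copies the result back to the host, returning a message instead of raising so it can sit in an entrypoint or CI step:

```python
import importlib.util


def gpu_smoke_test() -> str:
    """Run a tiny matrix multiply on the GPU and report what happened."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    if not torch.cuda.is_available():
        return "CUDA not available (CPU-only build, or container started without GPUs?)"
    x = torch.ones(2, 2, device="cuda")
    y = (x @ x).cpu()  # copying back to the host waits for the kernel to finish
    return f"OK on {torch.cuda.get_device_name(0)}: sum={y.sum().item()}"


if __name__ == "__main__":
    print(gpu_smoke_test())
```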

Upvotes: 5

I got it working after many, many tries. Posting the answer here in case it helps anyone.

Basically, I installed pytorch and torchvision through pip (from within the conda environment) and the rest of the dependencies through conda as usual.

This is how the final Dockerfile looks:

# Use nvidia/cuda image
FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04

# set bash as current shell
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]

# install anaconda
RUN apt-get update
RUN apt-get install -y wget bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 git mercurial subversion && \
        apt-get clean
RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -O ~/anaconda.sh && \
        /bin/bash ~/anaconda.sh -b -p /opt/conda && \
        rm ~/anaconda.sh && \
        ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
        echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
        find /opt/conda/ -follow -type f -name '*.a' -delete && \
        find /opt/conda/ -follow -type f -name '*.js.map' -delete && \
        /opt/conda/bin/conda clean -afy

# set path to conda
ENV PATH /opt/conda/bin:$PATH

# setup conda virtual environment
COPY ./requirements.yaml /tmp/requirements.yaml
RUN conda update conda \
    && conda env create --name camera-seg -f /tmp/requirements.yaml

RUN echo "conda activate camera-seg" >> ~/.bashrc
ENV PATH /opt/conda/envs/camera-seg/bin:$PATH
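For comparison with the question's Dockerfile: the CommandNotFoundError there came from calling `conda activate` inside a non-interactive RUN step, where the shell hooks that `conda init` sets up are not loaded. This Dockerfile sidesteps it by dropping that step. If an extra package still has to be installed after `conda env create`, one option is to address the environment by name with `-n` instead of activating it, e.g. `RUN conda install -y -n camera-seg -c pytorch pytorch torchvision cudatoolkit=10.2`. The sketch below (a hypothetical helper, `conda_install_argv`) only assembles such a command; it does not run conda:

```python
import shlex
from typing import List, Sequence


def conda_install_argv(env: str, packages: Sequence[str],
                       channels: Sequence[str] = ()) -> List[str]:
    """Build a `conda install` command that targets an environment by name.

    Using `-n <env>` avoids `conda activate`, which needs shell hooks set up
    by `conda init` and therefore fails in non-interactive Docker RUN steps.
    """
    argv = ["conda", "install", "-y", "-n", env]
    for channel in channels:
        argv += ["-c", channel]
    argv += list(packages)
    return argv


if __name__ == "__main__":
    cmd = conda_install_argv("camera-seg",
                             ["pytorch", "torchvision", "cudatoolkit=10.2"],
                             channels=["pytorch"])
    # Shell-quoted line that could follow RUN in a Dockerfile (Python 3.8+)
    print(shlex.join(cmd))
```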

And this is what the requirements.yaml looks like:

name: camera-seg
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.6
  - pip
  - numpy
  - pillow
  - yaml
  - pyyaml
  - matplotlib
  - jupyter
  - notebook
  - tensorboardx
  - tensorboard
  - protobuf
  - tqdm
  - pip:
    - torch
    - torchvision

Then I built the container with docker build -t camera-seg . and PyTorch now recognizes CUDA.
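The build step alone does not prove CUDA works, because GPUs are normally not visible during docker build; the check belongs at docker run time with GPUs passed in, e.g. `--gpus all` (which needs the NVIDIA Container Toolkit on the host), unless nvidia is configured as Docker's default runtime. A minimal sketch that assembles such a one-shot check; `docker_gpu_check_argv` is a hypothetical helper, and it assumes `python` resolves to the camera-seg environment, as the `ENV PATH` line above arranges:

```python
import shlex
from typing import List

# One-liner executed inside the container
CHECK = "import torch; print(torch.cuda.is_available())"


def docker_gpu_check_argv(image: str) -> List[str]:
    """Build a `docker run` command that checks CUDA inside `image`."""
    return ["docker", "run", "--rm", "--gpus", "all", image, "python", "-c", CHECK]


if __name__ == "__main__":
    print(shlex.join(docker_gpu_check_argv("camera-seg")))
```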

Upvotes: 19

