pherath

Reputation: 93

Setup Tensorflow 2.4 on Ubuntu 20.04 with GPU without sudo

I have access to a virtual machine running Ubuntu 20.04 with GPUs. The sysadmins have already installed the latest CUDA drivers, but that is not enough to use the GPUs from TensorFlow, because each TF version requires a specific combination of CUDA Toolkit and cuDNN versions. I don't have sudo rights, so I need to install everything locally.

nvidia-smi

returns Driver Version: 465.19.01 CUDA Version: 11.3

python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU');"

returns

2021-05-11 10:56:26.737279: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:26.737338: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-05-11 10:56:28.313896: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-11 10:56:28.315540: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-11 10:56:28.324232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.324707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:05.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.324867: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.325293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:00:06.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.325438: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325563: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325706: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325820: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325931: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326028: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326117: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326215: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326230: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

which shows that TensorFlow skips registering the GPU devices, so the GPUs won't be used in the TF application.
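To see at a glance which libraries are missing, the warnings can be filtered out of the log. This is only a minimal sketch, assuming the "Could not load dynamic library '...'" wording shown in the log above; the function name missing_libraries is my own and not part of TensorFlow.

```python
import re

# Assumption: warnings follow the dso_loader wording shown in the log above.
_MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names TensorFlow failed to load, in first-seen order."""
    seen = []
    for name in _MISSING.findall(log_text):
        if name not in seen:  # the same library can be reported more than once
            seen.append(name)
    return seen


if __name__ == "__main__":
    # Two warning lines copied from the log above, plus one success line.
    sample = (
        "2021-05-11 10:56:28.315540: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] "
        "Successfully opened dynamic library libcuda.so.1\n"
        "2021-05-11 10:56:28.325438: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
        "Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: "
        "cannot open shared object file: No such file or directory\n"
        "2021-05-11 10:56:28.326215: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
        "Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: "
        "cannot open shared object file: No such file or directory\n"
    )
    for name in missing_libraries(sample):
        print(name)
```

The version suffixes in the names (for example libcudart.so.11.0 and libcudnn.so.8) hint at which CUDA Toolkit and cuDNN versions this TF build expects.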

I had to spend some time setting up the VM, so I am posting my solution step by step below.

Upvotes: 2

Views: 4197

Answers (1)

pherath

Reputation: 93

Instructions for setting up TensorFlow 2.4.x (tested with 2.4.1) on Ubuntu 20.04 without admin rights. It is assumed that a sysadmin has already installed the latest CUDA drivers. The setup consists of installing the CUDA 11.0 toolkit and cuDNN 8.2.0.

The instructions below install CUDA 11.0 (tested with TensorFlow 2.4.1) under the directory /home/pherath/cuda_toolkits/cuda-11.0 without sudo rights.

Step 1. Download CUDA 11.0

wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
chmod +x cuda_11.0.2_450.51.05_linux.run

Step 2, Option 1: For a quick, automated install, use the following

./cuda_11.0.2_450.51.05_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0

Step 2, Option 2: Here is a step-by-step guide for the interactive installer

./cuda_11.0.2_450.51.05_linux.run

Continue, then accept the EULA.

Leave only Cuda Toolkit checked, uncheck everything else. Then go to Options.

Go into Toolkit Options.

Uncheck everything, then go to Change Toolkit Install Path and replace it with /home/pherath/cuda_toolkits/cuda-11.0. After this step, proceed with Install.

Step 3. Download CUDA 11.0 patch

wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run
chmod +x cuda_11.0.3_450.51.06_linux.run

Step 4, Option 1: Quick and silent mode

./cuda_11.0.3_450.51.06_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0

Step 4, Option 2: GUI mode. Repeat the exact steps of Step 2, Option 2.

The installation might report an error. In the logs, the error I saw suggests a bug in the installation script: the only thing that fails is the creation of the symbolic link for one file.

[ERROR]: boost::filesystem::create_symlink: File exists: "libcuinj64.so.11.0", "/home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib/libcuinj64.so"

In other installation attempts (e.g., on Ubuntu 16.04), I came across the same kind of single error for one of these files:
libcuinj64.so.11.0, libaccinj64.so.11.0, libnvrtc-builtins.so.11.0

This error can be fixed with the following 2 lines (shown for libaccinj64; substitute the library named in your error)

cd /home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib # move to the dir of the offending file
ln -sf libaccinj64.so.11.0 libaccinj64.so # recreate the link so that link and target are in the correct order (we need libaccinj64.so -> libaccinj64.so.11.0); -f replaces the existing entry that caused "File exists"
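If several links need the same repair, it can be scripted. Below is a rough sketch, assuming the link and its target live in the same directory as in the error above; the helper name relink is hypothetical, and the demo runs in a throwaway temp directory rather than the real toolkit path.

```python
import os
import tempfile


def relink(lib_dir, link_name, target_name):
    """Make lib_dir/link_name a relative symlink pointing at target_name."""
    link_path = os.path.join(lib_dir, link_name)
    if not os.path.exists(os.path.join(lib_dir, target_name)):
        raise FileNotFoundError(target_name + " not found in " + lib_dir)
    # Create the link under a temporary name, then rename it over the old entry,
    # so a stale file or a wrong link at link_path is replaced in one step.
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target_name, tmp_path)
    os.replace(tmp_path, link_path)
    return os.readlink(link_path)


if __name__ == "__main__":
    # Demo on dummy files; point lib_dir at your own targets/x86_64-linux/lib instead.
    demo_dir = tempfile.mkdtemp()
    open(os.path.join(demo_dir, "libaccinj64.so.11.0"), "w").close()
    open(os.path.join(demo_dir, "libaccinj64.so"), "w").close()  # stale entry
    print("libaccinj64.so ->", relink(demo_dir, "libaccinj64.so", "libaccinj64.so.11.0"))
```

The link is relative on purpose, matching what ln -s does when run from inside the lib directory.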

Step 5. Download CuDNN 8.2.0

cd /home/pherath/cuda_toolkits # move back to the parent of previous dir

You will need to download the cuDNN .tgz file from the cuDNN archive; I used v8.2.0. This step requires an NVIDIA developer account and a download through the web interface. If the machine you are setting up TensorFlow on has no GUI, I suggest using the "Link Redirect Trace" add-on (available for Google Chrome) to find the exact link the file is downloaded from. Trace the link on your local computer with a GUI, then use wget on the VM to download from the traced link. Note that the traced link expires after a relatively short time.

After downloading, the file will still have an obfuscated name; rename it back to .tgz with

mv "$some_ambiguous_name" cudnn-11.3-linux-x64-v8.2.0.53.tgz

Now extract the archive in the parent of the CUDA installation dir

tar -xvzf cudnn-11.3-linux-x64-v8.2.0.53.tgz # this will extract things under a dir called 'cuda'

Now copy everything in lib64 and include to the corresponding dirs under the CUDA toolkit installation

cp -Pfv cuda/lib64/*.* cuda-11.0/lib64/. # -P keeps the libcudnn symlinks as symlinks instead of copying each one as a full file
cp -fv cuda/include/*.* cuda-11.0/include/.
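To double check that the headers landed in the right place, the cuDNN version can be read back from the copied include dir. This is a small sketch, assuming the cuDNN 8.x layout where the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros are defined in cudnn_version.h; the function name and the default relative path are my own choices.

```python
import os
import re
import sys


def cudnn_version(header_text):
    """Return (major, minor, patch) parsed from cudnn_version.h text, or None."""
    parts = []
    for key in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(
            r"^\s*#\s*define\s+CUDNN_" + key + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None  # macro missing: wrong file or unexpected layout
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Assumed default: run from the cuda_toolkits dir; or pass the header path as an argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "cuda-11.0/include/cudnn_version.h"
    if os.path.isfile(path):
        with open(path, errors="replace") as fh:
            print("cuDNN version:", cudnn_version(fh.read()))
    else:
        print("header not found:", path)
```

For the archive used here, the reported version should correspond to v8.2.0.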

Step 6. Create/append/prepend PATH and LD_LIBRARY_PATH environment variables.

Add the following lines to the end of your ~/.bashrc (otherwise, make sure to extend the corresponding environment variables in each shell you run your TF scripts from).

export CUDA11=/home/pherath/cuda_toolkits/cuda-11.0
export PATH=$CUDA11/bin:$PATH
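# ${VAR:+:$VAR} appends the old value only if it is set, avoiding a trailing ':' (which would add the current directory to the search path)
export LD_LIBRARY_PATH=$CUDA11/lib64:$CUDA11/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}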

Either start a new terminal or run

source ~/.bashrc 

in each existing terminal.
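Before involving TensorFlow at all, you can check whether the dynamic loader now finds the libraries. The sketch below tries to open each library named in the warnings from the question using ctypes; the list and the helper names are my own, and other TF versions will need different names.

```python
import ctypes

# Library names taken from the "Could not load dynamic library" warnings in the question.
TF24_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def can_load(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


def unloadable(names):
    """Return the subset of `names` that the loader could not open."""
    return [name for name in names if not can_load(name)]


if __name__ == "__main__":
    failed = unloadable(TF24_LIBS)
    if failed:
        print("still missing:", ", ".join(failed))
    else:
        print("all libraries loadable")
```

LD_LIBRARY_PATH is read when the process starts, so run this from a terminal where the exports above are already in effect.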

Check if installation worked

You can run the following lines to test if TF 2.4.1 + profiler works:

conda create -n tf python==3.7 -y  # create a python environment
conda activate tf #activate the virtual environment (here conda)
pip install tensorflow==2.4.1 # install tf 2.4.1
python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU'); tf.profiler.experimental.start('.'); tf.profiler.experimental.stop()" # test to see if TF with GPU works

#########################################################################

If you instead want to install CUDA Toolkit 10.2 on Ubuntu 20.04 LTS, the one-line installation command changes as follows (you need to add --librarypath, and use --override to bypass the complaint about the mismatching gcc version).

./cuda_10.2.89_440.33.01_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-10.2 --librarypath=/home/pherath/cuda_toolkits/cuda-10.2 --override

Keep in mind that you need to repeat this process for the CUDA Toolkit 10.2 patches as well. Afterwards, download the corresponding cuDNN and copy lib64 and include into the CUDA toolkit's directory (same as in the instructions above).

#########################################################################

If you still get errors, there is a good chance that the right CUDA/NVIDIA drivers are not installed. Fixing this requires sudo rights!

1. First, purge all CUDA/NVIDIA content (I cannot add a reference due to limited reputation); basically, run the lines below with sudo rights.

apt clean
apt update
apt purge cuda
apt purge nvidia-*
apt autoremove
apt install cuda

2. Follow the instructions from Google: https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#ubuntu-driver-steps

3. Reboot the machine.

Upvotes: 6
