Tensorflow is unable to use all visible GPUs

Question

I have a cluster with 8 GPUs and I would like to run a python script on it. I know the script is fine, because it runs on a single GPU cluster. However, when trying to run on this 8 gpu cluster, I am receiving the following error message:

to use: AVX2 AVX512F FMA
2018-03-29 18:42:51.800702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3d:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:52.347624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3e:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:52.882324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:60:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:53.591909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:61:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:54.149671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 4 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b1:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:54.715701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 5 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b2:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.286011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 6 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:da:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.874676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 7 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:db:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-03-29 18:42:55.929779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] Device peer to peer matrix
2018-03-29 18:42:55.930506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1233] DMA: 0 1 2 3 4 5 6 7
2018-03-29 18:42:55.930524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 0:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 1:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 2:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 3:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 4:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 5:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 6:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 7:   Y Y Y Y Y Y Y Y
2018-03-29 18:42:55.930741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2018-03-29 18:43:00.106517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10415 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:3d:00.0, compute capability: 6.1)
2018-03-29 18:43:00.572522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-29 18:43:01.039866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10415 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:60:00.0, compute capability: 6.1)
2018-03-29 18:43:01.512332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10415 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:61:00.0, compute capability: 6.1)
2018-03-29 18:43:02.036327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10415 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1080 Ti, pci bus id: 0000:b1:00.0, compute capability: 6.1)
2018-03-29 18:43:02.679167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10415 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1080 Ti, pci bus id: 0000:b2:00.0, compute capability: 6.1) 
killed

It just blandly says killed and I'm not sure why this error occurs. I tried specifying just two GPUs using the following command:

CUDA_VISIBLE_DEVICES=0,1 python3 my_script.py

But that printed the following error:

Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-29 18:47:46.208490: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7102 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-29 18:47:46.210296: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
Aborted (core dumped)

I install tensorflow-gpu using the following commands:

pip3 install tensorflow-gpu 
pip3 install --upgrade tensorflow-gpu

Could this potentially have anything to do with "activating" tensorflow? I'm not sure how to do this on the cluster, since I'm not sure if this considered virtual environment

Gabriel Paz · Accepted Answer

You need to downgrade your cuDNN version. I solved this problem with 7.0.5.

Download cuDNN v7.0.5 (Dec 5, 2017), for CUDA 9.0

Download .tar file from cuDNN v7.0.5 Library for Linux.

(on Ubuntu 16)

Before, you need to remove all cuDNN files:

sudo rm -rf /usr/local/cuda/include/cudnn.h
sudo rm -rf /usr/local/cuda/lib64/libcudnn*

Now extract your new cuDNN from the file downloaded:

tar xvzf cudnn-9.0-linux-x64-v7.tgz

Move new files to cuda directory:

sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include    
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64

Set permissions to this files:

sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

Tensorflow is unable to use all visible GPUs

Answers (1)

Related Questions