Reputation: 303
From the tensorflow website (https://www.tensorflow.org/guide/using_gpu) I found the following code to manually specify the use of a CPU instead of a GPU:
# Creates a graph.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

# Runs the op.
print(sess.run(c))
I tried running this on my machine (with 4 GPUs) and got the following error:
2018-11-05 10:02:30.636733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:18:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-05 10:02:30.863280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3b:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-05 10:02:31.117729: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11721506816
Traceback (most recent call last):
File "./tf_test.py", line 10, in <module>
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
File ".../anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1566, in __init__
super(Session, self).__init__(target, graph, config=config)
File ".../anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 636, in __init__
self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
It seems that when I create the session, tensorflow tries to initialize a StreamExecutor on every device, and the out-of-memory failure on any one of them aborts session creation entirely. Unfortunately, one of the GPUs is fully occupied by my colleague right now. I would hope that his use of one GPU would not prevent me from using a different device (whether GPU or CPU), but that does not seem to be the case.
Does anyone know a workaround to this? Perhaps something to add to the config? Is this something that could be fixed in tensorflow?
FYI... here is the output of "gpustat -upc":
<my_hostname> Mon Nov 5 10:19:47 2018
[0] GeForce GTX 1080 Ti | 36'C, 0 % | 10 / 11178 MB |
[1] GeForce GTX 1080 Ti | 41'C, 0 % | 10 / 11178 MB |
[2] GeForce GTX 1080 Ti | 38'C, 0 % | 11097 / 11178 MB | <my_colleague>:python2/148901(11087M)
[3] GeForce GTX 1080 Ti | 37'C, 0 % | 10 / 11178 MB |
Upvotes: 2
Views: 1618
OK... so with the help of my colleague, I have a workable solution. The key is, in fact, a modification to the config. Specifically, something like this:
config.gpu_options.visible_device_list = '0'
This will ensure that tensorflow only sees GPU 0.
In fact, I was able to run the following:
#!/usr/bin/env python
import tensorflow as tf

with tf.device('/gpu:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# Creates a session with log_device_placement set to True.
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.visible_device_list = '0,1,3'
sess = tf.Session(config=config)

# Runs the op.
print(sess.run(c))
Notice that this code actually specifies running on GPU 2 (which, you might remember, is the one that is full). This is an important point: the GPUs are renumbered according to visible_device_list, so in the code above, '/gpu:2' refers to the 3rd GPU in the list '0,1,3', which is actually physical GPU 3. This can bite you if you try this:
#!/usr/bin/env python
import tensorflow as tf

with tf.device('/gpu:1'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# Creates a session with log_device_placement set to True.
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.visible_device_list = '1'
sess = tf.Session(config=config)

# Runs the op.
print(sess.run(c))
The problem is that it's looking for the 2nd GPU (index 1) in the visible list, but that list contains only one GPU. The error you will get is as follows:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'a': Operation was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ]. Make sure the device specification refers to a valid device.
[[Node: a = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,3] values: [1 2 3]...>, _device="/device:GPU:1"]()]]
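The renumbering behind both errors can be sketched in plain Python. This mapping is just an illustration of how I understand the behavior, not a TensorFlow API:

```python
# With visible_device_list = '0,1,3', TensorFlow renumbers the visible
# physical GPUs consecutively from zero, so the logical device /gpu:N
# refers to the N-th entry of the list, not to physical GPU N.
visible = '0,1,3'.split(',')  # physical GPU IDs made visible
logical_to_physical = {i: phys for i, phys in enumerate(visible)}
print(logical_to_physical)  # {0: '0', 1: '1', 2: '3'}

# So '/gpu:2' lands on physical GPU 3, while with
# visible_device_list = '1' there is no logical /gpu:1 at all:
print(1 in dict(enumerate('1'.split(','))))  # False
```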
It still seems odd to me that I must specify a GPU list when I want to run on the CPU. I tried using an empty list and it failed, so if all 4 GPUs were in use, I would not have a workaround. Anyone else have a better idea?
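One possible CPU-only workaround, sketched here under the assumption that the CUDA runtime honors CUDA_VISIBLE_DEVICES as documented (I have not verified this on the exact setup above): hide every GPU from CUDA itself, so session creation never touches the busy device.

```python
import os

# CUDA_VISIBLE_DEVICES is read when the CUDA runtime first initializes,
# so it must be set before the first `import tensorflow`. An empty
# value hides every GPU; TensorFlow should then see only /cpu:0 and
# never try to create a StreamExecutor on the occupied device.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# import tensorflow as tf   # imported only after the variable is set
# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
```

The same effect is available per invocation from the shell, e.g. CUDA_VISIBLE_DEVICES="" python ./tf_test.py, which avoids editing the script at all.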
Upvotes: 1