orome

Reputation: 48526

What can I do to run specific TensorFlow calculations on a CPU as part of a GPU implementation?

I'm puzzled about how to efficiently assign my TensorFlow operations and variables to devices. It's clear that, at least for my implementation of a basic convolutional neural network, placing as many operations as possible on a GPU is desirable. But the GPU I currently have access to has limited memory, which results in many warnings of the form

Ran out of memory trying to allocate 2.60GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

and occasional crashes for certain specific operations, like

Ran out of memory trying to allocate 83.74MiB.  See logs for memory state.
Resource exhausted: OOM when allocating tensor with shape[28000,1,28,28]

These can be avoided by placing variables on CPUs, but in my implementation, this results in training epochs taking 10 times as long to compute.
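To be concrete, by "placing variables on CPUs" I mean something along these lines (a minimal sketch with illustrative names, not my actual model):

with tf.device('/cpu:0'):
    # Variables live in host memory; GPU ops pull them over PCIe each step.
    W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
    b_conv1 = tf.Variable(tf.constant(0.1, shape=[32]))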

Clearly the ideal policy is to identify the specific chunks of code that generate errors and place only those on the CPU. But it is unclear to me how to do this, because those calculations can't easily be isolated from others that need GPU placement to run efficiently.

For example, simply generating predictions on a test set with something like

evals = sess.run(tf.argmax(y, 1), feed_dict={x: use_x_all})

where x is a tf.placeholder of inputs to my model, and y are the output activations of my network, produces the above error when use_x_all is a large array (here with 28000 examples). Attempting to put this calculation alone on a CPU fails, presumably because the network evaluation producing y is on the GPU.
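Such an attempt looks roughly like the following sketch: only the argmax is pinned to the CPU, while the network producing y keeps its GPU placement, so evaluating it over all 28000 examples still exhausts GPU memory.

with tf.device('/cpu:0'):
    # Only this op is pinned to the CPU; the ops computing y stay on the GPU.
    predict_op = tf.argmax(y, 1)
evals = sess.run(predict_op, feed_dict={x: use_x_all})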

Because of this I (seem to) need to resort to approaches like

use_x_all, _ = data_loader.stack_data(use_data, as_cols=False)
use_x_split = np.split(use_x_all, splits)
for use_x in use_x_split:
    # ... (full example below)
    evals_part = sess.run(tf.argmax(y, 1), feed_dict={x: use_x})
    # accumulate evals

which clearly doesn't scale.

Is there a better way? Specifically:

Is there a way to place calculations like the one above on a CPU and still have those calculations for the same graph (e.g. training) run on a GPU?

or, alternatively:

Can TensorFlow handle such calculations gracefully on its own, for example by splitting them up internally?

Actually, I'm surprised that the latter isn't part of the TensorFlow API. Shouldn't it be possible to automatically break up calculations that don't fit on a device, without requiring code such as that above?


Full example from my code:

f = open('{0:s}/{1:s}_{2:3.0f}.csv'.format(FLAGS.pred_dir, FLAGS.session_name,
                                                       10000*float(sess.run(accuracy, feed_dict=valid_feed))), 'w')
f.write('ImageId,Label\n')
use_x_all, _ = data_loader.stack_data(use_data, as_cols=False)
use_x_split = np.split(use_x_all, splits)
last = 0
buff = ''
for use_x in use_x_split:
    evals = sess.run(tf.argmax(y, 1), feed_dict={x: use_x})
    f.write('\n'.join('{0},{1}'.format(r[0]+ last, r[1]) for r in enumerate(evals, start=1)))
    last += int(len(use_x_all)/splits)
    if last < len(use_x_all):
        f.write('\n')
f.close()

Upvotes: 3

Views: 2423

Answers (2)

rdadolf

Reputation: 1248

Short answer: You can split computation, but you'll need to think a bit about the right way. Also, smaller batches are a reasonable idiom to think about here.

Long answer:

Heterogeneous placement is possible, but I think it's a bit more involved than you let on. Consider the following (abstract, simplified) setup:

[figure: an abstract, simplified network setup]

The problem here is that all of the tensors required to evaluate this network will not fit on our GPU. Unfortunately, it's not as simple as "identifying specific chunks of code that generate errors", since memory allocation problems are usually an aggregate issue: the errors are simply a symptom of the total being too large, not an indication that the particular allocation named in the message is too large. You're just seeing the straw that broke the camel's back, so to speak.

As you point out, we'd like to run calculations on the GPU for better throughput, but allocating haphazardly will move a lot of data over the PCI interface, throwing away any performance gains we would have gotten:

[figure: two possible partitions of the graph between GPU and CPU]

In the left partitioning scheme, we're moving far less data over the interface than in the right one.

Except that there are myriad other ways you could partition the computational graph to achieve different balances...

[figure: other possible partitions of the graph]

and each of these is going to behave differently in terms of performance. This is why TF doesn't auto-partition your graph; it's actually not a straightforward problem in the general case.

Tips for partitioning computation

Back to your specific problem. Your approach (smaller batches) is one viable option: it effectively shrinks the size of all allocation requests. Partitioning the graph, as suggested by others, can also be viable, but you should do it intelligently: minimize inter-device memory movement, and try to place computationally dense pieces of the graph (like a block of convolutional layers) together on the same device.
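For instance (just a sketch, with hypothetical names standing in for your actual layers), you might keep the convolutional stack together on the GPU and push only the large, memory-hungry readout onto the CPU:

# Sketch only -- conv_stack, flat_dim, W_fc and b_fc are hypothetical stand-ins.
with tf.device('/gpu:0'):
    # Computationally dense part: keep the conv layers together on the GPU.
    conv_out = conv_stack(x)
with tf.device('/cpu:0'):
    # Memory-hungry readout: the large matmul and argmax live in host memory.
    logits = tf.matmul(tf.reshape(conv_out, [-1, flat_dim]), W_fc) + b_fc
    prediction = tf.argmax(logits, 1)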

One helpful tactic might be to compute (either by hand or using TensorBoard) the size of the tensors in different pieces of your graph. This should give you a feel for how large pieces of the network are relative to each other. That, in turn, provides a reasonable guide to how to partition it.
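A rough by-hand estimate can be made by walking the graph and summing the statically-known shapes (a sketch; it ignores tensors whose dimensions aren't fully defined):

# Sketch: print the size of every statically-shaped tensor in the default graph.
for op in tf.get_default_graph().get_operations():
    for t in op.outputs:
        shape = t.get_shape()
        if shape.is_fully_defined():
            n_elements = 1
            for dim in shape.as_list():
                n_elements *= dim
            print('{:<50s} {:>12d} bytes'.format(t.name, n_elements * t.dtype.size))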

Finally, if you're only doing inference (no training), another solution is to evaluate only some of the network at a time. This is sort of the dual of the smaller-batches approach: instead of one large network with many small batches, you could use a few small networks with a few large batches.
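Since TensorFlow lets you feed a value for (almost) any tensor in the graph, one way to do this is to cut the graph at some intermediate tensor and run the two halves in separate sess.run calls (a sketch; 'hidden' stands for whatever tensor you choose to cut at):

# Stage 1: evaluate only the front of the network.
hidden_vals = sess.run(hidden, feed_dict={x: use_x_all})
# Stage 2: evaluate only the back, feeding the intermediate result;
# the ops that would normally produce 'hidden' are skipped.
evals = sess.run(tf.argmax(y, 1), feed_dict={hidden: hidden_vals})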

Upvotes: 6

fabmilo

Reputation: 48330

Is there a way to place calculations like the one above on a CPU and still have those calculations for the same graph (e.g. training) run on a GPU?

You can use explicit and nested device placement. Use the logging to see where the operations are placed.

import tensorflow as tf

# Log device placement so you can verify where each op ends up.
config = tf.ConfigProto()
config.log_device_placement = True
s = tf.InteractiveSession(config=config)

# The matmul is placed on the GPU...
with tf.device("/gpu:0"):
    m = tf.matmul(tf.constant([[1, 2, 3, 4]]), tf.constant([[1], [1], [1], [1]]))
    # ...while the nested block places the addition on the CPU.
    with tf.device("/cpu:0"):
        m = m + 1

s.run(m)
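Applied to the prediction op in the question, the same pattern might look like this (a sketch; x, y and sess are assumed to come from the existing graph and session):

# Only the argmax is pinned to the CPU; the rest of the graph keeps its placement.
with tf.device("/cpu:0"):
    predict_op = tf.argmax(y, 1)
evals = sess.run(predict_op, feed_dict={x: use_x_all})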

Upvotes: -1
