adamconkey

Reputation: 4745

Best practice for allocating GPU and CPU resources in TensorFlow

I'm wondering what the correct way is to set devices for creating/training a model in order to optimize resource usage for speedy training in TensorFlow with the Keras API. I have 1 CPU and 2 GPUs at my disposal. I was initially using a tf.device context to create my model and train on GPUs only, but then I saw in the TensorFlow documentation for tf.keras.utils.multi_gpu_model that they suggest explicitly instantiating the model on the CPU:

import tensorflow as tf
from tensorflow.keras.applications import Xception
from tensorflow.keras.utils import multi_gpu_model

# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

I did this, and now when I train I see my CPU usage go way up, with all 8 cores at about 70% usage each, and my GPU memory maxed out. Would things go faster if the model were created on one of the GPUs? Even if I have just 1 GPU, is it still better to create the model on the CPU and then use a tf.device context to train on the GPU?

Upvotes: 1

Views: 1832

Answers (1)

user11530462


Many TensorFlow operations are accelerated using the GPU for computation. Without any annotations, TensorFlow automatically decides whether to use the GPU or CPU for an operation—copying the tensor between CPU and GPU memory, if necessary. Tensors produced by an operation are typically backed by the memory of the device on which the operation executed.
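For instance, you can inspect where a tensor lives and pin an operation to a particular device explicitly. A minimal sketch (the tensor shape here is arbitrary):

import tensorflow as tf

x = tf.random.uniform([1000, 1000])
# If a GPU is visible, x is typically placed there, e.g.
# '/job:localhost/replica:0/task:0/device:GPU:0'
print(x.device)

# Force this matmul onto the CPU; TensorFlow copies x to CPU memory if needed.
with tf.device('/CPU:0'):
    y = tf.matmul(x, x)
print(y.device)  # '/job:localhost/replica:0/task:0/device:CPU:0'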

TensorFlow will only allocate memory and place operations on visible physical devices; otherwise, no LogicalDevice is created for them. By default, all discovered devices are marked as visible.
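As a sketch, you can restrict which GPUs are visible before any tensors are created (assuming at least one GPU is present):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Make only the first GPU visible; the second GPU gets no LogicalDevice.
    # This must run before the GPUs have been initialized.
    tf.config.set_visible_devices(gpus[0], 'GPU')
print(tf.config.list_logical_devices('GPU'))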

GPU utilization also depends on the batch_size, and may change as you vary it; see the timing sketch below.
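A rough way to see this is to time short runs at different batch sizes. This is only a sketch; x_train, y_train, and parallel_model are assumed to come from your setup above:

import time

for batch_size in (32, 64, 128, 256):
    start = time.time()
    parallel_model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
    print(batch_size, 'took', time.time() - start, 'seconds')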

You can also compare your current results (time taken and utilization) with a model built following Example 3 from the multi_gpu_model documentation.

Also, if you follow that link, it states:

Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.

There should be a performance improvement and better GPU utilization with tf.distribute.MirroredStrategy. This strategy is typically used for training on one machine with multiple GPUs. The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to let users enable distributed training with existing models and training code, with minimal changes.

For example, a variable created under a MirroredStrategy is a MirroredVariable. If no devices are specified in the constructor argument of the strategy then it will use all the available GPUs. If no GPUs are found, it will use the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device, and uses threads internally for parallelism.
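Here is a minimal sketch of replacing multi_gpu_model with MirroredStrategy, where height, width, num_classes, and train_dataset stand in for your own values:

import tensorflow as tf

# Uses all visible GPUs by default; pass e.g.
# devices=["/gpu:0", "/gpu:1"] to pick specific ones.
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

# Variables created inside the scope become MirroredVariables,
# replicated across the GPUs.
with strategy.scope():
    model = tf.keras.applications.Xception(weights=None,
                                           input_shape=(height, width, 3),
                                           classes=num_classes)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

model.fit(train_dataset, epochs=10)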

I would recommend going through the Custom training with tf.distribute.Strategy tutorial, which demonstrates how to use tf.distribute.Strategy with custom training loops. It trains a simple CNN model on the Fashion MNIST dataset.

Hope this answers your question. Happy Learning.

Upvotes: 1
