Reputation: 81
While training my model, GPU utilization sits around 40%, and the TensorFlow profiler clearly shows a data-copy operation taking a lot of time (see attached picture). I presume the "MEMCPYHtoD" operation is copying each batch from CPU to GPU and is blocking the GPU from doing useful work. Is there any way to prefetch data to the GPU, or is there some other problem I am not seeing?
Here is the code for the dataset:
import tensorflow as tf

# Placeholders so the full numpy arrays are fed once, at iterator init time
X_placeholder = tf.placeholder(tf.float32, data.train.X.shape)
y_placeholder = tf.placeholder(tf.float32, data.train.y[label].shape)

dataset = tf.data.Dataset.from_tensor_slices({"X": X_placeholder,
                                              "y": y_placeholder})
dataset = dataset.repeat(1000)   # repeat for 1000 epochs
dataset = dataset.batch(1000)    # batches of 1000 examples
dataset = dataset.prefetch(2)    # prefetch 2 batches (on the host/CPU)

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
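For completeness, the iterator is initialized by feeding the placeholders once, and batches are then pulled inside the training loop, roughly like this (the session setup and training step here are only illustrative, not part of the original pipeline):

with tf.Session() as sess:
    # feed the full arrays once; the dataset then produces the batches
    sess.run(iterator.initializer,
             feed_dict={X_placeholder: data.train.X,
                        y_placeholder: data.train.y[label]})
    while True:
        try:
            batch = sess.run(next_element)   # batch["X"], batch["y"]
            # ... run the training op on this batch ...
        except tf.errors.OutOfRangeError:
            break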
Upvotes: 8
Views: 2173
Reputation: 2298
Prefetching to a single GPU:

- Consider a more flexible approach than prefetch_to_device, e.g. explicitly copying to the GPU with tf.data.experimental.copy_to_device(...) and then prefetching. This avoids the restriction that prefetch_to_device must be the last transformation in the pipeline, and lets you incorporate further tricks to optimize Dataset pipeline performance (e.g. by overriding the threadpool distribution).
- Try the tf.contrib.data.AUTOTUNE option for prefetching, which allows the tf.data runtime to automatically tune the prefetch buffer sizes based on your system and environment.

In the end, you might end up doing something like this:
dataset = dataset.apply(tf.data.experimental.copy_to_device("/gpu:0"))
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
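Slotted into the pipeline from the question, this would look roughly as follows (a sketch: copy_to_device is applied after the CPU-side transformations, and prefetch stays the final step, with its buffer size left to AUTOTUNE):

dataset = tf.data.Dataset.from_tensor_slices({"X": X_placeholder,
                                              "y": y_placeholder})
dataset = dataset.repeat(1000)   # CPU-side transformations come first
dataset = dataset.batch(1000)
# copy each batch to the GPU, then prefetch there, so the host-to-device
# copy overlaps with the training step instead of blocking it
dataset = dataset.apply(tf.data.experimental.copy_to_device("/gpu:0"))
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()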
Upvotes: 5
Reputation: 7089
I believe you can now fix this problem by using prefetch_to_device. Instead of the line:
dataset = dataset.prefetch(2)
do
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0', buffer_size=2))
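In the pipeline from the question this replaces the existing prefetch call, and (as noted in the other answer) prefetch_to_device has to be the last transformation before the iterator is created. A sketch, with everything else unchanged:

dataset = dataset.repeat(1000)
dataset = dataset.batch(1000)
# prefetch_to_device must be the final transformation in the pipeline
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0', buffer_size=2))
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()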
Upvotes: 2