mbsariyildiz

Reputation: 43

Capacity of queue in tf.data.Dataset

I have a problem with TensorFlow's new input pipeline mechanism. When I create a data pipeline with tf.data.Dataset that decodes JPEG images and loads them into a queue, it tries to load as many images as it can into the queue. If the throughput of loading images is greater than the throughput at which my model consumes them, memory usage grows without bound.

Below is the code snippet that builds the pipeline with tf.data.Dataset:

def _imread(file_name, label):
  # Read, decode, resize, and scale a single JPEG to [-1, 1].
  _raw = tf.read_file(file_name)
  _decoded = tf.image.decode_jpeg(_raw, channels=hps.im_ch)
  _resized = tf.image.resize_images(_decoded, [hps.im_width, hps.im_height])
  _scaled = (_resized / 127.5) - 1.0
  return _scaled, label

n_samples = image_files.shape.as_list()[0]
dset = tf.data.Dataset.from_tensor_slices((image_files, labels))
dset = dset.shuffle(n_samples, None)
dset = dset.repeat(hps.n_epochs)
dset = dset.map(_imread, hps.batch_size * 32)
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(hps.batch_size * 2)

Here image_files is a constant tensor containing the filenames of 30k images. Images are resized to 256x256x3 in _imread.
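
For completeness, image_files and labels could be built roughly like this (the directory layout and placeholder labels below are assumptions for illustration, not part of my actual setup):

import glob
import tensorflow as tf

# Assumed layout: all training JPEGs under data/train/, with placeholder labels.
file_list = sorted(glob.glob("data/train/*.jpg"))
label_list = [0] * len(file_list)

image_files = tf.constant(file_list, dtype=tf.string)  # shape: [n_samples]
labels = tf.constant(label_list, dtype=tf.int32)       # shape: [n_samples]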

If I build the pipeline with the following snippet instead:

# refer to "https://www.tensorflow.org/programmers_guide/datasets"
def _imread(file_name, hps):
  _raw = tf.read_file(file_name)
  _decoded = tf.image.decode_jpeg(_raw, channels=hps.im_ch)
  _resized = tf.image.resize_images(_decoded, [hps.im_width, hps.im_height])
  _scaled = (_resized / 127.5) - 1.0
  return _scaled

n_samples = image_files.shape.as_list()[0]

image_file, label = tf.train.slice_input_producer(
  [image_files, labels],
  num_epochs=hps.n_epochs,
  shuffle=True,
  seed=None,
  capacity=n_samples,
)

# Decode image.
image = _imread(image_file, hps)

images, labels = tf.train.shuffle_batch(
  tensors=[image, label],
  batch_size=hps.batch_size,
  capacity=hps.batch_size * 64,
  min_after_dequeue=hps.batch_size * 8,
  num_threads=32,
  seed=None,
  enqueue_many=False,
  allow_smaller_final_batch=True
)

Then memory usage stays almost constant throughout training. How can I make tf.data.Dataset load a fixed amount of samples? Is the pipeline I create with tf.data.Dataset correct? I think the buffer_size argument of tf.data.Dataset.shuffle applies to image_files and labels, so storing 30k strings shouldn't be a problem, right? Even if all 30k decoded images were loaded, they would require about 30000 * 256 * 256 * 3 * 8 / 1024^3 ≈ 44 GB of memory. However, the pipeline uses 59 GB of the 61 GB of system memory.
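
As a sanity check on that figure (plain Python, using the 8 bytes per element assumed above; decoded float32 images would take half of this):

# Rough memory estimate for keeping all decoded images in memory.
n_images, height, width, channels = 30000, 256, 256, 3
bytes_per_element = 8  # assumed above; float32 would be 4

total_gb = n_images * height * width * channels * bytes_per_element / 1024**3
print(round(total_gb, 1))  # prints ~43.9 (GB)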

Upvotes: 2

Views: 1198

Answers (1)

David Parks

Reputation: 32081

This will buffer n_samples, which looks to be your entire dataset. You might want to cut down on the buffering here.

dset = dset.shuffle(n_samples, None)
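
A buffer of a few thousand filename/label pairs is usually plenty here; for example (10,000 is just an illustrative number):

dset = dset.shuffle(buffer_size=10000)  # shuffles filename/label pairs, not decoded images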

You might as well just repeat forever; repeat won't buffer (see: Does `tf.data.Dataset.repeat()` buffer the entire dataset in memory?)

dset = dset.repeat()
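
If you still want a fixed number of passes over the data, you can bound the training loop by step count instead. A sketch, assuming a standard TF 1.x session loop where train_op reads from the dataset's iterator:

steps_per_epoch = n_samples // hps.batch_size
for step in range(hps.n_epochs * steps_per_epoch):
    sess.run(train_op)  # train_op is assumed to consume the dataset's iterator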

You are batching and then prefetching hps.batch_size * 2 batches (prefetch after batch counts batches, not samples). Ouch!

dset = dset.batch(hps.batch_size)
dset = dset.prefetch(hps.batch_size * 2)

Let's say hps.batch_size=1000 to make this concrete. The first line above creates batches of 1000 images each. The second line builds a prefetch buffer of 2000 of those batches, i.e. up to 2,000,000 images held in memory. Oops!
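
To put a rough number on that buffer, assuming 256x256x3 float32 images:

images_buffered = 2000 * 1000                       # prefetch(2000) batches of 1000 images
bytes_total = images_buffered * 256 * 256 * 3 * 4   # 4 bytes per float32 element
print(bytes_total / 1024**4)                        # ~1.4 TB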

You meant to do:

dset = dset.batch(hps.batch_size)
dset = dset.prefetch(2)
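
Putting the three changes together, the pipeline from the question would look roughly like this (a sketch; the shuffle buffer of 10,000 is just an example, and _imread and hps are as in the question):

dset = tf.data.Dataset.from_tensor_slices((image_files, labels))
dset = dset.shuffle(buffer_size=10000)  # buffer of filename/label pairs, not the full 30k
dset = dset.repeat()                    # repeat indefinitely; bound training by step count
dset = dset.map(_imread, hps.batch_size * 32)
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(2)                 # a couple of batches in flight is enough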

Upvotes: 2
