mbsariyildiz

Reputation: 43

Capacity of queue in tf.data.Dataset

I have a problem with TensorFlow's new input pipeline mechanism. When I create a data pipeline with tf.data.Dataset that decodes JPEG images and loads them into a queue, it tries to load as many images as it can into the queue. If the throughput of loading images is greater than the throughput at which my model consumes them, memory usage grows without bound.

Below is the code snippet that builds the pipeline with tf.data.Dataset:

def _imread(file_name, label):
  # Read, decode, resize, and scale a single JPEG to [-1, 1].
  _raw = tf.read_file(file_name)
  _decoded = tf.image.decode_jpeg(_raw, channels=hps.im_ch)
  _resized = tf.image.resize_images(_decoded, [hps.im_width, hps.im_height])
  _scaled = (_resized / 127.5) - 1.0
  return _scaled, label

n_samples = image_files.shape.as_list()[0]
dset = tf.data.Dataset.from_tensor_slices((image_files, labels))
dset = dset.shuffle(n_samples, None)
dset = dset.repeat(hps.n_epochs)
dset = dset.map(_imread, hps.batch_size * 32)
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(hps.batch_size * 2)

Here image_files is a constant tensor containing the filenames of 30k images. Images are resized to 256x256x3 in _imread.
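
For completeness, image_files and labels could be built roughly like this (the directory layout and placeholder labels below are assumptions for illustration, not part of my actual setup):

import glob
import tensorflow as tf

# Assumed layout: all training JPEGs under data/train/, with placeholder labels.
file_list = sorted(glob.glob("data/train/*.jpg"))
label_list = [0] * len(file_list)

image_files = tf.constant(file_list, dtype=tf.string)  # shape: [n_samples]
labels = tf.constant(label_list, dtype=tf.int32)       # shape: [n_samples]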

If I build the pipeline with the following snippet instead:

# refer to "https://www.tensorflow.org/programmers_guide/datasets"
def _imread(file_name, hps):
  _raw = tf.read_file(file_name)
  _decoded = tf.image.decode_jpeg(_raw, channels=hps.im_ch)
  _resized = tf.image.resize_images(_decoded, [hps.im_width, hps.im_height])
  _scaled = (_resized / 127.5) - 1.0
  return _scaled

n_samples = image_files.shape.as_list()[0]

image_file, label = tf.train.slice_input_producer(
  [image_files, labels],
  num_epochs=hps.n_epochs,
  shuffle=True,
  seed=None,
  capacity=n_samples,
)

# Decode image.
image = _imread(image_file, hps)

images, labels = tf.train.shuffle_batch(
  tensors=[image, label],
  batch_size=hps.batch_size,
  capacity=hps.batch_size * 64,
  min_after_dequeue=hps.batch_size * 8,
  num_threads=32,
  seed=None,
  enqueue_many=False,
  allow_smaller_final_batch=True
)

Then memory usage stays almost constant throughout training. How can I make tf.data.Dataset load a fixed amount of samples? Is the pipeline I create with tf.data.Dataset correct? I think the buffer_size argument of tf.data.Dataset.shuffle applies to image_files and labels, so storing 30k strings shouldn't be a problem, right? Even if all 30k decoded images were loaded, they would require about 30000 * 256 * 256 * 3 * 8 / 1024^3 ≈ 44 GB of memory. However, the pipeline uses 59 GB of the 61 GB of system memory.
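
As a sanity check on that figure (plain Python, using the 8 bytes per element assumed above; decoded float32 images would take half of this):

# Rough memory estimate for keeping all decoded images in memory.
n_images, height, width, channels = 30000, 256, 256, 3
bytes_per_element = 8  # assumed above; float32 would be 4

total_gb = n_images * height * width * channels * bytes_per_element / 1024**3
print(round(total_gb, 1))  # prints ~43.9 (GB)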

Upvotes: 2

Views: 1198

Answers (1)

David Parks

Reputation: 32081

This will buffer n_samples, which looks to be your entire dataset. You might want to cut down on the buffering here.

dset = dset.shuffle(n_samples, None)
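
A buffer of a few thousand filename/label pairs is usually plenty here; for example (10,000 is just an illustrative number):

dset = dset.shuffle(buffer_size=10000)  # shuffles filename/label pairs, not decoded images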

You might as well just repeat forever; repeat won't buffer (see: Does `tf.data.Dataset.repeat()` buffer the entire dataset in memory?)

dset = dset.repeat()
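
If you still want a fixed number of passes over the data, you can bound the training loop by step count instead. A sketch, assuming a standard TF 1.x session loop where train_op reads from the dataset's iterator:

steps_per_epoch = n_samples // hps.batch_size
for step in range(hps.n_epochs * steps_per_epoch):
    sess.run(train_op)  # train_op is assumed to consume the dataset's iterator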

You are batching and then prefetching hps.batch_size * 2 batches (prefetch after batch counts batches, not samples). Ouch!

dset = dset.batch(hps.batch_size)
dset = dset.prefetch(hps.batch_size * 2)

Let's say hps.batch_size=1000 to make this concrete. The first line above creates batches of 1000 images each. The second line builds a prefetch buffer of 2000 of those batches, i.e. up to 2,000,000 images held in memory. Oops!
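
To put a rough number on that buffer, assuming 256x256x3 float32 images:

images_buffered = 2000 * 1000                       # prefetch(2000) batches of 1000 images
bytes_total = images_buffered * 256 * 256 * 3 * 4   # 4 bytes per float32 element
print(bytes_total / 1024**4)                        # ~1.4 TB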

You meant to do:

dset = dset.batch(hps.batch_size)
dset = dset.prefetch(2)
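
Putting the three changes together, the pipeline from the question would look roughly like this (a sketch; the shuffle buffer of 10,000 is just an example, and _imread and hps are as in the question):

dset = tf.data.Dataset.from_tensor_slices((image_files, labels))
dset = dset.shuffle(buffer_size=10000)  # buffer of filename/label pairs, not the full 30k
dset = dset.repeat()                    # repeat indefinitely; bound training by step count
dset = dset.map(_imread, hps.batch_size * 32)
dset = dset.batch(hps.batch_size)
dset = dset.prefetch(2)                 # a couple of batches in flight is enough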

Upvotes: 2
