saru

Reputation: 255

tf.data pipeline design for optimized performance

I am new to TensorFlow and I would like to know if there is a specific order in which to set up a dataset using tf.data. For example:

    data_files = tf.gfile.Glob("%s%s%s" % ("./data/cifar-100-binary/", self.data_key, ".bin"))
    data = tf.data.FixedLengthRecordDataset(data_files, record_bytes=3074)  # 2 label bytes + 32*32*3 image bytes per CIFAR-100 record
    data = data.map(self.load_transform)
    if self.shuffle_key:
        data = data.shuffle(5000)

    data = data.batch(self.batch_size).repeat(100)
    iterator = data.make_one_shot_iterator()
    img, label = iterator.get_next()
    # label = tf.one_hot(label, depth=100)
    print('img_shape:', img.shape)

In this case I read the data, then shuffle it, and then apply the batch and repeat specifications. With this method my computer's RAM usage increases by about 2%.

Then I tried another ordering:

    data_files = tf.gfile.Glob("%s%s%s" % ("./data/cifar-100-binary/", self.data_key, ".bin"))
    data = tf.data.FixedLengthRecordDataset(data_files, record_bytes=3074)
    data = data.map(self.load_transform)
    data = data.batch(self.batch_size).repeat(100)
    if self.shuffle_key:
        data = data.shuffle(5000)
    iterator = data.make_one_shot_iterator()
    img, label = iterator.get_next()
    # label = tf.one_hot(label, depth=100)
    print('img_shape:', img.shape)

In this case, when I specify the batch size and repeat first and then shuffle, RAM utilization increases by 40% (I do not know why). It would be great if someone could help me figure that out. Is there an order I should always follow when defining a dataset in TensorFlow using tf.data?

Upvotes: 3

Views: 451

Answers (1)

AAudibert

Reputation: 1273

The memory usage increases because you are shuffling batches instead of single records.

data.shuffle(5000) will fill a buffer of 5000 elements, then sample randomly from the buffer to produce the next element.

data.batch(self.batch_size) changes the element type from single records to batches of records. So if you call batch before shuffle, the shuffle buffer will hold 5000 batches, i.e. 5000 * self.batch_size records, instead of just 5000 records.
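To put rough numbers on this (using a hypothetical batch size of 128, since the question does not state one), each CIFAR-100 binary record is 3074 bytes, so the two orderings give very different buffer footprints:

    # Back-of-the-envelope shuffle-buffer memory, counting raw record bytes only.
    # batch_size = 128 is a hypothetical value, not taken from the question.
    record_bytes = 3074   # 2 label bytes + 32*32*3 image bytes
    buffer_size = 5000
    batch_size = 128

    # shuffle before batch: buffer holds single records
    print(buffer_size * record_bytes / 1e6, "MB")               # ~15 MB
    # batch before shuffle: buffer holds whole batches
    print(buffer_size * batch_size * record_bytes / 1e9, "GB")  # ~2 GB

And if load_transform decodes the image bytes to float32, each element is roughly four times larger still, so a multi-gigabyte buffer (and the RAM jump you observed) is exactly what you would expect from shuffling after batching.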

The order of calling shuffle and batch will also affect the data itself. Batching before shuffling will result in all elements of a batch being sequential.

Batch before shuffle (these examples use TF 2.x eager execution):

    >>> dataset = tf.data.Dataset.range(12)
    >>> dataset = dataset.batch(3)
    >>> dataset = dataset.shuffle(4)
    >>> print([element.numpy() for element in dataset])
    [array([ 9, 10, 11]), array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

Shuffle before batch:

    >>> dataset = tf.data.Dataset.range(12)
    >>> dataset = dataset.shuffle(4)
    >>> dataset = dataset.batch(3)
    >>> print([element.numpy() for element in dataset])
    [array([1, 2, 5]), array([4, 7, 8]), array([0, 3, 9]), array([ 6, 10, 11])]

Usually shuffling is done before batching, both to avoid the elements within a batch being sequential and to keep the shuffle buffer small.
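Applied to the pipeline from the question, a minimal sketch of the usual ordering is essentially the asker's first version, with an optional prefetch added at the end to overlap input preparation with training (load_transform, data_key, and shuffle_key are the question's own names):

    data_files = tf.gfile.Glob("%s%s%s" % ("./data/cifar-100-binary/", self.data_key, ".bin"))
    data = tf.data.FixedLengthRecordDataset(data_files, record_bytes=3074)
    data = data.map(self.load_transform)
    if self.shuffle_key:
        data = data.shuffle(5000)        # buffer of 5000 single records, not batches
    data = data.batch(self.batch_size)   # batch after shuffling
    data = data.repeat(100)
    data = data.prefetch(1)              # prepare the next batch while the current one trains
    iterator = data.make_one_shot_iterator()
    img, label = iterator.get_next()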

Upvotes: 1
