bw0248

Reputation: 711

tf.data.Dataset - Why is the performance of my data pipeline not increasing when I cache examples?

I'm currently trying to learn more about building efficient preprocessing pipelines with tf.data. As per this tutorial, there should be a non-negligible effect on performance when the data is cached.

I reduced my data pipeline to a very simple example to verify this effect.

import os
import tensorflow as tf

class ExperimentalDS:
    def __init__(self, hr_img_path, cache, repeat, shuffle_buffer_size=4096):
        self.hr_img_path = hr_img_path
        self.ids = os.listdir(self.hr_img_path)
        self.train_list = self.ids

        train_list_ds = tf.data.Dataset.list_files([f"{hr_img_path}/{fname}" for fname in self.train_list])

        train_hr_ds = train_list_ds.map(self.load_img)
        train_hr_ds = train_hr_ds.shuffle(shuffle_buffer_size)

        self.train_ds = train_hr_ds

        # should probably call shuffle again after caching
        if cache: self.train_ds.cache()
        self.train_ds = train_hr_ds.repeat(repeat)

    def get_train_ds(self, batch_size=8):
        return self.train_ds.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

    def load_img(self, fpath):
        img = tf.io.read_file(fpath)
        img = tf.image.decode_png(img)
        img = tf.image.convert_image_dtype(img, tf.float32)
        return img

The pipeline basically just reads file names from a folder, loads the images from those files, shuffles them, and then either caches them or doesn't, depending on the supplied parameter.

To evaluate the performance I mostly copied the benchmarking function from the aforementioned tutorial.

import time
import numpy as np

def benchmark_dataset(ds, num_steps):
    start = time.perf_counter()
    it = iter(ds)

    for i in range(num_steps):
        batch = next(it)
        if i % 100 == 0:
            print(".", end="")
    print()

    end = time.perf_counter()
    duration = end - start

    return duration

if __name__ == "__main__":
    num_steps = 1000
    batch_size = 8
    durations_no_cache = []
    durations_cached = []
    for i in range(num_steps):
        ds = ExperimentalDS("./test_data/benchmark/16", cache=False, repeat=-1)
        ds_train = ds.get_train_ds(batch_size=batch_size)
        durations_no_cache.append(benchmark_dataset(ds_train, num_steps))

    for i in range(num_steps):
        ds = ExperimentalDS("./test_data/benchmark/16", cache=True, repeat=-1)
        ds_train = ds.get_train_ds(batch_size=batch_size)
        durations_cached.append(benchmark_dataset(ds_train, num_steps))

    os.makedirs(SAVE_PATH, exist_ok=True)
    durations_no_cache = np.array(durations_no_cache)
    avg_duration_no_cache = np.average(durations_no_cache)

    durations_cached = np.array(durations_cached)
    avg_durations_cached = np.average(durations_cached)

    with open(f"{SAVE_PATH}/stats", "a+") as f:
        f.write("no cache:\n")
        f.write(f"{num_steps} batches: {avg_duration_no_cache}s (avg)\n")
        f.write(f"{batch_size*num_steps/avg_duration_no_cache:.5f} Images/s\n\n")
        f.write("cached:\n")
        f.write(f"{num_steps} batches: {avg_durations_cached}s (avg)\n")
        f.write(f"{batch_size*num_steps/avg_durations_cached:.5f} Images/s")

I'm loading a very simple image dataset containing 16 images with dimensions of 128x128 each (assuming 3-channel RGB float32, that's 16 × 128 × 128 × 3 × 4 bytes ≈ 3 MB, so it should easily fit into memory). I repeat this dataset indefinitely, iterate over it for 1000 batches (batch size 8) with and without caching, record the runtime, and then average these results over 1000 runs. Since these are quite a lot of runs, I would assume there shouldn't be much variance. The benchmark was run on a GPU, if that matters.

The results are very surprising to me. The benchmark without caching is actually slightly faster:

no cache:
1000 batches: 2.434403038507444s (avg)
3286.22659 Images/s

cached:
1000 batches: 2.439824645938235s (avg)
3278.92417 Images/s

I know there are a few other ways to improve the performance, like parallel and vectorized mapping, but they shouldn't have any effect when comparing caching vs. no caching.
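For reference, parallel mapping would look something like this (num_parallel_calls is the only change to the map call in the class above):

train_hr_ds = train_list_ds.map(self.load_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)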

Can someone help me out on this? What am I missing here?

edit: In the comments, @Szymon Maszke suggested that I benchmark iteration over multiple epochs and actually feed the data to a network. I did that, but the cached and uncached datasets performed pretty much the same. I'm really not sure why.
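For reference, this is roughly what "feeding the data to a network" looked like (a toy stand-in model, not my real one; it assumes 3-channel images and reuses ds_train from above):

ds_train_xy = ds_train.map(lambda x: (x, x))  # autoencoder-style targets, just so fit() has labels
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(3, 3, padding="same", input_shape=(128, 128, 3)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(ds_train_xy, steps_per_epoch=1000, epochs=5)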

edit2: After fixing the mistake pointed out by @AAudibert, it's working as expected now. Actually, it's working even better than expected, to be honest:

no cache:
1000 batches: 2.624478972374927s (avg)
3048.22408 Images/s

cached:
1000 batches: 0.17946020061383025s (avg)
44578.12915 Images/s

Upvotes: 1

Views: 546

Answers (1)

AAudibert

Reputation: 1273

This statement does nothing:

if cache: self.train_ds.cache()

It should be:

if cache: train_hr_ds = train_hr_ds.cache()

Like other dataset transformations, cache returns a new dataset instead of modifying an existing dataset.
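For completeness, a minimal sketch of the corrected constructor tail (same names as in the question; shuffle is also moved after cache, as the comment in the question already suggests, so the cached data is reshuffled every epoch):

train_hr_ds = train_list_ds.map(self.load_img)
if cache:
    train_hr_ds = train_hr_ds.cache()  # cache decoded images once, after the expensive map
train_hr_ds = train_hr_ds.shuffle(shuffle_buffer_size)  # after cache, so each epoch sees a new order
self.train_ds = train_hr_ds.repeat(repeat)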

Upvotes: 3
