I.D.M

Reputation: 67

Tensorflow Dataset API - explanation of behavior

Using the code below, I would like to ask a few questions about what exactly is happening underneath.

dataset = tf.data.TFRecordDataset(filepath)
dataset = dataset.map(parse_function, num_parallel_calls=4)
dataset = dataset.repeat()
dataset = dataset.shuffle(1024)
dataset = dataset.batch(16)
iterator = dataset.make_one_shot_iterator()

1. dataset.map(parse_function, num_parallel_calls=4) - How many records are we loading here? As many as will fit in memory, or some fixed number?

2. dataset = dataset.repeat() - What exactly do we repeat? The piece of data currently loaded in point 1? If so, does that mean we will never load the others?

3. How exactly does shuffle work?

4. Can we use repeat, shuffle and batch before map, and work on file paths instead of the files themselves?

Upvotes: 0

Views: 1336

Answers (2)

Matěj Račinský

Reputation: 1804

  1. Data in the Dataset API is lazily loaded, so what happens depends on the later operations. Here, 1024 samples are loaded at a time because that is the size of the shuffle buffer, which needs to be filled. After that, data is loaded lazily as you fetch values from the iterator.
  2. You repeat the already-parsed data, because the repeat comes after the map function. This is why it's advised to shuffle before parsing the data: it's more memory friendly.
  3. shuffle loads some data (as much as the shuffle buffer holds) and shuffles it.
  4. Yes, you can repeat, shuffle and then map; this is even advised in the performance guide. There is also a function that merges repeat and shuffle together.
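To make point 3 concrete, here is a pure-Python sketch of how a buffer-based shuffle like `Dataset.shuffle(buffer_size)` behaves conceptually (this is an illustration, not TensorFlow's actual implementation): only `buffer_size` elements are held in memory, so the result is only as random as the buffer is large.

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items in approximately random order using a fixed-size buffer,
    mimicking the idea behind Dataset.shuffle(buffer_size)."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Buffer is full: emit one random element, keep filling.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)  # same elements as range(10), but only locally shuffled
```

With a small buffer, an element can only move a limited distance from its original position; that is why setting the buffer to the dataset length (as the other answer suggests) gives a full shuffle.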

Upvotes: 1

Sharky

Reputation: 4533

  1. Here you're loading the entire dataset. It's usually not a good idea to apply map prior to batch. TensorFlow has a hard 2 GB limit on tensor size. num_parallel_calls is the number of map functions applied in parallel.
  2. dataset.repeat() without a specified epoch count will repeat the dataset indefinitely.
  3. shuffle randomly shuffles the dataset with the specified buffer size. To shuffle properly, it's usually good to set this value to the dataset length and to apply this function prior to batch.
  4. tf.data.TFRecordDataset expects filenames as input. Generally, the preferred order is

    dataset = dataset.shuffle(shuffle_buffer).repeat()
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(map_func)
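Both answers rely on the pipeline being lazy. A plain-Python generator sketch (not TensorFlow internals; `parse_function` and the record values are stand-ins) shows both the laziness and why repeat-after-map replays already-parsed data:

```python
import itertools

parse_calls = []

def parse_function(record):
    # Stand-in for a TFRecord parser; runs only when a value is pulled.
    parse_calls.append(record)
    return record * 2

records = range(4)                       # stand-in for the file contents
pipeline = map(parse_function, records)  # lazy, like dataset.map(...)
pipeline = itertools.cycle(pipeline)     # like dataset.repeat()

def batch(it, size):
    """Group an iterator into fixed-size lists, like dataset.batch(size)."""
    it = iter(it)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

batches = batch(pipeline, 2)
print(next(batches), parse_calls)  # prints [0, 2] [0, 1]: parsing happens
                                   # only when the first batch is pulled
```

Note that `itertools.cycle` caches what it has seen, so the second pass replays parsed results without calling `parse_function` again. That mirrors the repeat-after-map behaviour described in the other answer, and is why shuffling and repeating the raw records before parsing is more memory friendly.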
    

Take a look at https://www.tensorflow.org/guide/performance/datasets

Upvotes: 0
