John Hyatt

Reputation: 21

tensorflow datasets: more efficient to vectorize with unbatching (batch -> map -> unbatch) or just map?

The TensorFlow documentation recommends batching datasets before applying map transformations, so that the transformation is vectorized (applied to a whole batch at once) and per-element overhead is reduced: https://www.tensorflow.org/guide/data_performance#vectorizing_mapping

However, there are cases where you want to perform transformations on the dataset and then do something (e.g., shuffle) on the UNBATCHED dataset.

I haven't been able to find anything to indicate which is more efficient:

1) dataset.map(my_transformations)

2) dataset.batch(batch_size).map(my_transformations).unbatch()

(2) reduces per-call map overhead thanks to vectorization, but adds the overhead of having to unbatch the dataset afterward.
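For concreteness, here is a minimal runnable sketch of the two pipelines; the transformation body, dataset, and batch size are placeholders of my own, not anything specific:

```python
import tensorflow as tf

# Hypothetical stand-in for my_transformations; assumed to work both
# per-element and on a whole batch (true of most elementwise TF ops).
def my_transformations(x):
    return x * 2.0

dataset = tf.data.Dataset.range(10_000).map(lambda x: tf.cast(x, tf.float32))

# Option (1): plain per-element map
ds1 = dataset.map(my_transformations, num_parallel_calls=tf.data.AUTOTUNE)

# Option (2): vectorize via batch, then unbatch to recover elements
ds2 = (dataset
       .batch(256)
       .map(my_transformations, num_parallel_calls=tf.data.AUTOTUNE)
       .unbatch())
```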

It may also be that there is no universal rule. Short of benchmarking every new dataset, transformation, or piece of hardware, is there a good rule of thumb here? I have seen several examples online use (2) without explanation, but I have no intuition for this, so any guidance is appreciated.

Thanks in advance!

EDIT: I have since found that in at least some cases, (2) is MUCH less efficient than (1). For example, on our image dataset, applying random flips and rotations per epoch for data augmentation (with .map and the built-in TF functions tf.image.random_flip_left_right, tf.image.random_flip_up_down, and tf.image.rot90) takes 50% longer with (2). I still have no idea when to expect this to be the case, but the tutorials' suggested approach is at least sometimes wrong.
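For reference, a minimal sketch of the approach-(1) version of that augmentation pipeline; the synthetic images tensor, batch size, and prefetch setting are illustrative assumptions, not details from the timing above:

```python
import tensorflow as tf

# Approach (1) applied to the augmentations named above. rot90 with a
# random k in [0, 4) gives a random multiple-of-90-degree rotation.
def augment(image):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    return tf.image.rot90(image, k=k)

# Synthetic stand-in data: 64 RGB images of 32x32 (not the dataset
# from the benchmark above).
images = tf.random.uniform((64, 32, 32, 3))

ds = (tf.data.Dataset.from_tensor_slices(images)
      .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(16)
      .prefetch(tf.data.AUTOTUNE))
```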

Upvotes: 1

Views: 828

Answers (1)

John Hyatt

Reputation: 21

The answer is (1). https://github.com/tensorflow/tensorflow/issues/40386

The TensorFlow team is updating the documentation to reflect that the overhead from unbatch will usually (always?) outweigh the savings from vectorizing the transformation.

Upvotes: 1
