Rishabh Sahrawat

Reputation: 2507

Is it possible to shuffle the dataset using the index of its elements?

I am using tf.data.experimental.make_csv_dataset in TensorFlow (TF 1.14 and TF 2.0) to read a CSV file consisting of 3 columns: index, column1 and column2. Only column1 and column2 matter to me. Each element in column1 is an array of shape (1,4) and each element in column2 has shape (1,1). When I shuffle this dataset with its shuffle method, i.e. dataset.shuffle(buffer_size=some_number), the shuffling takes a long time and prints the message "Filling up shuffle buffer". My question is whether there is a way to shuffle the dataset using the indices of column1/column2 instead, since shuffling only the indices should be much faster.

Upvotes: 1

Views: 846

Answers (1)

Stewart_R

Reputation: 14495

My question is if there is a way to shuffle the dataset by using the indices of the column1/column2, because this might not take so much time for shuffling since it is only the indices

No, unfortunately not. Not in that way.

The reason is that a tf.data.Dataset object is inherently lazily loaded. This is deliberate: it can represent arbitrarily large (even infinite) datasets, so it wouldn't make sense to try to load everything into memory or do all the pre-processing up front.

This means that, whilst it would (of course) be feasible to read and shuffle the index, we could not then access the nth element of the original dataset (at least not cheaply).
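To make the cost concrete, here is a minimal sketch in plain Python (not TensorFlow). The generator lazy_rows is a hypothetical stand-in for a lazily read dataset, and nth_element is an illustrative helper; neither is part of any library. It shows that reaching row n of a stream means reading every row before it.

```python
from itertools import islice


def lazy_rows(n_rows):
    """Hypothetical stand-in for a lazily read dataset: yields one row at a time."""
    for i in range(n_rows):
        yield (i, f"row-{i}")


def nth_element(make_stream, n):
    """Return element n of a fresh stream and how many rows had to be read."""
    consumed = 0

    def counting():
        nonlocal consumed
        for row in make_stream():
            consumed += 1
            yield row

    # islice skips the first n rows, but it still has to pull them from the stream.
    row = next(islice(counting(), n, None))
    return row, consumed


row, rows_read = nth_element(lambda: lazy_rows(1000), 500)
print(row, rows_read)
```

Doing this for every position in a shuffled index would re-read the stream over and over, which is why an index-only shuffle does not help with a purely streamed dataset.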

It's worth mentioning that the shuffle buffer only needs to be filled once per pass over the data, so the delay happens at the start of training (and again at the start of each epoch if you reshuffle each epoch).
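As a rough illustration of why the wait comes up front, here is a plain-Python sketch of a buffer-based shuffle. It is an assumption-laden approximation of the idea, not TensorFlow's actual implementation, and buffered_shuffle is a made-up name. Nothing is emitted until the buffer holds buffer_size items; after that, each new input item displaces one randomly chosen buffered item.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=None):
    """Yield the items of `stream` in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Filling phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Steady state: emit a random buffered item and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

In this sketch a larger buffer gives a more thorough shuffle but a longer filling phase, and a buffer of size 1 leaves the order unchanged.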

A sensible workaround, which you may well have already considered, is to load the dataset once with the shuffle and then write it out somewhere (perhaps in TFRecord format) with all the rows pre-shuffled.
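One possible variant of that workaround, sketched with only the Python standard library and writing a CSV rather than a TFRecord: shuffle the rows once offline via a permutation of their positions, then point the input pipeline at the pre-shuffled file. This assumes the file has a header row and fits in memory; the function name and paths are hypothetical.

```python
import csv
import random


def write_preshuffled_csv(src_path, dst_path, seed=None):
    """Write a copy of the CSV at src_path with its data rows in random order.

    Assumes a header row and that all rows fit in memory.
    Returns the permutation of row positions that was applied.
    """
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)

    # Shuffle the row positions, then write the rows out in that order.
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)

    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows[i] for i in order)
    return order
```

The pre-shuffled file can then be read without a large shuffle buffer, at the cost of seeing the same row order every epoch unless you regenerate it.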

Upvotes: 1
