Reputation: 2507
I am using tf.data.experimental.make_csv_dataset in TensorFlow (TF 1.14 and TF 2.0) to read a CSV file consisting of 3 columns: index, column1, and column2. Only column1 and column2 matter to me. Each element in column1 is an array of shape (1, 4) and each element in column2 has shape (1, 1). When I call dataset.shuffle(buffer_size=some_number) on this dataset, the shuffling takes a long time, with the message "Filling up shuffle buffer".
My question is: is there a way to shuffle the dataset using only the indices of column1/column2? That might take much less time, since only the indices would need to be shuffled.
Upvotes: 1
Views: 846
Reputation: 14495
My question is: is there a way to shuffle the dataset using only the indices of column1/column2? That might take much less time, since only the indices would need to be shuffled.
No, unfortunately not. At least not in that way.
The reason is that a tf.data.Dataset object is inherently lazily loaded. This is deliberate: because a dataset can represent arbitrarily large (even infinite) data, it wouldn't make sense to load it all into memory or do all the pre-processing up front.
This means that, whilst it would of course be feasible to read and shuffle the index column, we could not then access the nth element of the original dataset (at least not cheaply).
It's worth mentioning that the shuffle buffer only needs to be filled once, so the delay will only happen at the start of training (and at the start of each epoch, if reshuffling each epoch).
A sensible workaround, which you may well have already considered, is to load the dataset once, shuffle it, and then write it out somewhere (perhaps in TFRecord format) with all the rows pre-shuffled.
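A minimal sketch of the pre-shuffle idea, using only the Python standard library rather than TFRecords: shuffle the rows of the CSV once up front, so every later training run can read the file sequentially without paying the shuffle-buffer cost. The function name and paths are illustrative, and this in-memory approach assumes the file fits in RAM; for larger data the same idea applies, but you would shard the shuffled output (e.g. into TFRecord files) instead.

```python
import csv
import random

def preshuffle_csv(src_path, dst_path, seed=None):
    """Read all rows of a CSV, shuffle them once, and write them back out.

    The header row is kept in place; only the data rows are permuted.
    """
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)   # preserve the header row
        rows = list(reader)     # load all data rows into memory

    random.Random(seed).shuffle(rows)  # one up-front shuffle

    with open(dst_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

After this one-time step, make_csv_dataset can read the pre-shuffled file directly, and you can drop (or greatly shrink) the dataset.shuffle buffer.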
Upvotes: 1