Reputation: 354
I have a dataset of around 1M examples. I wrote each example to a separate .tfrecord file, which resulted in around 500GB sitting in some network location.
Reading many small files from this network location is extremely slow, so I'm thinking about grouping around 100 examples into each .tfrecord file.
I'm worried, though, that examples from the same .tfrecord file will then always appear in the same minibatch (or in consecutive minibatches), which would hurt the mixing of training data I want.
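For reference, grouping examples into shards could be sketched like this (the `write_shards` helper and output naming are illustrative, not from the question; `examples` is assumed to be a list of `tf.train.Example` protos):

```python
def chunk(seq, size):
    # Split a sequence into consecutive groups of at most `size` items.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def write_shards(examples, out_dir, shard_size=100):
    # TF1 API; imported here so the chunking helper works without TF.
    import tensorflow as tf
    for i, group in enumerate(chunk(examples, shard_size)):
        path = "%s/shard-%05d.tfrecord" % (out_dir, i)
        with tf.python_io.TFRecordWriter(path) as writer:
            for example in group:
                writer.write(example.SerializeToString())
```

With 1M examples and `shard_size=100`, this produces 10k shard files instead of 1M tiny ones.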
My input pipeline is the following: I have a tf.train.string_input_producer(files, capacity=100000) for the filenames queue, use TFRecordReader.read to read from the filenames queue, and use tf.train.batch, which creates an examples queue and returns a batch from it using dequeue_many.
I fear that once the filenames queue dequeues a filename, all examples from that file will be read and enqueued into the examples FIFO queue created by tf.train.batch, which will result in the same examples being in the same minibatches over and over.
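A minimal TF1-style sketch of the pipeline described above (the feature spec inside parse_single_example is a placeholder; adjust it to your data):

```python
def build_pipeline(files, batch_size=32):
    # TF1 queue-based input pipeline, as described in the question.
    import tensorflow as tf
    filename_queue = tf.train.string_input_producer(files, capacity=100000)
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized,
        features={"image": tf.FixedLenFeature([], tf.string),
                  "label": tf.FixedLenFeature([], tf.int64)})
    # tf.train.batch builds a plain FIFO examples queue: examples come
    # out in the order they were enqueued, so examples read from one
    # file stay next to each other.
    return tf.train.batch([features["image"], features["label"]],
                          batch_size=batch_size, capacity=10000)
```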
Is it really going to put the same examples in the same minibatch over and over? If so, should I create a shuffle queue for the examples instead of using tf.train.batch?
Upvotes: 0
Views: 1189
Reputation: 222761
One of the points of TFRecord is to store many examples in the same file, to overcome the overhead of opening/closing many files. So your approach of one .tfrecord file per example does not make sense. You could even put all examples in one file, or have 10k per file. Regarding shuffling: there are two types of shuffling, which serve different purposes and shuffle different things:

tf.train.string_input_producer has a shuffle parameter: "Boolean. If true, the strings are randomly shuffled within each epoch." So if you have a few files ['file1', 'file2', ..., 'filen'], this randomly selects a file from the list; if false, the files follow one after another.

tf.train.shuffle_batch creates batches by randomly shuffling tensors. It takes batch_size tensors from your examples queue (you will need to start the queue's threads with tf.train.start_queue_runners) and shuffles them.

Upvotes: 3
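The two shuffles from the answer can be combined in one sketch (the feature spec is a placeholder; capacity and min_after_dequeue values are illustrative):

```python
def build_shuffled_pipeline(files, batch_size=32):
    # TF1 queue-based pipeline using both kinds of shuffling.
    import tensorflow as tf
    # 1) shuffle=True reshuffles the file order within each epoch.
    filename_queue = tf.train.string_input_producer(files, shuffle=True)
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized,
        features={"image": tf.FixedLenFeature([], tf.string),
                  "label": tf.FixedLenFeature([], tf.int64)})
    # 2) shuffle_batch keeps a buffer of at least min_after_dequeue
    # examples and samples batches from it, mixing examples across files.
    return tf.train.shuffle_batch(
        [features["image"], features["label"]],
        batch_size=batch_size, capacity=20000, min_after_dequeue=10000)

# At run time the queue-runner threads still have to be started:
#   with tf.Session() as sess:
#       coord = tf.train.Coordinator()
#       threads = tf.train.start_queue_runners(sess=sess, coord=coord)
#       ...  # sess.run(batch) in a training loop
#       coord.request_stop()
#       coord.join(threads)
```

A larger min_after_dequeue gives better mixing across files at the cost of memory and startup time.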