cruvadom

Reputation: 354

Mixing properties of multiple records in one .tfrecords file

I have a dataset of around 1M examples. I wrote each example to a separate .tfrecord file, which resulted in around 500GB sitting in some network location.

Reading multiple small files from this network location is extremely slow, so I'm thinking about grouping around 100 examples into one .tfrecord file.
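Grouping them should just be a matter of writing several serialized examples through one TFRecordWriter before rotating to the next file. Something like this sketch (TF 1.x API; the file name pattern and the assumption that I already have tf.train.Example protos are placeholders, not my actual code):

    import tensorflow as tf

    EXAMPLES_PER_FILE = 100  # the group size I have in mind

    def write_grouped_records(examples, out_pattern='train-%05d.tfrecord'):
        # `examples` is an iterable of tf.train.Example protos.
        writer, file_idx = None, 0
        for i, example in enumerate(examples):
            if i % EXAMPLES_PER_FILE == 0:
                if writer is not None:
                    writer.close()
                writer = tf.python_io.TFRecordWriter(out_pattern % file_idx)
                file_idx += 1
            writer.write(example.SerializeToString())
        if writer is not None:
            writer.close()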

I am worried, though, that examples from the same .tfrecord file will always appear in the same minibatch (or in consecutive minibatches), which would hurt the proper mixing of training data I want to have.

My input pipeline is the following: a tf.train.string_input_producer(files, capacity=100000) for the filename queue, TFRecordReader.read to read from that queue, and tf.train.batch, which creates an examples queue and returns a batch from it using dequeue_many.
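Roughly, the pipeline looks like this (a sketch using the TF 1.x queue API; the glob pattern, feature spec, and batch size are placeholders, not my actual values):

    import tensorflow as tf

    files = tf.gfile.Glob('/network/location/*.tfrecord')  # placeholder pattern
    filename_queue = tf.train.string_input_producer(files, capacity=100000)

    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized,
        features={'data': tf.FixedLenFeature([], tf.string),
                  'label': tf.FixedLenFeature([], tf.int64)})

    # tf.train.batch feeds a plain FIFO examples queue, so examples keep
    # roughly the order in which the reader produced them.
    data_batch, label_batch = tf.train.batch(
        [features['data'], features['label']], batch_size=32, capacity=1000)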

I fear that once the filenames queue dequeues a filename, all examples from it will be read and enqueued into the examples FIFO queue created by tf.train.batch, which will result in the same examples being in the same minibatches over and over.

Is it really going to put the same examples in the same minibatch over and over? If so, should I create a shuffle queue for the examples instead of using tf.train.batch?

Upvotes: 0

Views: 1189

Answers (1)

Salvador Dali

Reputation: 222761

One of the points of TFRecord is to store many examples in the same file to overcome the problem of opening/closing many files. So your approach of one tfrecord file per example does not make sense. You can even put all examples in one file, or have 10k per file. Regarding shuffling: there are two types of shuffling, which serve different purposes and shuffle different things:

  • tf.train.string_input_producer, whose shuffle argument is documented as "Boolean. If true, the strings are randomly shuffled within each epoch." So if you have a few files ['file1', 'file2', ..., 'filen'], this randomly selects a file from the list; if false, the files follow one after the other.
  • tf.train.shuffle_batch, which "Creates batches by randomly shuffling tensors." So it takes batch_size tensors from your examples queue (the queue runners still have to be started with tf.train.start_queue_runners) and shuffles them; see the sketch below.
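A minimal sketch combining both kinds of shuffling (TF 1.x queue API; the file names, the choice to batch the raw serialized records, and the numeric parameters are placeholders chosen for illustration):

    import tensorflow as tf

    files = ['file1.tfrecord', 'file2.tfrecord']  # placeholder file list

    # 1) File-level shuffling: the filename queue is reshuffled every epoch.
    filename_queue = tf.train.string_input_producer(files, shuffle=True)

    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)

    # 2) Example-level shuffling: shuffle_batch maintains a RandomShuffleQueue
    # and draws batch_size elements at random once min_after_dequeue examples
    # have been buffered.
    serialized_batch = tf.train.shuffle_batch(
        [serialized], batch_size=32, capacity=10000, min_after_dequeue=5000)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        records = sess.run(serialized_batch)  # one shuffled minibatch
        coord.request_stop()
        coord.join(threads)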

Upvotes: 3
