Reputation: 13
I have created tf record files that are stored in a Google Storage bucket, and I have code running on ML Engine that trains a model using the data in these tf records.
Each tf record file contains a batch of 20 examples and is approximately 8 MB (megabytes) in size. There are several thousand files in the bucket.
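For reference, here is a minimal sketch of the kind of writer that would produce files matching the parse spec further down; the helper name, the bucket path, and the batch_of_20_examples iterable are placeholders, not my actual code:

import tensorflow as tf

def make_example(value, label):
    # One example with a scalar float 'data' and an int64 'label',
    # mirroring the parse spec used in the reading code below.
    return tf.train.Example(features=tf.train.Features(feature={
        'data': tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

# Each shard holds a batch of 20 examples (~8 MB per file in my case).
with tf.python_io.TFRecordWriter('gs://my-bucket/shard-00000.tfrecord') as writer:
    for value, label in batch_of_20_examples:   # placeholder iterable
        writer.write(make_example(value, label).SerializeToString())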
My problem is that it literally takes forever to start the training. I have to wait about 40 minutes between the moment the package is loaded and the moment the training actually starts. I am guessing this is the time needed to download the data and fill the queues?
The code is (slightly simplified for sake of conciseness):
# Create a queue that will produce the tf record file names
filename_queue = tf.train.string_input_producer(files, num_epochs=num_epochs, capacity=100)

# Read one serialized record at a time from the files in the queue
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Decode the serialized example into its features
features = tf.parse_single_example(
    serialized_example,
    features={
        'data': tf.FixedLenFeature([], tf.float32),
        'label': tf.FixedLenFeature([], tf.int64)
    })

# Shuffle and batch the decoded examples
train_tensors = tf.train.shuffle_batch(
    [features['data'], features['label']],
    batch_size=30,
    capacity=600,
    min_after_dequeue=400,
    allow_smaller_final_batch=True,
    enqueue_many=True)
I have checked that my bucket and my job share the same region parameter.
I don't understand what is taking so long: it should just be a matter of downloading a few hundred MB (a few dozen tf record files should be enough to get more than min_after_dequeue elements into the queue).
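For concreteness, a rough back-of-the-envelope estimate using the numbers above (this assumes the only thing that has to happen before training starts is filling the shuffle buffer past min_after_dequeue):

min_after_dequeue = 400        # from the shuffle_batch call above
examples_per_file = 20         # each tf record file holds 20 examples
file_size_mb = 8               # ~8 MB per file

files_needed = min_after_dequeue / examples_per_file    # 20 files
data_needed_mb = files_needed * file_size_mb            # ~160 MB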
Any idea what I am missing, or where the problem might be?
Thanks
Upvotes: 0
Views: 146
Reputation: 13
Sorry, my bad. I was using a custom function to sanity-check the input files one by one before starting the training.
Turns out this is a very bad idea when dealing with thousands of files on gs://
I have removed this "sanity" check and it's working fine now.
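For anyone hitting the same issue, here is a sketch of the kind of per-file check that causes this delay, plus a cheaper alternative; the helper name and the glob pattern are illustrative, and my original check may have looked different:

import tensorflow as tf

# Per-file "sanity" check: one remote round-trip to gs:// per file,
# which adds up to a very long wait with thousands of files.
def check_files_exist(files):                # hypothetical helper
    return [f for f in files if tf.gfile.Exists(f)]

# Cheaper alternative: a listing call returns all matching files with
# far fewer remote requests than checking files one by one.
files = tf.gfile.Glob('gs://my-bucket/records/*.tfrecord')   # placeholder pattern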
Upvotes: 1