Reputation: 13
I have created tf record files that are stored in a Google Storage bucket, and I have code running on ML Engine that trains a model using the data in these tf records.
Each tf record file contains a batch of 20 examples and is approximately 8 MB (megabytes) in size. There are several thousand files in the bucket.
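For reference, here is a minimal sketch of the kind of writer that would produce files matching the parse spec further down; the helper name, the bucket path, and the batch_of_20_examples iterable are placeholders, not my actual code:

import tensorflow as tf

def make_example(value, label):
    # One example with a scalar float 'data' and an int64 'label',
    # mirroring the parse spec used in the reading code below.
    return tf.train.Example(features=tf.train.Features(feature={
        'data': tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

# Each shard holds a batch of 20 examples (~8 MB per file in my case).
with tf.python_io.TFRecordWriter('gs://my-bucket/shard-00000.tfrecord') as writer:
    for value, label in batch_of_20_examples:   # placeholder iterable
        writer.write(make_example(value, label).SerializeToString())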
My problem is that it literally takes forever to start the training. I have to wait about 40 minutes between the moment the package is loaded and the moment the training actually starts. I am guessing this is the time needed to download the data and fill the queues?
The code is (slightly simplified for sake of conciseness):
# Create a queue that will produce the tf record file names
filename_queue = tf.train.string_input_producer(files, num_epochs=num_epochs, capacity=100)

# Read one serialized record at a time from the files in the queue
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Decode the serialized example into its features
features = tf.parse_single_example(
    serialized_example,
    features={
        'data': tf.FixedLenFeature([], tf.float32),
        'label': tf.FixedLenFeature([], tf.int64)
    })

# Shuffle and batch the decoded examples
train_tensors = tf.train.shuffle_batch(
    [features['data'], features['label']],
    batch_size=30,
    capacity=600,
    min_after_dequeue=400,
    allow_smaller_final_batch=True,
    enqueue_many=True)
I have checked that my bucket and my job share the same region parameter.
I don't understand what is taking so long: it should just be a matter of downloading a few hundred MB (a few dozen tf record files should be enough to get more than min_after_dequeue elements into the queue).
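For concreteness, a rough back-of-the-envelope estimate using the numbers above (this assumes the only thing that has to happen before training starts is filling the shuffle buffer past min_after_dequeue):

min_after_dequeue = 400        # from the shuffle_batch call above
examples_per_file = 20         # each tf record file holds 20 examples
file_size_mb = 8               # ~8 MB per file

files_needed = min_after_dequeue / examples_per_file    # 20 files
data_needed_mb = files_needed * file_size_mb            # ~160 MB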
Any idea what I am missing, or where the problem might be?
Thanks
Upvotes: 0
Views: 146
Reputation: 13
Sorry, my bad. I was using a custom function to sanity-check the input files one by one before starting the training.
Turns out this is a very bad idea when dealing with thousands of files on gs://
I have removed this "sanity" check and it's working fine now.
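For anyone hitting the same issue, here is a sketch of the kind of per-file check that causes this delay, plus a cheaper alternative; the helper name and the glob pattern are illustrative, and my original check may have looked different:

import tensorflow as tf

# Per-file "sanity" check: one remote round-trip to gs:// per file,
# which adds up to a very long wait with thousands of files.
def check_files_exist(files):                # hypothetical helper
    return [f for f in files if tf.gfile.Exists(f)]

# Cheaper alternative: a listing call returns all matching files with
# far fewer remote requests than checking files one by one.
files = tf.gfile.Glob('gs://my-bucket/records/*.tfrecord')   # placeholder pattern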
Upvotes: 1