Reputation: 107
I am using the relatively new tf.slim Dataset, DatasetDataProvider pattern. The following code shows the key fragments:
with tf.Graph().as_default():
    # get the dataset split
    dataset = util.get_split(train_or_eval,
                             args.tfrecord_folder,
                             0,
                             args.eval_set_size,
                             crop_size,
                             file_pattern=file_pattern)
    features, labels = util.load_batch(dataset,
                                       batch_size=args.eval_batch_size,
                                       num_readers=10,
                                       num_epochs=1,
                                       is_training=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        # start the queue runners
        with slim.queues.QueueRunners(sess):
            ...run some ops...
Here's the definition of load_batch:
def load_batch(dataset, batch_size=64, is_training=False,
               num_epochs=None, common_queue_capacity=256,
               common_queue_min=32, num_readers=None):
    shuffle = True
    # create the data provider
    data_provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        num_readers=num_readers,
        shuffle=shuffle,
        num_epochs=num_epochs,
        common_queue_capacity=common_queue_capacity,
        common_queue_min=common_queue_min,
        seed=5)
    # get the tensors from the data provider
    images, labels = data_provider.get(['image_raw', 'label'])
    # batch up some training data
    images, labels = tf.train.batch([images, labels],
                                    batch_size=batch_size,
                                    num_threads=5,
                                    allow_smaller_final_batch=True,
                                    capacity=2 * batch_size)
    return images, labels
This works fine when num_epochs=None (which according to the comments in the source means that a file of tfrecords can be read an infinite number of times), but fails when num_epochs=1. Here's the error message:
Out of range: FIFOQueue '_9_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
Obviously, I need to be able to run an eval step without repeating the examples to get good accuracy and confusion matrix numbers. Any thoughts would be appreciated...
Per the request in the comments, I am adding the stack trace. I am running this job in Google Cloud ML, so it's easiest to show it this way. The logs have a series of paired messages as follows:
Out of range: FIFOQueue '_6_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)[[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
The final stack trace is:
"The replica master 0 exited with a non-zero status of 1. Termination reason: Error.Traceback (most recent call last): [...] File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 509, in main() File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 505, in main run() File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 113, in run run_eval(args) File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 285, in run_eval is_training=True) File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 210, in load_batch capacity=3 * batch_size) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 872, in batch name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 665, in _batch dequeued = queue.dequeue_up_to(batch_size, name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 499, in dequeue_up_to self._queue_ref, n=n, component_types=self._dtypes, name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1402, in _queue_dequeue_up_to_v2 timeout_ms=timeout_ms, name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op op_def=op_def) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op original_op=self._default_original_op, op_def=op_def) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in init self._traceback = _extract_stack()
OutOfRangeError (see above for traceback): FIFOQueue '_6_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0) [[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]] To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?...
Upvotes: 0
Views: 1583
Reputation: 107
After extensive study and reading on GitHub, I found that many reported eliminating this issue simply by making sure the initializers for local and global variables are run at the top of the session, using something like the following:
tf.group(tf.local_variables_initializer(), tf.global_variables_initializer())
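For reference, a minimal sketch of that pattern in context (init_op is my own name; otherwise it mirrors the session setup from the question; num_epochs is tracked in a local variable, which is why the local initializer matters):

import tensorflow as tf
slim = tf.contrib.slim

init_op = tf.group(tf.local_variables_initializer(),
                   tf.global_variables_initializer())

with tf.Session() as sess:
    sess.run(init_op)  # run once, before the queue runners start
    with slim.queues.QueueRunners(sess):
        pass  # ...run eval ops here...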
However, that did not fix the issue for many (including me), and I suspect that, for those it did work for, there were other problems leading to an empty FIFO queue.
After much reading, it appears that this is a defect for which there is no obvious fix. Several workarounds have been proposed. I was running a full cycle of train, eval, and predict. Here is the approach that worked for me:
1) On training, I set num_epochs=None. This cycles through the data an infinite number of times and, if the documentation is correct, each example is presented only once per epoch. I did some spot checking to confirm this, but my dataset was too large to guarantee the docs are correct. That said, my model did not overfit. Train, test, and validation were all reasonably close in terms of accuracy.
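A minimal sketch of the training-side call, assuming the load_batch helper shown above (train_batch_size is a placeholder name for whatever flag value you use):

# training: num_epochs=None lets the readers cycle through the tfrecords indefinitely
images, labels = load_batch(dataset,
                            batch_size=train_batch_size,  # placeholder name
                            num_readers=10,
                            num_epochs=None,              # repeat the data without limit
                            is_training=True)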
2) On eval, I was building a 15-model ensemble and I wanted to compare the proposal selections to ground truth before submitting unlabeled data for validation. I kept an extra hold-out set from a k-fold cross-validation run and needed to be sure that each example in the hold-out set was predicted once and only once. To make that work, I:
a) set num_epochs=1,
b) eliminated all calculations from the eval graph except the prediction,
c) reduced the size of the eval set to ~3000 examples,
d) set shuffle_batch=False,
e) set the batch size so that the queue would have a few extra examples (see the sketch below).
With these conditions, the queue runners did not run out of examples before my graph completed, and I got my test set results.
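A sketch of the eval-side call with those settings. This assumes load_batch has been extended to accept a shuffle argument (the version shown above hard-codes shuffle = True), and eval_batch_size is a placeholder name:

# eval: read the hold-out set exactly once, in a fixed order
images, labels = load_batch(dataset,
                            batch_size=eval_batch_size,  # sized so the queue keeps a few spares
                            num_readers=10,
                            num_epochs=1,                # single pass over the tfrecords
                            shuffle=False,               # deterministic order for the hold-out set
                            is_training=False)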
3) On predict, I used the same technique as for eval, except that I chose a batch size and a number of steps whose product was exactly equal to the number of predict records. Since there was no gradient back-prop, the predictions were fast enough to finish before the queue runner could kill my job.
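The sizing arithmetic for the predict run, sketched with hypothetical numbers:

# predict: make batch_size * num_steps equal the record count exactly,
# so every record is consumed once and the queue never comes up short mid-run
num_predict_records = 3000                               # hypothetical size of the predict set
predict_batch_size = 50                                  # chosen to divide the record count evenly
num_steps = num_predict_records // predict_batch_size    # -> 60 steps
assert predict_batch_size * num_steps == num_predict_records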
Problem solved. Jury-rigged, but it worked. Desperation is the mother of ingenuity, or something like that!
Upvotes: 0