Reputation: 107
I am using the relatively new tf.slim Dataset, DatasetDataProvider pattern. The following code shows the key fragments:
with tf.Graph().as_default():
    # get the dataset split
    dataset = util.get_split(train_or_eval,
                             args.tfrecord_folder,
                             0,
                             args.eval_set_size,
                             crop_size,
                             file_pattern=file_pattern)
    features, labels = util.load_batch(dataset,
                                       batch_size=args.eval_batch_size,
                                       num_readers=10,
                                       num_epochs=1,
                                       is_training=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        # start the queue runners
        with slim.queues.QueueRunners(sess):
            ...run some ops...
Here's the definition of load_batch:
def load_batch(dataset, batch_size=64, is_training=False,
               num_epochs=None, common_queue_capacity=256,
               common_queue_min=32, num_readers=None):
    shuffle = True
    # create the data provider
    data_provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        num_readers=num_readers,
        shuffle=shuffle,
        num_epochs=num_epochs,
        common_queue_capacity=common_queue_capacity,
        common_queue_min=common_queue_min,
        seed=5)
    # get the tensors from the data provider
    images, labels = data_provider.get(['image_raw', 'label'])
    # batch up some training data
    images, labels = tf.train.batch([images, labels],
                                    batch_size=batch_size,
                                    num_threads=5,
                                    allow_smaller_final_batch=True,
                                    capacity=2 * batch_size)
    return images, labels
This works fine when num_epochs=None (which according to the comments in the source means that a file of tfrecords can be read an infinite number of times), but fails when num_epochs=1. Here's the error message:
Out of range: FIFOQueue '_9_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
Obviously, I need to be able to run an eval step without repeating the examples to get good accuracy and confusion matrix numbers. Any thoughts would be appreciated...
Per the request in the comments, I am adding the stack trace. I am running this job in Google Cloud ML, so it's easiest to show it this way. The logs have a series of paired messages as follows:
Out of range: FIFOQueue '_6_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)[[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
The final stack trace is:
"The replica master 0 exited with a non-zero status of 1. Termination reason: Error.Traceback (most recent call last): [...] File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 509, in main() File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 505, in main run() File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 113, in run run_eval(args) File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 285, in run_eval is_training=True) File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 210, in load_batch capacity=3 * batch_size) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 872, in batch name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 665, in _batch dequeued = queue.dequeue_up_to(batch_size, name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 499, in dequeue_up_to self._queue_ref, n=n, component_types=self._dtypes, name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1402, in _queue_dequeue_up_to_v2 timeout_ms=timeout_ms, name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op op_def=op_def) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op original_op=self._default_original_op, op_def=op_def) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in init self._traceback = _extract_stack()
OutOfRangeError (see above for traceback): FIFOQueue '_6_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0) [[Node: batch = QueueDequeueUpToV2[component_types=[DT_UINT8, DT_INT64, DT_STRING, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]] To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?...
Upvotes: 0
Views: 1583
Reputation: 107
After extensive study and reading on GitHub, I found that many reported eliminating this issue simply by making sure the initializers for local and global variables are run at the top of the session, using something like the following:
tf.group(tf.local_variables_initializer(), tf.global_variables_initializer())
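For reference, a minimal sketch of that pattern in context (init_op is my own name; otherwise it mirrors the session setup from the question; num_epochs is tracked in a local variable, which is why the local initializer matters):

import tensorflow as tf
slim = tf.contrib.slim

init_op = tf.group(tf.local_variables_initializer(),
                   tf.global_variables_initializer())

with tf.Session() as sess:
    sess.run(init_op)  # run once, before the queue runners start
    with slim.queues.QueueRunners(sess):
        pass  # ...run eval ops here...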
However, that did not fix the issue for many (including me), and I suspect that, for those it did work for, there were other problems leading to an empty FIFO queue.
After much reading, it appears that this is a defect for which there is no obvious fix. Several workarounds have been proposed. I was running a full cycle of train, eval, and predict. Here is the approach that worked for me:
1) On training, I set num_epochs=None. This cycles through the data an infinite number of times and, if the documentation is correct, each example is presented only once per epoch. I did some spot checking to confirm this, but my dataset was too large to guarantee the docs are correct. That said, my model did not overfit. Train, test, and validation were all reasonably close in terms of accuracy.
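A minimal sketch of the training-side call, assuming the load_batch helper shown above (train_batch_size is a placeholder name for whatever flag value you use):

# training: num_epochs=None lets the readers cycle through the tfrecords indefinitely
images, labels = load_batch(dataset,
                            batch_size=train_batch_size,  # placeholder name
                            num_readers=10,
                            num_epochs=None,              # repeat the data without limit
                            is_training=True)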
2) On eval, I was building a 15-model ensemble and I wanted to compare the proposal selections to ground truth before submitting unlabeled data for validation. I kept an extra hold-out set from a k-fold cross-validation run and needed to be sure that each example in the hold-out set was predicted once and only once. To make that work, I:
a) set num_epochs=1,
b) eliminated all calculations from the eval graph except the prediction,
c) reduced the size of the eval set to ~3000 examples,
d) set shuffle_batch=False,
e) set the batch size so that the queue would have a few extra examples (see the sketch below).
With these conditions, the queue runners did not run out of examples before my graph completed, and I got my test set results.
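A sketch of the eval-side call with those settings. This assumes load_batch has been extended to accept a shuffle argument (the version shown above hard-codes shuffle = True), and eval_batch_size is a placeholder name:

# eval: read the hold-out set exactly once, in a fixed order
images, labels = load_batch(dataset,
                            batch_size=eval_batch_size,  # sized so the queue keeps a few spares
                            num_readers=10,
                            num_epochs=1,                # single pass over the tfrecords
                            shuffle=False,               # deterministic order for the hold-out set
                            is_training=False)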
3) On predict, I used the same technique as for eval, except that I chose a batch size and a number of steps whose product was exactly equal to the number of predict records. Since there was no gradient back-prop, the predictions were fast enough to finish before the queue runner could kill my job.
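The sizing arithmetic for the predict run, sketched with hypothetical numbers:

# predict: make batch_size * num_steps equal the record count exactly,
# so every record is consumed once and the queue never comes up short mid-run
num_predict_records = 3000                               # hypothetical size of the predict set
predict_batch_size = 50                                  # chosen to divide the record count evenly
num_steps = num_predict_records // predict_batch_size    # -> 60 steps
assert predict_batch_size * num_steps == num_predict_records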
Problem solved. Jury-rigged, but it worked. Desperation is the mother of ingenuity, or something like that!
Upvotes: 0