denru

Reputation: 199

Code order influences the final result

I have run into a problem where the order of the code changes the final result. At first the code works; after I move one line, TensorFlow raises an error.

For example,

Working version:

probs = net.get_output()
label_node = tf.placeholder(tf.int32, name='label_node')
top_1_op = tf.nn.in_top_k(probs, label_node, 1)
top_5_op = tf.nn.in_top_k(probs, label_node, 5)
threads = image_producer.start(session=sess, coordinator=coordinator)
for (labels, images) in image_producer.batches(sess):
    top_1_result, top_5_result = sess.run([top_1_op, top_5_op],
                                    feed_dict={input_node: images, label_node: labels})  

Non-working version:

threads = image_producer.start(session=sess, coordinator=coordinator)   # move here
probs = net.get_output()
label_node = tf.placeholder(tf.int32, name='label_node')
top_1_op = tf.nn.in_top_k(probs, label_node, 1)
top_5_op = tf.nn.in_top_k(probs, label_node, 5)
for (labels, images) in image_producer.batches(sess):
    top_1_result, top_5_result = sess.run([top_1_op, top_5_op],
                                    feed_dict={input_node: images, label_node: labels})

TensorFlow raises the following error:

"tensorflow.python.framework.errors.NotFoundError: FeedInputs: unable to find feed output label_node:0".  

As far as I can tell, TensorFlow should be able to find "label_node:0". In fact, it cannot find top_1_op or top_5_op either.
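One check I can think of (a rough sketch, just reusing the names from the snippets above) is to ask whether the nodes exist in the Python-level graph at all, to narrow down whether the problem is in graph construction or inside the session:

# Rough debugging sketch: inspect the graph the session was created with.
graph = sess.graph

# Raises KeyError if the tensor was never added to this graph.
tensor = graph.get_tensor_by_name('label_node:0')
print('found in Python graph:', tensor)

# List every op name, to look for top_1_op / top_5_op as well.
for op in graph.get_operations():
    print(op.name)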

The content of image_producer.start is something similar to:

op_A = ...
queue_runner = tf.train.QueueRunner(queue_B, [op_B] * num_concurrent)
session.run(op_A)
t = queue_runner.create_threads(session, coord=coordinator, start=True)

Even stranger: in the non-working version, if I add two lines inside image_producer.start, the code works again. For example, image_producer.start becomes:

op_C = ...  # new
session.run(op_C)  # new
op_A = ...
queue_runner = tf.train.QueueRunner(queue_B, [op_B] * num_concurrent)
session.run(op_A)
t = queue_runner.create_threads(session, coord=coordinator, start=True)

Does anyone have an idea what might cause this problem, or how to debug it?

Upvotes: 0

Views: 75

Answers (1)

mrry

Reputation: 126154

It sounds like you are hitting a bug that was fixed after TensorFlow 0.9.0 was released. In that version (and earlier), TensorFlow had a race condition that could lead to unrecoverable errors if you modified the graph after queue runners (or other threads calling sess.run()) had started. The only workaround in version 0.9.0 is to start the queue runners (i.e. the image_producer in your code) after the graph has been completely constructed.
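In other words, the fix is purely a matter of ordering: finish adding every op you will feed or fetch, and only then call image_producer.start (or anything else that spawns threads calling sess.run()). A minimal sketch of that ordering, reusing the names from your question:

# 1. Build the complete graph first.
probs = net.get_output()
label_node = tf.placeholder(tf.int32, name='label_node')
top_1_op = tf.nn.in_top_k(probs, label_node, 1)
top_5_op = tf.nn.in_top_k(probs, label_node, 5)

# 2. Only once no more nodes will be added, start the queue-runner threads.
threads = image_producer.start(session=sess, coordinator=coordinator)

# 3. Now it is safe to run and feed the ops built in step 1.
for (labels, images) in image_producer.batches(sess):
    top_1_result, top_5_result = sess.run(
        [top_1_op, top_5_op],
        feed_dict={input_node: images, label_node: labels})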

Upvotes: 1
