ROBOT AI

Reputation: 1227

Distributed TensorFlow: about the use of tf.train.Supervisor.start_queue_runners

I'm looking into the code for the distributed Inception model in TensorFlow, and I have the following questions about the use of tf.train.Supervisor.start_queue_runners in inception_distributed_train.py:

  1. Why do we need to call sv.start_queue_runners() explicitly on line 264 and line 269 of inception_distributed_train.py? The API doc of start_queue_runners suggests such calls are unnecessary:

    Note that the queue runners collected in the graph key QUEUE_RUNNERS are already started automatically when you create a session with the supervisor, so unless you have non-collected queue runners to start you do not need to call this explicitly.

  2. I noticed that the queue_runners passed to sv.start_queue_runners differ between line 264 and line 269 of inception_distributed_train.py. But aren't the chief_queue_runners also in the tf.GraphKeys.QUEUE_RUNNERS collection (all QUEUE_RUNNERS are obtained on line 263)? If so, line 269 seems unnecessary, since the chief_queue_runners would already have been started on line 264. (A paraphrased sketch of this pattern follows the list.)

  3. Also, could you explain, or point me to some references on, what queues are created inside tf.train.Supervisor?
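
For concreteness, here is a minimal, hypothetical sketch of the pattern I'm asking about (paraphrased from the cited lines, not copied from the source; the filenames and logdir are placeholders):

    import tensorflow as tf

    # A hypothetical input pipeline: string_input_producer registers a
    # QueueRunner in the QUEUE_RUNNERS collection, like the real input code.
    filename_queue = tf.train.string_input_producer(['a.txt', 'b.txt'])

    sv = tf.train.Supervisor(is_chief=True, logdir='/tmp/sv_logs')
    with sv.managed_session('') as sess:
        # ~line 263: all queue runners registered in the default collection
        queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS)
        # ~line 264: started explicitly, even though managed_session is
        # documented to have started this collection already (question 1)
        sv.start_queue_runners(sess, queue_runners)
        # ~line 269 then starts chief_queue_runners (from the sync-replicas
        # optimizer) separately, which is what question 2 is about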

Thanks for your time!

Upvotes: 0

Views: 313

Answers (1)

Yaroslav Bulatov

Reputation: 57883

Not an answer, but some general notes on how to find an answer :)

First of all, according to GitHub's blame view, inception_distributed was checked in on April 13, while that comment in start_queue_runners was added on April 15, so it's possible the functionality changed but the callers were never updated.

You could comment out that line and see if things still work. If not, you could add import pdb; pdb.set_trace() at the place where the queue runner gets created (i.e. here) and see who is creating those extra unattended queue runners.
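
For example, here is a minimal sketch of that idea that avoids editing the TensorFlow source by monkey-patching add_queue_runner (the function lives in tensorflow/python/training/queue_runner.py; adapt as needed):

    import traceback
    import tensorflow as tf
    from tensorflow.python.training import queue_runner

    # Wrap add_queue_runner so every registration of a QueueRunner prints
    # the stack that created it; swap traceback.print_stack() for
    # pdb.set_trace() to stop in the debugger instead.
    _original_add = queue_runner.add_queue_runner

    def _traced_add(qr, collection=tf.GraphKeys.QUEUE_RUNNERS):
        print('QueueRunner registered for queue:', qr.name)
        traceback.print_stack()
        return _original_add(qr, collection)

    queue_runner.add_queue_runner = _traced_add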

Also, Supervisor development seems to have slowed down and things are being moved over to FooSession (per the comment here). Those provide a more robust training architecture (your workers won't crash because of a temporary network error), but there aren't many examples of how to use them yet.
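
Assuming the FooSession work referred to here is what became tf.train.MonitoredTrainingSession (my guess, not something the linked comment confirms), basic usage looks roughly like this; the graph and checkpoint_dir are placeholders:

    import tensorflow as tf

    # Hypothetical toy training loop. MonitoredTrainingSession starts the
    # QUEUE_RUNNERS collection itself and recreates the session after
    # recoverable errors such as a temporary network failure on a worker.
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.assign_add(global_step, 1)

    with tf.train.MonitoredTrainingSession(
            master='',  # or server.target in a distributed job
            is_chief=True,
            checkpoint_dir='/tmp/train_logs',
            hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)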

Upvotes: 1
