oopcode

Reputation: 1962

TensorFlow freezes when not using a Supervisor

No GPU, no queues, TensorFlow 1.1.0

There's this sample LSTM code:

https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py

This code works: it prints training progress info as expected. I then tried to write the trained model graph to disk using freeze_graph(), and eventually found out that this LSTM tutorial uses a tf.train.Supervisor to train the model. The Supervisor finalizes (freezes) the graph, and a finalized graph cannot be used with the freeze_graph() procedure.
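For context, here is a minimal sketch (not part of the tutorial) of the finalization behaviour I mean; once a graph is finalized, trying to add any op to it raises an error:

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    v = tf.Variable(0, name="v")

g.finalize()  # tf.train.Supervisor does this internally before training starts

with g.as_default():
    tf.constant(1)  # RuntimeError: Graph is finalized and cannot be modified.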

I tried to switch from the Supervisor to an ordinary session. The only changes were in the main() procedure (apart from a few extra imports). It now looks like this (changed parts are marked with # CHANGED comments; I removed all the graph-saving code, since that's not the issue here):

with tf.Graph().as_default():
    initializer = tf.random_uniform_initializer(-config.init_scale,
                                                config.init_scale)
    with tf.name_scope("Train"):
        train_input = PTBInput(config=config, data=train_data, name="TrainInput")
        with tf.variable_scope("Model", reuse=None, initializer=initializer):
            m = PTBModel(is_training=True, config=config, input_=train_input)
        tf.summary.scalar("Training Loss", m.cost)
        tf.summary.scalar("Learning Rate", m.lr)

    with session.Session() as sess:  # CHANGED
        sess.run(variables.global_variables_initializer())  # CHANGED
        for i in range(config.max_max_epoch):
            lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
            m.assign_lr(sess, config.learning_rate * lr_decay)
            print("Epoch: %d Learning rate: %.3f" % (i + 1, sess.run(m.lr)))
            train_perplexity = run_epoch(sess, m, eval_op=m.train_op,
                                         verbose=True)
            print("Epoch: %d Train Perplexity: %.3f" %
                  (i + 1, train_perplexity))

After these changes, the whole thing freezes at this line:

https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py#L300

It is a session.run() call inside the model internals (the process doesn't react to Ctrl+C and is only killable with kill -9):

vals = session.run(fetches, feed_dict)

Previous session.run() calls (there are several) worked just fine.

What did I do wrong? It seems like all the variables get initialized just fine (which the Supervisor did in the original code). Any ideas?

Upvotes: 1

Views: 277

Answers (1)

mrry

Reputation: 126184

When you use tf.train.Supervisor, the framework code automatically calls tf.train.start_queue_runners(sess) (along with initializing the variables) at the beginning of the session. The PTB tutorial's input pipeline (reader.ptb_producer()) is queue-based, so without running queue runners the first dequeue blocks forever, which is the hang you are seeing. If you use a raw tf.Session instead, you must start the queue runners manually. A change like the following should work:

# ...
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  tf.train.start_queue_runners(sess)
  # ...
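If you also want the queue-runner threads to shut down cleanly when training ends, the usual pattern adds a tf.train.Coordinator. A sketch (reusing config and the training loop from the question):

import tensorflow as tf

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(sess=sess, coord=coord)
  try:
    for i in range(config.max_max_epoch):
      ...  # training loop from the question goes here
  finally:
    coord.request_stop()  # ask the queue-runner threads to stop
    coord.join(threads)   # and wait for them to exit

The try/finally ensures the threads are stopped and joined even if the training loop raises.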

Upvotes: 2
