Pramod Patil

Reputation: 827

Distributed TensorFlow gets stuck at sess.run()

I want to run TensorFlow on multiple machines with multiple GPUs. As an initial step, I am trying out distributed TensorFlow on a single machine, following the TensorFlow tutorial (https://www.tensorflow.org/how_tos/distributed/).

Below are the lines I run before the sess.run() call that gets stuck:

    import tensorflow as tf
    cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
    server = tf.train.Server(cluster, job_name="local", task_index=0)
    a = tf.constant(8)
    b = tf.constant(9)
    sess = tf.Session('grpc://localhost:2222')

Everything works fine up to this point, but when I call sess.run(), it hangs:

    sess.run(tf.mul(a,b))

If anybody has already worked with distributed TensorFlow, please let me know the solution or point me to another tutorial that works.

Upvotes: 0

Views: 713

Answers (1)

mrry

Reputation: 126154

By default, Distributed TensorFlow will block until all servers named in the tf.train.ClusterSpec have started. This happens during the first interaction with the server, which will typically be the first sess.run() call. Therefore, if you haven't also started a server listening on localhost:2223, then TensorFlow will block until you do.
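To check which task is holding things up, one option is to probe each address in the cluster definition before calling sess.run(). The following is only a rough sketch using the Python standard library, not anything provided by TensorFlow: the helper name unreachable_tasks and the plain dict standing in for the ClusterSpec are my own assumptions, and a successful TCP connection only shows that something is listening on the port, not that it is a TensorFlow server.

```python
import socket


def unreachable_tasks(cluster, timeout=1.0):
    """Return (job, task_index, address) for each address that refuses a TCP connection.

    `cluster` is a plain dict with the same shape as the one passed to
    tf.train.ClusterSpec, e.g. {"local": ["localhost:2222", "localhost:2223"]}.
    This is a hypothetical helper, not part of TensorFlow.
    """
    missing = []
    for job, addresses in cluster.items():
        for task_index, address in enumerate(addresses):
            host, port = address.rsplit(":", 1)
            try:
                # Success means only that some process is listening here.
                with socket.create_connection((host, int(port)), timeout=timeout):
                    pass
            except OSError:
                # Covers connection refused, timeouts, and name lookup failures.
                missing.append((job, task_index, address))
    return missing


cluster = {"local": ["localhost:2222", "localhost:2223"]}
for job, task_index, address in unreachable_tasks(cluster):
    print("No listener for /job:%s/task:%d at %s" % (job, task_index, address))
```

If this reports task 1 at localhost:2223 while the server for task 0 is already running, that matches the situation described above: the first sess.run() waits for a server that was never started.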

There are a few solutions to this problem, depending on your later goals:

  1. Start a server on localhost:2223. In another process, run the following script:

     import tensorflow as tf
     cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
     server = tf.train.Server(cluster, job_name="local", task_index=1)
     server.join()  # Wait forever for incoming connections.
    
  2. Remove task 1 from the original tf.train.ClusterSpec:

     import tensorflow as tf
     cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
     server = tf.train.Server(cluster, job_name="local", task_index=0)
     # ...
    
  3. Specify a "device filter" when you create the tf.Session so that the session only uses task 0.

     # ...
     sess = tf.Session("grpc://localhost:2222",
                       config=tf.ConfigProto(device_filters=["/job:local/task:0"]))
    

Upvotes: 2
