Reputation: 827
I want to run TensorFlow on multiple machines with multiple GPUs. As a first step, I am trying out distributed TensorFlow on a single machine (following the TensorFlow tutorial at https://www.tensorflow.org/how_tos/distributed/).
Below are the lines after which sess.run() gets stuck:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
a = tf.constant(8)
b = tf.constant(9)
sess = tf.Session('grpc://localhost:2222')
Everything works fine up to this point, but when I run sess.run(), it hangs:
sess.run(tf.mul(a,b))
If anybody has already worked with distributed TensorFlow, please let me know the solution, or point me to another tutorial that works.
Upvotes: 0
Views: 713
Reputation: 126154
By default, distributed TensorFlow will block until all servers named in the tf.train.ClusterSpec have started. This happens during the first interaction with the server, which will typically be the first sess.run() call. Therefore, if you haven't also started a server listening on localhost:2223, then TensorFlow will block until you do.
There are a few solutions to this problem, depending on your later goals:
Start a server on localhost:2223. In another process, run the following script:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)
server.join() # Wait forever for incoming connections.
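Once this second process is running, the sess.run(tf.mul(a, b)) call in the original script will unblock and return 72.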
Remove task 1 from the original tf.train.ClusterSpec:
import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
# ...
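For completeness, a minimal sketch of the full single-task version, assuming the same TF 0.x API and graph as in the question:
import tensorflow as tf

# The cluster now contains only task 0, so nothing waits on localhost:2223.
cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)

a = tf.constant(8)
b = tf.constant(9)

sess = tf.Session("grpc://localhost:2222")
print(sess.run(tf.mul(a, b)))  # ==> 72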
Specify a "device filter" when you create the tf.Session so that the session only uses task 0:
# ...
sess = tf.Session("grpc://localhost:2222",
                  config=tf.ConfigProto(device_filters=["/job:local/task:0"]))
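Putting that together, a minimal sketch of the device-filter version (again assuming the question's TF 0.x graph); the filter keeps the session from waiting on, or placing ops on, task 1:
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)

a = tf.constant(8)
b = tf.constant(9)

# The device filter restricts this session to task 0, so it starts even
# though no server is listening on localhost:2223.
sess = tf.Session("grpc://localhost:2222",
                  config=tf.ConfigProto(device_filters=["/job:local/task:0"]))
print(sess.run(tf.mul(a, b)))  # ==> 72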
Upvotes: 2