Nelson Yalta

Reputation: 53

Tensorflow distributed passing devices

I recently installed the distributed version of TensorFlow. Following that, I tried to run a model with multiple GPUs on multiple computers, and I also found the white paper with some additional specifications. I can run the server and a worker on 2 different computers (with 2 and 1 GPUs, respectively) and, using a gRPC session, allocate and run the program in remote or local mode.
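
For reference, this is how I create the session against a running server (just a sketch; here localhost:2500 matches the local server I start below):

import tensorflow as tf

# Connect a session to the gRPC runtime of a running server
# (replace the address with your own server's host:port).
with tf.Session("grpc://localhost:2500") as sess:
    sess.run(tf.initialize_all_variables())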

I ran TensorFlow locally on the remote computer with:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='local|localhost:2500' --job_name=local --task_id=0 &

and on the server:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500,prs|192.168.170.226:2500' --job_name=worker --task_id=0 \
--job_name=prs --task_id=0 &

However, when I try to specify the devices so that the program runs on the 2 computers at the same time, Python shows me the error:

 Could not satisfy explicit device specification '/job:worker/task:0'

when I use:

with tf.device("/job:prs/task:0/device:gpu:0"):
  x = tf.placeholder(tf.float32, [None, 784], name='x-input')
  W = tf.Variable(tf.zeros([784, 10]), name='weights')
with tf.device("/job:prs/task:0/device:gpu:1"):
  b = tf.Variable(tf.zeros([10], name='bias'))
# Use a name scope to organize nodes in the graph visualizer
with tf.device("/job:worker/task:0/device:gpu:0"):
  with tf.name_scope('Wx_b'):
    y = tf.nn.softmax(tf.matmul(x, W) + b)

or even when changing the job name. So I am wondering whether I need to add a new device, or whether I am doing something wrong with the initialization of the cluster.

Upvotes: 2

Views: 1898

Answers (1)

LKT

Reputation: 121

The worker is really the name of the cluster.

Your first bazel call should be like this:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500;192.168.170.226:2501' --job_name=worker --task_id=0 &

You run this on the first node, 192.168.170.193.

Your cluster name is worker, and it includes the IP addresses of the two nodes. The task IDs then refer to the two running nodes. You must start the server on both nodes, specifying a different task ID for each node, i.e. then run:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500;192.168.170.226:2501' --job_name=worker --task_id=1 &

on your second node, 192.168.170.226.
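
As an aside, if your build includes the Python bindings for the distributed runtime, you can start the equivalent servers from a script instead of the bazel binary. A rough sketch, assuming tf.train.ClusterSpec and tf.train.Server exist in your version:

import tensorflow as tf

# Same two-node cluster as in the commands above.
cluster = tf.train.ClusterSpec({
    "worker": ["192.168.170.193:2500", "192.168.170.226:2501"]
})

# Use task_index=0 on the first node and task_index=1 on the second.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block and keep serving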

Then, in your Python program, run:

with tf.device("/job:worker/task:0/device:gpu:0"):
  x = tf.placeholder(tf.float32, [None, 784], name='x-input')
  W = tf.Variable(tf.zeros([784, 10]), name='weights')
with tf.device("/job:worker/task:0/device:gpu:1"):
  b = tf.Variable(tf.zeros([10], name='bias'))
# Use a name scope to organize nodes in the graph visualizer
with tf.device("/job:worker/task:1/device:gpu:0"):
  with tf.name_scope('Wx_b'):
    y = tf.nn.softmax(tf.matmul(x, W) + b)
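
Finally, point a session at one of the servers so the placements above take effect. A minimal sketch (the master address and the dummy feed are my assumptions):

import numpy as np

# Connect to the master (the first node here) and run the graph.
with tf.Session("grpc://192.168.170.193:2500") as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(y, feed_dict={x: np.zeros((1, 784), dtype=np.float32)}))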

Upvotes: 2
