Reputation: 4170
I am trying to make a distributed TensorFlow
implementation by following the instructions in this blog: Distributed TensorFlow by Leo K. Tam. My aim is to perform replicated training
as mentioned in this post.
I have completed the steps up to installing TensorFlow,
and I can successfully run the following command and get results:
sudo bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
The next thing I want to do is launch the gRPC server
on one of the nodes with the following command:
bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='worker|192.168.555.254:2500;192.168.555.255:2501' --job_name=worker --task_id=0 &
However, when I run it, I get the following error:
-bash: bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server: No such file or directory
The contents of my rpc
folder are:
libgrpc_channel.pic.a libgrpc_remote_master.pic.lo libgrpc_session.pic.lo libgrpc_worker_service_impl.pic.a _objs/
libgrpc_master_service_impl.pic.a libgrpc_remote_worker.pic.a libgrpc_tensor_coding.pic.a libgrpc_worker_service.pic.a
libgrpc_master_service.pic.lo libgrpc_server_lib.pic.lo libgrpc_worker_cache.pic.a librpc_rendezvous_mgr.pic.a
I am clearly missing a step in between that is not mentioned in the blog. My objective is to be able to run the command above (to launch the gRPC server)
so that I can start a worker process on one of the nodes.
Upvotes: 2
Views: 1161
Reputation: 126174
The grpc_tensorflow_server binary was a temporary measure used in the pre-release version of Distributed TensorFlow, and it is no longer built by default or included in the binary distributions. Its replacement is the tf.train.Server Python class, which is more programmable and easier to use.
You can write simple Python scripts using tf.train.Server to reproduce the behavior of grpc_tensorflow_server:
# ps.py. Run this on 192.168.0.1. (IP addresses changed to be valid.)
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
{"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
job_name="ps", task_index=0)
server.join()
# worker_0.py. Run this on 192.168.0.2.
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
{"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
job_name="worker", task_index=0)
server.join()
# worker_1.py. Run this on 192.168.0.3. (IP addresses changed to be valid.)
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
{"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
job_name="worker", task_index=1)
server.join()
Clearly this example could be cleaned up and made reusable with command-line flags etc., but TensorFlow doesn't prescribe a particular form for these. The main things to note are that (i) there is one tf.train.Server instance per TensorFlow task, (ii) all Server instances must be constructed with the same "cluster definition" (the dictionary mapping job names to lists of addresses), and (iii) each task is identified by a unique pair of job_name and task_index.
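For instance, here is a minimal sketch (my own, not part of the original answer) of how the three scripts could be collapsed into one reusable launcher driven by command-line flags; the script name, flag names, and default values are assumptions for illustration:
# launcher.py (hypothetical name). Start the same file on every machine,
# varying only --job_name and --task_index.
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "192.168.0.1:2222",
                    "Comma-separated list of ps host:port pairs")
flags.DEFINE_string("worker_hosts", "192.168.0.2:2222,192.168.0.3:2222",
                    "Comma-separated list of worker host:port pairs")
flags.DEFINE_string("job_name", "worker", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of this task within its job")
FLAGS = flags.FLAGS

def main(_):
  # The cluster definition must be identical for every task.
  cluster = {"ps": FLAGS.ps_hosts.split(","),
             "worker": FLAGS.worker_hosts.split(",")}
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)
  server.join()

if __name__ == "__main__":
  tf.app.run()
You would then start each task with the appropriate --job_name and --task_index, keeping the cluster definition the same everywhere.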
Once you run the three scripts on the respective machines, you can create another script that connects to them:
import tensorflow as tf
sess = tf.Session("grpc://192.168.0.2:2222")
# ...
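For example, a minimal smoke test (my own sketch, not from the original answer) might build a trivial op, pin it to one of the worker tasks, and evaluate it through the remote session to confirm the cluster is reachable:
# smoke_test.py (hypothetical). Runs a trivial op on worker task 0.
import tensorflow as tf

with tf.device("/job:worker/task:0"):
    c = tf.constant("Hello from the cluster")

sess = tf.Session("grpc://192.168.0.2:2222")
print(sess.run(c))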
Upvotes: 2