Reputation: 4170
I am trying to make a distributed TensorFlow
implementation by following the instructions in this blog: Distributed TensorFlow by Leo K. Tam. My aim is to perform replicated training
as mentioned in this post.
I have completed the steps up to installing TensorFlow,
and I can successfully run the following command and get results:
sudo bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
The next thing I want to do is launch the gRPC server
on one of the nodes with the following command:
bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='worker|192.168.555.254:2500;192.168.555.255:2501' --job_name=worker --task_id=0 &
However, when I run it, I get the following error:
-bash: bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server: No such file or directory
The contents of my rpc
folder are:
libgrpc_channel.pic.a libgrpc_remote_master.pic.lo libgrpc_session.pic.lo libgrpc_worker_service_impl.pic.a _objs/
libgrpc_master_service_impl.pic.a libgrpc_remote_worker.pic.a libgrpc_tensor_coding.pic.a libgrpc_worker_service.pic.a
libgrpc_master_service.pic.lo libgrpc_server_lib.pic.lo libgrpc_worker_cache.pic.a librpc_rendezvous_mgr.pic.a
I am clearly missing a step in between that is not mentioned in the blog. My objective is to be able to run the command above (to launch the gRPC server)
so that I can start a worker process on one of the nodes.
Upvotes: 2
Views: 1161
Reputation: 126174
The grpc_tensorflow_server binary was a temporary measure used in the pre-release version of Distributed TensorFlow, and it is no longer built by default or included in the binary distributions. Its replacement is the tf.train.Server Python class, which is more programmable and easier to use.
You can write simple Python scripts using tf.train.Server to reproduce the behavior of grpc_tensorflow_server:
# ps.py. Run this on 192.168.0.1. (IP addresses changed to be valid.)
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
{"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
job_name="ps", task_index=0)
server.join()
# worker_0.py. Run this on 192.168.0.2.
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
{"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
job_name="worker", task_index=0)
server.join()
# worker_1.py. Run this on 192.168.0.3. (IP addresses changed to be valid.)
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
{"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
job_name="worker", task_index=1)
server.join()
Clearly this example could be cleaned up and made reusable with command-line flags etc., but TensorFlow doesn't prescribe a particular form for these. The main things to note are that (i) there is one tf.train.Server instance per TensorFlow task, (ii) all Server instances must be constructed with the same "cluster definition" (the dictionary mapping job names to lists of addresses), and (iii) each task is identified by a unique pair of job_name and task_index.
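For instance, here is a minimal sketch (my own, not part of the original answer) of how the three scripts could be collapsed into one reusable launcher driven by command-line flags; the script name, flag names, and default values are assumptions for illustration:
# launcher.py (hypothetical name). Start the same file on every machine,
# varying only --job_name and --task_index.
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "192.168.0.1:2222",
                    "Comma-separated list of ps host:port pairs")
flags.DEFINE_string("worker_hosts", "192.168.0.2:2222,192.168.0.3:2222",
                    "Comma-separated list of worker host:port pairs")
flags.DEFINE_string("job_name", "worker", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of this task within its job")
FLAGS = flags.FLAGS

def main(_):
  # The cluster definition must be identical for every task.
  cluster = {"ps": FLAGS.ps_hosts.split(","),
             "worker": FLAGS.worker_hosts.split(",")}
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)
  server.join()

if __name__ == "__main__":
  tf.app.run()
You would then start each task with the appropriate --job_name and --task_index, keeping the cluster definition the same everywhere.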
Once you run the three scripts on the respective machines, you can create another script that connects to them:
import tensorflow as tf
sess = tf.Session("grpc://192.168.0.2:2222")
# ...
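For example, a minimal smoke test (my own sketch, not from the original answer) might build a trivial op, pin it to one of the worker tasks, and evaluate it through the remote session to confirm the cluster is reachable:
# smoke_test.py (hypothetical). Runs a trivial op on worker task 0.
import tensorflow as tf

with tf.device("/job:worker/task:0"):
    c = tf.constant("Hello from the cluster")

sess = tf.Session("grpc://192.168.0.2:2222")
print(sess.run(c))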
Upvotes: 2