SirKnightRyder

Reputation: 1

Distributed Tensorflow : CreateSession still waiting only for different nodes

I am trying to get the mnist_replica.py example to work. As suggested in this question, I am specifying a device filter.

My code works when the ps and worker tasks are on the same node. When I put the ps task on node1 and the worker task on node2, I get "CreateSession still waiting".

For example:

Pseudo-Distributed Version (works!)

Terminal Dump of Node1 (instance 1)

node1 $ python mnist_replica.py --worker_hosts=node1:2223 --job_name=ps --task_index=0
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
job name = ps
task index = 0
2017-10-10 11:09:16.637006: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2017-10-10 11:09:16.637075: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> node1:2223}
2017-10-10 11:09:16.640114: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2222
...

Terminal Dump of Node1 (instance 2)

node1 $ python mnist_replica.py --worker_hosts=node1:2223 --job_name=worker --task_index=0
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
job name = worker
task index = 0
2017-10-10 11:11:12.784982: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2017-10-10 11:11:12.785046: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2223}
2017-10-10 11:11:12.787685: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2223
Worker 0: Initializing session...
2017-10-10 11:11:12.991784: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session 418af3aa5ce103a3 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true
Worker 0: Session initialization complete.
Training begins @ 1507648273.272837
1507648273.443305: Worker 0: training step 1 done (global step: 0)
1507648273.454537: Worker 0: training step 2 done (global step: 1)
...

Two-Node Distributed Version (doesn't work)

Terminal Dump of Node1

node1 $ python mnist_replica.py --worker_hosts=node2:2222 --job_name=ps --task_index=0
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
job name = ps
task index = 0
2017-10-10 10:54:27.419949: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2017-10-10 10:54:27.420064: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> node2:2222}
2017-10-10 10:54:27.426168: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2222
...

Terminal Dump of Node2

node2 $ python mnist_replica.py --ps_hosts=node1:2222 --worker_hosts=node2:2222 --job_name=worker --task_index=0
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
job name = worker
task index = 0
2017-10-10 10:51:13.303021: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> node1:2222}
2017-10-10 10:51:13.303081: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
2017-10-10 10:51:13.308288: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2222
Worker 0: Initializing session...
2017-10-10 10:51:23.508040: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-10-10 10:51:33.508247: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
...

Both nodes run CentOS 7, TensorFlow r1.3, and Python 2.7. The nodes can talk to each other via ssh, the hostnames are correct, and the firewall is disabled. Is anything missing?

Are there any additional steps I need to take to make sure the nodes can talk to each other over gRPC? Thanks.
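(One way to narrow this down, independent of TensorFlow: a plain TCP connect to the peer's gRPC port from each node. The helper below is a hypothetical sketch, not part of mnist_replica.py; the hostnames and ports are the ones from the cluster spec above.)

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds.

    A gRPC server listening on the port will accept the TCP handshake
    even though we speak no gRPC here, so this is enough to tell a
    firewall/routing problem apart from a TensorFlow problem.
    """
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        s.close()
        return True
    except (socket.error, socket.timeout):
        return False

# e.g. from node2, after starting the ps task on node1:
#   port_open("node1", 2222)
# False here points at networking (firewall, wrong host/port),
# not at the TensorFlow cluster definition.
```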

Upvotes: 0

Views: 629

Answers (2)

SirKnightRyder

Reputation: 1

The issue was that the firewall was blocking the ports. I disabled the firewall on all nodes in question and the issue resolved itself!

Upvotes: 0

jwl1993

Reputation: 324

I think you had better check the ClusterSpec and server setup. For example, check the IP addresses for node1 and node2, the ports, the task indexes, and so on. I would like to give a specific suggestion, but it is hard to do without seeing the code. Thanks.
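(To make this concrete: mnist_replica.py splits the comma-separated --ps_hosts and --worker_hosts flags into the dict it passes to tf.train.ClusterSpec, and every task must be started with identical flag values. The helpers below are a hypothetical sketch of that check in plain Python, with no TensorFlow dependency.)

```python
def parse_cluster(ps_hosts, worker_hosts):
    """Build the cluster dict the way mnist_replica.py does,
    from the comma-separated --ps_hosts/--worker_hosts strings."""
    return {
        "ps": ps_hosts.split(","),
        "worker": worker_hosts.split(","),
    }

def check_cluster(cluster):
    """Flag common mistakes in a cluster dict: duplicate host:port
    entries, and 'localhost' mixed into a multi-node cluster (each
    node would then resolve it to a different machine)."""
    problems = []
    all_hosts = [h for hosts in cluster.values() for h in hosts]
    if len(set(all_hosts)) != len(all_hosts):
        problems.append("duplicate host:port entry")
    nodes = {h.split(":")[0] for h in all_hosts}
    if len(nodes) > 1 and "localhost" in nodes:
        problems.append("'localhost' mixed with remote hosts")
    return problems
```

Running check_cluster on the flags each node was actually started with (note the node1 command above omits --ps_hosts, so it falls back to the script's default) is a quick way to confirm both nodes agree on the same ClusterSpec.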

Upvotes: 0
