awesomeyou

Reputation: 1

Distributed TensorFlow hangs during CreateSession

I am new to distributed TensorFlow. Right now I am just trying to get some existing examples to work so I can learn how to do it right.

I am following the instructions here to train the Inception network on one Linux machine with one worker and one PS: https://github.com/tensorflow/models/tree/master/research/inception#how-to-train-from-scratch-in-a-distributed-setting

The program hangs during CreateSession with the message:

CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

This is my command to start a worker:

./bazel-bin/inception/imagenet_distributed_train \
    --batch_size=32 \
    --data_dir=/datasets/BigLearning/jinlianw/imagenet_tfrecords/ \
    --job_name='worker' \
    --task_id=0 \
    --ps_hosts='localhost:2222' \
    --worker_hosts='localhost:2223'

This is my command to start a PS:

./bazel-bin/inception/imagenet_distributed_train \
    --job_name='ps' \
    --task_id=0 \
    --ps_hosts='localhost:2222' \
    --worker_hosts='localhost:2223'

And the PS process hangs after printing:

2018-06-29 21:40:43.097361: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:2222

Is the Inception model still a valid example for distributed TensorFlow, or did I do something wrong?

Thanks!

Upvotes: 0

Views: 323

Answers (1)

awesomeyou

Reputation: 1

Problem resolved. It turns out it was due to gRPC: my cluster machines have the environment variable http_proxy set, so gRPC was presumably trying to route the worker/PS connections through the proxy instead of connecting directly to localhost. Unsetting this variable solves the problem.
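
For reference, a minimal sketch of how the fix looks when launching the PS (only http_proxy was set in my case; the uppercase and https variants are included just in case they are set on your machines). Do the same in the shell that starts the worker:

# Clear the proxy settings in the shell before launching the process:
unset http_proxy
# If these are also set in your environment, clear them as well:
unset HTTP_PROXY https_proxy HTTPS_PROXY

# Then launch as before:
./bazel-bin/inception/imagenet_distributed_train \
    --job_name='ps' \
    --task_id=0 \
    --ps_hosts='localhost:2222' \
    --worker_hosts='localhost:2223'

Alternatively, the variable can be cleared only for the launched process with env -u http_proxy ./bazel-bin/inception/imagenet_distributed_train ... (same flags as above), which leaves the shell's environment untouched.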

Upvotes: 0
