Jacob Holloway

Reputation: 887

Distributed TensorFlow Errors

When running a distributed TensorFlow (v0.9.0rc0) setup, I start 3 parameter servers and then 6 workers. The parameter servers seem to be fine, each giving the message Started server with target: grpc://localhost:2222, but the workers give other errors (below) that I have questions about.
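For reference, here is a minimal sketch of the kind of startup script each process runs (the host names, ports, and flag values are placeholders, not my actual configuration):

import tensorflow as tf

# Hypothetical cluster layout: 3 parameter servers and 6 workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222", "ps2:2222"],
    "worker": ["worker{}:2222".format(i) for i in range(6)],
})

# job_name and task_index would normally come from command-line flags.
server = tf.train.Server(cluster, job_name="ps", task_index=0)

# Parameter servers block here and just serve variables.
server.join()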

It seems to me that sometimes the machines can't communicate with each other, producing the socket error: connection refused messages. It also seems that the workers can't find the parameter servers when initializing their variables, giving the Cannot assign a device error.

Can anyone help me understand what these errors individually mean, how serious each one is, and perhaps give me pointers on how to fix them if needed?

Specifically:

  1. Why am I getting socket errors?
  2. Why are there Master init: Unavailable issues / what do they mean?
  3. How can I ensure that the devices requested are available? (A placement-logging sketch follows this list.)
  4. Does this look like something I should post to the issues page of TensorFlow's GitHub repository?
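For context on question 3, the only placement-debugging knob I know of is device placement logging (a sketch; I'm assuming it behaves the same for a distributed gRPC session as it does locally):

# Log each op's assigned device to stderr when the session starts,
# so a request like '/job:ps/task:3' with no registered device
# shows up explicitly.
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(server.target, config=config)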

Notes on setup:


All of the workers give this error (the IP addresses vary):

E0719 12:06:17.711635677    2543 tcp_client_posix.c:173]  
 failed to connect to 'ipv4:192.168.xx.xx:2222': socket error: connection refused

But all of the non-chief workers also give:

E tensorflow/core/distributed_runtime/master.cc:202] Master init: Unavailable: 

Additionally, some of the non-chief workers crash, giving this error:

Traceback (most recent call last):  
    File "main.py", line 219, in <module>  
        r.main()  
    File "main.py", line 119, in main  
        with sv.prepare_or_wait_for_session(server.target, config=tf.ConfigProto(gpu_options=gpu_options)) as sess:  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 691, in prepare_or_wait_for_sessionn max_wait_secs=max_wait_secs)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 282, in wait_for_session  
        sess.run([self._local_init_op])  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
        run_metadata_ptr)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _run  
        feed_dict_string, options, run_metadata)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 708, in _do_run  
        target_list, options, run_metadata)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 728, in _do_call  
        raise type(e)(node_def, op, message)  
    tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_23':
        Could not satisfy explicit device specification '/job:ps/task:3/device:CPU:0'
        because no devices matching that specification are registered in this process; available devices: 
            /job:ps/replica:0/task:0/cpu:0,
            /job:ps/replica:0/task:1/cpu:0,
            /job:ps/replica:0/task:2/cpu:0,
            /job:ps/replica:0/task:4/cpu:0,
            /job:worker/replica:0/task:0/cpu:0,
            /job:worker/replica:0/task:0/gpu:0,
            /job:worker/replica:0/task:1/cpu:0,
            /job:worker/replica:0/task:1/gpu:0,
            /job:worker/replica:0/task:2/cpu:0,
            /job:worker/replica:0/task:2/gpu:0 
[[Node: save/restore_slice_23 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:ps/task:3/device:CPU:0"](save/Const, save/restore_slice_23/tensor_name, save/restore_slice_23/shape_and_slice)]]
Caused by op u'save/restore_slice_23', defined at:  
    File "main.py", line 219, in <module>  
        r.main()  
    File "main.py", line 101, in main  
        saver = tf.train.Saver()  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 845, in __init__  
        restore_sequentially=restore_sequentially)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 515, in build  
        filename_tensor, vars_to_save, restore_sequentially, reshape)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 271, in _AddRestoreOps  
        values = self.restore_op(filename_tensor, vs, preferred_shard)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 186, in restore_op
        preferred_shard=preferred_shard)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 202, in _restore_slice  
        preferred_shard, name=name)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 358, in _restore_slice  
        preferred_shard=preferred_shard, name=name)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op  
        op_def=op_def)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2260, in create_op  
        original_op=self._default_original_op, op_def=op_def)  
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1230, in __init__  
        self._traceback = _extract_stack()

Upvotes: 1

Views: 1075

Answers (1)

Jacob Holloway

Reputation: 887

I figured out what my problem was.

TL;DR: The chief needs to know about all the variables in order to initialize them all. Non-chief workers can't create their own variables.

I was converting an old program, in which every worker had a few independent variables but also needed to share some variables (I was using ZMQ to pass those around), to a distributed TensorFlow setup, and I forgot to create all of the variables on all of the workers. I had something like

# Create worker-specific variable
with tf.variable_scope("world_{}".format(worker_id)):
    w1 = tf.get_variable("weight", shape=(input_dim, hidden_dim), dtype=tf.float32, initializer=tf.truncated_normal_initializer())

instead of doing something like this:

# Create all worker-specific variables
all_w1 = {}
for worker in range(worker_cnt):
    with tf.variable_scope("world_{}".format(worker)):
        all_w1[worker] = tf.get_variable("weight", shape=(input_dim, hidden_dim), dtype=tf.float32, initializer=tf.truncated_normal_initializer())

# grab this worker's variable
w1 = all_w1[worker_id]
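(Related, in case it helps others: for the variables that genuinely are shared, pinning them to the ps job with tf.train.replica_device_setter gives each one a concrete, registered device; the task count below is a placeholder for your own cluster.)

# Round-robin shared variables across the ps tasks while keeping
# compute ops on this worker's own device.
with tf.device(tf.train.replica_device_setter(
        ps_tasks=3,
        worker_device="/job:worker/task:{}".format(worker_id))):
    w_shared = tf.get_variable("shared_weight", shape=(input_dim, hidden_dim),
                               dtype=tf.float32,
                               initializer=tf.truncated_normal_initializer())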

As for the errors...

I suspect this is what caused some workers to die with the Master init: Unavailable: error message above, because the chief never knew about the variables those workers wanted to create.
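Concretely, the chief is the process that runs the init op through the Supervisor, so that op has to cover every variable in the graph. A sketch of the pattern that works for me (assuming every process builds the same graph):

# Every process builds the SAME graph, including all workers' variables,
# so the chief's init op can initialize everything.
init_op = tf.initialize_all_variables()
sv = tf.train.Supervisor(is_chief=(task_index == 0), init_op=init_op)
with sv.prepare_or_wait_for_session(server.target) as sess:
    pass  # training loop goes here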

I don't have a solid explanation for why the Cannot assign a device (3rd) error couldn't find that device, but I think it's for the same reason: only the chief could create that variable, and it didn't know about the new variables.

The 1st error seems to happen because the machines weren't ready to talk after their failures; I haven't seen it since making the fixes. I still see it if I kill a worker and start it up again, but it doesn't seem to be an issue if they all start up together.


Anyway, I hope this is helpful to anyone who hits the same errors later on.

Upvotes: 1
