Angus393

Reputation: 51

Breaking change to distributed training when moving from TF v1.3 to v1.4: "UnavailableError: Trying to connect an http1.x server"

When creating a managed session to use for distributed training with this line:

with sv.managed_session(server.target, config=config) as sess, sess.as_default():

I get this error (full stack trace at bottom) on the chief worker:

tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

Everything still seems to be fine on the parameter server which reports:

E1106 11:26:32.844686639    5543 ev_epoll1_linux.c:1051]     grpc epoll fd: 8
2017-11-06 11:26:32.851773: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12222}   
2017-11-06 11:26:32.851863: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12223}
2017-11-06 11:26:32.856802: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12222

I only receive this error with the new v1.4 of TensorFlow, both when built from source and when installed from pip. Everything works fine in v1.3. Does anyone know whether a breaking change has been made, presumably in how TensorFlow works with gRPC?

I'm wondering whether this has something to do with HTTP/2 vs HTTP/1.x. gRPC carries protobuf messages over HTTP/2, and the error seems to indicate that the client is reaching an HTTP/1.x server, but that still doesn't explain why this breaks only when upgrading from v1.3 to v1.4.

Does anyone know more about what the error

UnavailableError: Trying to connect an http1.x server

is referring to or what might be a fix here?

I am working on Red Hat Linux and trying to do distributed training across processes on the same localhost, not even going over the network. I'd appreciate any thoughts, and I hope this can help others with the same problem as well.


Full stacktrace:

E1106 11:28:24.383745692    5787 ev_epoll1_linux.c:1051]     grpc epoll fd: 8
2017-11-06 11:28:24.391084: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-06 11:28:24.391185: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-06 11:28:24.392285: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
2017-11-06 11:28:37.875632: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server
Traceback (most recent call last):
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
    self._extend_graph()
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1599, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1026, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "worker.py", line 426, in <module>
    main()
  File "worker.py", line 418, in main
    run(args, server)
  File "worker.py", line 174, in run
    sess.run(trainer.sync)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

Upvotes: 4

Views: 1559

Answers (1)

Angus393

Reputation: 51

If you follow @NoahEisen's suggestion and run

export GRPC_VERBOSITY="DEBUG"

you'll see something more informative like this:

E1108 17:37:57.085195825   17711 ev_epoll1_linux.c:1051]     grpc epoll fd: 5
D1108 17:37:57.085309439   17711 ev_posix.c:111]             Using polling engine: epoll1
D1108 17:37:57.085380147   17711 dns_resolver.c:301]         Using native dns resolver
I1108 17:37:57.085819333   17711 socket_utils_common_posix.c:223] Disabling AF_INET6 sockets because ::1 is not available.
I1108 17:37:57.086001584   17711 tcp_server_posix.c:322]     Failed to add :: listener, the environment may not support IPv6: {"created":"@1510180677.085876868","description":"OS Error","errno":97,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.c","file_line":256,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:12223"}
2017-11-08 17:37:57.092525: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-08 17:37:57.092648: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-08 17:37:57.093435: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
D1108 17:38:02.607109518   17830 http_proxy.c:70]            userinfo found in proxy URI
I1108 17:38:02.611335569   17807 http_connect_handshaker.c:304] Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx
2017-11-08 17:38:02.617814: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server

I am behind a proxy, but I am only trying to do distributed training on localhost. For some reason gRPC tries to connect via the proxy, even though the IP 127.0.0.1 should be equivalent to localhost, right? Note this part in particular:

Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx

I guess this was laziness in my Python code. If I change the ps address in the cluster spec to "localhost" explicitly instead of the IP 127.0.0.1, everything seems to work again in TF 1.4, because it no longer tries to reach localhost via my proxy server (which was indeed HTTP/1.x only, I think).
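As a rough sketch of that change, the cluster definition can use the same hostname for every task. The helper name `make_cluster` is made up for illustration, and the ports are simply the ones that appear in the logs above; this is not the original worker.py code.

```python
def make_cluster(ps_ports, worker_ports, host="localhost"):
    """Build a cluster dict in which every task uses the same host string.

    `make_cluster` is a hypothetical helper; only the idea of using
    "localhost" consistently comes from the answer above.
    """
    return {
        "ps": ["%s:%d" % (host, port) for port in ps_ports],
        "worker": ["%s:%d" % (host, port) for port in worker_ports],
    }


# Ports taken from the logs in this thread.
cluster = make_cluster(ps_ports=[12222], worker_ports=[12223])

if __name__ == "__main__":
    print(cluster)
```

With TensorFlow 1.x installed, a dict of this shape is what gets passed to tf.train.ClusterSpec.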

@PeteWaren - does this constitute an actual bug in TensorFlow or gRPC? Shouldn't localhost and 127.0.0.1 be treated as equivalent? Either way, the way this is handled has changed from TF 1.3 to TF 1.4.

Thanks for everyone's help.
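One possible explanation, not confirmed in this thread, is that the proxy exclusion list (the no_proxy environment variable) is matched against the literal host string, so a list containing "localhost" but not "127.0.0.1" would send only the latter through the proxy. The sketch below checks that for a given environment. The matching rule is a simplified approximation rather than gRPC's actual code, and the function names are hypothetical.

```python
import os


def proxy_bypassed(host, no_proxy):
    """Return True if `host` matches an entry of a comma-separated no_proxy list.

    Simplified rule (an assumption): exact match, or match on a domain suffix.
    """
    entries = [e.strip().lower() for e in no_proxy.split(",") if e.strip()]
    host = host.lower()
    return any(
        host == entry or host.endswith("." + entry.lstrip("."))
        for entry in entries
    )


def goes_via_proxy(hosts, environ=None):
    """Map each host to True if a proxy is configured and the host is not excluded."""
    environ = os.environ if environ is None else environ
    proxy = environ.get("http_proxy") or environ.get("HTTP_PROXY") or ""
    no_proxy = environ.get("no_proxy") or environ.get("NO_PROXY") or ""
    return {h: bool(proxy) and not proxy_bypassed(h, no_proxy) for h in hosts}


if __name__ == "__main__":
    # Inspect the current shell environment for the two hosts from the logs.
    print(goes_via_proxy(["localhost", "127.0.0.1"]))
```

If 127.0.0.1 comes back as going via the proxy, adding it to no_proxy before starting the workers may be an alternative to editing the cluster spec.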

Upvotes: 1
