Reputation: 51
When creating a managed session to use for distributed training with this line:
with sv.managed_session(server.target, config=config) as sess, sess.as_default():
I get this error (full stack trace at bottom) on the chief worker:
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server
Everything still seems to be fine on the parameter server which reports:
E1106 11:26:32.844686639 5543 ev_epoll1_linux.c:1051] grpc epoll fd: 8
2017-11-06 11:26:32.851773: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12222}
2017-11-06 11:26:32.851863: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12223}
2017-11-06 11:26:32.856802: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12222
I only get this error with the new v1.4 of TensorFlow built from source (I found the same problem when installing from pip). Everything works fine in v1.3. Does anyone know if there has been a breaking change, presumably in how TensorFlow works with gRPC?
I'm wondering if this has something to do with HTTP/2 vs HTTP/1.x. gRPC runs protobuf over HTTP/2, and the error seems to indicate that it's trying to connect to an HTTP/1.x endpoint, but that still doesn't explain why things break just by upgrading from v1.3 to v1.4.
Does anyone know anything more about what the error
UnavailableError: Trying to connect an http1.x server
refers to, or what a fix might be?
I am working on Red Hat Linux and trying to do distributed training across processes on the same localhost... not even trying to go over the network. I'd appreciate any thoughts, and hope this can help others with the same problem as well.
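For reference, here is a rough sketch of the kind of setup my worker.py uses (the cluster spec, model, and Supervisor below are simplified stand-ins, not my exact code):

import tensorflow as tf

# Simplified stand-in for my actual worker.py: one ps task and one worker
# task, both on the local machine.
cluster = tf.train.ClusterSpec({"ps": ["127.0.0.1:12222"],
                                "worker": ["localhost:12223"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
    # ... build the model and train_op here ...

sv = tf.train.Supervisor(is_chief=True, global_step=global_step)
config = tf.ConfigProto(device_filters=["/job:ps", "/job:worker/task:0"])

# This is the line that raises the UnavailableError on the chief worker.
with sv.managed_session(server.target, config=config) as sess, sess.as_default():
    pass  # training loop would go here, e.g. sess.run(train_op)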
Full stacktrace:
E1106 11:28:24.383745692 5787 ev_epoll1_linux.c:1051] grpc epoll fd: 8
2017-11-06 11:28:24.391084: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-06 11:28:24.391185: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-06 11:28:24.392285: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
2017-11-06 11:28:37.875632: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server
Traceback (most recent call last):
File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in
_do_call
return fn(*args)
File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in
_run_fn
self._extend_graph()
File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in
_extend_graph
self._session, graph_def.SerializeToString(), status)
File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473,
in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1599, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1026, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/opt/pycharm-community-2017.2.3/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "worker.py", line 426, in <module>
main()
File "worker.py", line 418, in main
run(args, server)
File "worker.py", line 174, in run
sess.run(trainer.sync)
File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in
_run
feed_dict_tensor, options, run_metadata)
File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in
_do_run
options, run_metadata)
File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in
_do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server
Upvotes: 4
Views: 1559
Reputation: 51
If you follow @NoahEisen's suggestion and
export GRPC_VERBOSITY="DEBUG"
you'll see something more informative, like this:
E1108 17:37:57.085195825 17711 ev_epoll1_linux.c:1051] grpc epoll fd: 5
D1108 17:37:57.085309439 17711 ev_posix.c:111] Using polling engine: epoll1
D1108 17:37:57.085380147 17711 dns_resolver.c:301] Using native dns resolver
I1108 17:37:57.085819333 17711 socket_utils_common_posix.c:223] Disabling AF_INET6 sockets because ::1 is not available.
I1108 17:37:57.086001584 17711 tcp_server_posix.c:322] Failed to add :: listener, the environment may not support IPv6: {"created":"@1510180677.085876868","description":"OS Error","errno":97,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.c","file_line":256,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:12223"}
2017-11-08 17:37:57.092525: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-08 17:37:57.092648: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-08 17:37:57.093435: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
D1108 17:38:02.607109518 17830 http_proxy.c:70] userinfo found in proxy URI
I1108 17:38:02.611335569 17807 http_connect_handshaker.c:304] Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx
2017-11-08 17:38:02.617814: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server
I am behind a proxy, but I am only trying to do distributed training on the localhost. For some reason it tries to connect via the proxy, even though the IP 127.0.0.1 should be equivalent to localhost, right? Note this part in particular:
Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx
I guess this was laziness in my Python code. If I change the ps entry in the cluster spec to "localhost" explicitly instead of the IP 127.0.0.1, everything works again in TF 1.4, because it no longer tries to reach the localhost via my proxy server (which, I think, was indeed HTTP/1.x only).
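In other words, the change was roughly this (a minimal sketch; the port numbers and job names are just the ones from my logs, adjust to your own setup):

import tensorflow as tf

# Before (in TF 1.4 the 127.0.0.1 address got routed through my http_proxy
# and the connection failed):
# cluster = tf.train.ClusterSpec({"ps": ["127.0.0.1:12222"],
#                                 "worker": ["localhost:12223"]})

# After: spell out "localhost" for the ps task so gRPC doesn't go via the proxy.
cluster = tf.train.ClusterSpec({"ps": ["localhost:12222"],
                                "worker": ["localhost:12223"]})

server = tf.train.Server(cluster, job_name="worker", task_index=0)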
@PeteWaren - does this constitute an actual bug in TensorFlow or gRPC? Shouldn't localhost and 127.0.0.1 be equivalent? Either way, the way this is handled has changed from TF 1.3 to TF 1.4.
Thanks for everyone's help.
Upvotes: 1