Reputation: 115
I'm trying to connect to the head node of a remote Ray (ray.io) cluster using ray.init(address="{node_external_ip}:6379"), in order to load-test remote procedure calls.
I launch the head node with the following command:
ray start --head --node-ip-address <node-external-IP>
(Note: I specify the head node's external IP because, in my previous attempts, the client otherwise failed to establish a connection with the remote cluster at all. The TCP port used is the default, 6379; I double-checked that it's open and accessible.)
After that, although the client succeeds in establishing a connection with the remote cluster:
Connecting to existing Ray cluster at address: <node-external-IP>:6379...
global_state_accessor.cc:357: This node has an IP address of <client-internal-IP>, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
Connected to Ray cluster.
...it subsequently fails with the following message:
Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: .Please see `raylet.out` for more details.
In turn, raylet.out on the remote cluster side contains the following log record:
The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
...while dashboard_agent.log shows:
ERROR agent.py:473 -- Agent is working abnormally. It will exit immediately.
(...)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1661955376.270755430","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1661955376.270754305","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
TCP port for the dashboard is also open and accessible.
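For anyone who wants to reproduce the reachability checks mentioned above, here is a minimal sketch using only the Python standard library. The helper name port_is_reachable is hypothetical, and the port numbers in the usage note are assumed Ray defaults (6379 for the GCS server as in this question, 8265 for the dashboard, 10001 for Ray Client); adjust them if the cluster was started with different ports. A successful TCP connect only shows that something is listening on that port, not that the Ray services behind it are healthy.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only tests TCP reachability
    and says nothing about the health of the service behind the port.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Usage would look like port_is_reachable("<node-external-IP>", 6379), repeated for 8265 and 10001, run from the client machine.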
When the head node is launched with the --include-dashboard false CLI option, the results are the same, and the log records are identical, even in dashboard_agent.log.
Also, several seconds after being launched with the --block option, the head node dies with the following message:
Some Ray subprocesses exited unexpectedly:
raylet [exit code=1]
Remaining processes will be killed.
Log records are absolutely the same.
I made sure that both client and remote cluster head node use the same versions of Python and ray (tested with ray 1.12.0, 1.13.0, 2.0.0; Python 3.9.13, 3.10.5).
I've also tried specifying _node_ip_address and adding the "ray://" prefix when calling ray.init(), and it still fails.
Client-side OS: Manjaro Linux x86_64, kernel 5.10.136-1-MANJARO.
Remote cluster-side OS: Ubuntu 20.04 x86_64, kernel 5.13.0-1031-aws (it's an AWS EC2 instance). I also tried deploying the remote cluster on a physical machine with the above-mentioned Manjaro Linux setup and got the same result.
Docker is not being used.
What could be workarounds for this issue?
Upvotes: 1
Views: 2935
Reputation: 11
It looks like you're using the GCS server port (6379), but what you're probably looking for here is the Ray Client port, 10001. Can you try connecting with ray.init("ray://<address>:10001")?
See the Ray Client documentation for more details: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
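The suggestion above can be sketched roughly as follows. The helper names build_client_address and connect_to_cluster are hypothetical, and the sketch assumes the head node exposes the Ray Client server on port 10001 and that the port is reachable from the client machine. Ray is imported lazily so the address helper can be used without Ray installed.

```python
def build_client_address(host: str, port: int = 10001) -> str:
    """Build a Ray Client URL such as ray://<address>:10001.

    Hypothetical helper; 10001 is assumed to be the Ray Client port
    on the head node.
    """
    return f"ray://{host}:{port}"


def connect_to_cluster(host: str, port: int = 10001):
    """Connect through Ray Client instead of the GCS port (6379)."""
    import ray  # imported here so the helper above works without Ray

    # The "ray://" prefix tells ray.init() to use Ray Client.
    return ray.init(address=build_client_address(host, port))
```

Calling connect_to_cluster("<node-external-IP>") from the client machine would then replace the original ray.init(address="{node_external_ip}:6379") call.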
Upvotes: 1