Reputation: 31
I'm trying to set up a cluster with one ray-head and two ray-workers using docker swarm. I have three machines for this: one running the ray-head, and the other two each running a ray-worker. The cluster comes up OK, but whenever I exec into a container and run:
import ray
ray.init(redis_address='ray-head:6379')
I get:
WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
The logs of the containers look OK.
I have also tried with IPs, both the IP of the machine and the IP of the ray-head container:
ray.init(redis_address='192.168.30.193:6379')
When running:
telnet 192.168.30.193 6379
I get a response.
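The telnet check can also be reproduced from Python. A minimal sketch (the `port_open` helper is illustrative, not part of Ray; note that a successful connection only proves the port is reachable, not that Ray's processes registered correctly):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Equivalent to the telnet test above: it only checks reachability
    of the Redis port, not whether Ray registered under that address.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

# e.g. port_open('192.168.30.193', 6379)
```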
Dockerfile for the containers:
FROM python:2.7-slim
RUN apt-get -y update
RUN apt-get install -y --fix-missing \
        libxml2 \
        gcc \
        vim \
        iputils-ping \
        telnet \
        procps \
    && apt-get clean && rm -rf /tmp/* /var/tmp/*
RUN pip install ray
CMD ["echo", "Base Image Ready"]
docker-compose.yml
version: "3.5"
services:
  ray-head:
    image: simpled:0.1
    shm_size: '2gb'
    entrypoint: ['/usr/local/bin/ray']
    command: ['start', '--head', '--redis-port', '6379', '--redis-shard-ports', '6380,6381', '--object-manager-port', '12345', '--node-manager-port', '12346', '--node-ip-address', 'ray-head', '--block']
    ports:
      - target: 6379
        published: 6379
        protocol: tcp
        mode: host
      - target: 6380
        published: 6380
        protocol: tcp
        mode: host
      - target: 6381
        published: 6381
        protocol: tcp
        mode: host
      - target: 12345
        published: 12345
        protocol: tcp
        mode: host
      - target: 12346
        published: 12346
        protocol: tcp
        mode: host
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.Head == true]
  ray-worker:
    image: simpled:0.1
    shm_size: '2gb'
    entrypoint: ['/usr/local/bin/ray']
    command: ['start', '--node-ip-address', 'ray-worker', '--redis-address', 'ray-head:6379', '--object-manager-port', '12345', '--node-manager-port', '12346', '--block']
    ports:
      - target: 12345
        published: 12345
        protocol: tcp
        mode: host
      - target: 12346
        published: 12346
        protocol: tcp
        mode: host
    depends_on:
      - "ray-head"
    deploy:
      mode: global
      placement:
        constraints: [node.labels.Head != true]
Am I doing something wrong? Has anyone gotten this to work in swarm mode?
EDIT 2019-04-14
Log of head:
[root@ray-node-1 bd-migratie-core]# docker service logs qaudt0j3clfv
2019-04-14 17:49:34,187 INFO scripts.py:288 -- Using IP address 10.0.30.2 for this node.
2019-04-14 17:49:34,190 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-14_17-49-34_1/logs.
2019-04-14 17:49:34,323 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6379 to respond...
2019-04-14 17:49:34,529 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6380 to respond...
2019-04-14 17:49:34,538 INFO services.py:760 -- Starting Redis shard with 0.74 GB max memory.
2019-04-14 17:49:34,704 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6381 to respond...
2019-04-14 17:49:34,714 INFO services.py:760 -- Starting Redis shard with 0.74 GB max memory.
2019-04-14 17:49:34,859 WARNING services.py:1261 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2019-04-14 17:49:34,862 INFO services.py:1384 -- Starting the Plasma object store with 1.11 GB memory using /tmp.
2019-04-14 17:49:34,997 INFO scripts.py:319 --
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address 10.0.30.2:6379

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(redis_address="10.0.30.2:6379")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop
ps aux inside head container:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.2 1.9 289800 70860 ? Ss 17:49 0:01 /usr/local/bin/python /usr/local/bin/ray start --head --redis-port 6379 --redis-shard-ports 6380,6381 --object-manager-port 12345 --node-manager-port 12346 --node-ip-addres
root 9 0.9 1.4 182352 50920 ? Rl 17:49 0:05 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
root 14 0.8 1.3 182352 48828 ? Rl 17:49 0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6380
root 18 0.5 1.4 188496 52320 ? Sl 17:49 0:03 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6381
root 22 3.1 1.9 283144 70132 ? S 17:49 0:17 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/monitor.py --redis-address=10.0.30.2:6379
root 23 0.7 0.0 15736 1852 ? S 17:49 0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet_monitor 10.0.30.2 6379
root 25 0.0 0.0 1098804 1528 ? S 17:49 0:00 /usr/local/lib/python2.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2019-04-14_17-49-34_1/sockets/plasma_store -m 1111605657 -d /tmp
root 26 0.5 0.0 32944 2524 ? Sl 17:49 0:03 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-04-14_17-49-34_1/sockets/raylet /tmp/ray/session_2019-04-14_17-49-34_1/sockets/p
root 27 1.1 0.9 246340 35192 ? S 17:49 0:06 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/log_monitor.py --redis-address=10.0.30.2:6379 --logs-dir=/tmp/ray/session_2019-04-14_17-49-34_1/logs
root 31 2.7 0.9 385800 35368 ? Sl 17:49 0:15 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.2 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root 32 2.7 0.9 385800 35364 ? Sl 17:49 0:15 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.2 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root 48 2.2 0.0 19944 2232 pts/0 Ss 17:59 0:00 bash
root 53 0.0 0.0 38376 1644 pts/0 R+ 17:59 0:00 ps aux
Log of worker:
2019-04-14 17:49:35,716 INFO services.py:363 -- Waiting for redis server at 10.0.30.2:6379 to respond...
2019-04-14 17:49:35,733 INFO scripts.py:363 -- Using IP address 10.0.30.5 for this node.
2019-04-14 17:49:35,748 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-14_17-49-35_1/logs.
2019-04-14 17:49:35,794 WARNING services.py:1261 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2019-04-14 17:49:35,796 INFO services.py:1384 -- Starting the Plasma object store with 1.11 GB memory using /tmp.
2019-04-14 17:49:35,894 INFO scripts.py:371 --
Started Ray on this node. If you wish to terminate the processes that have been started, run

    ray stop
ps aux of worker:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.1 1.9 292524 70900 ? Ss 17:49 0:01 /usr/local/bin/python /usr/local/bin/ray start --node-ip-address ray-worker --redis-address ray-head:6379 --object-manager-port 12345 --node-manager-port 12346 --block
root 10 0.0 0.0 1098804 1532 ? S 17:49 0:00 /usr/local/lib/python2.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2019-04-14_17-49-35_1/sockets/plasma_store -m 1111605657 -d /tmp
root 11 0.5 0.0 32944 2520 ? Sl 17:49 0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-04-14_17-49-35_1/sockets/raylet /tmp/ray/session_2019-04-14_17-49-35_1/sockets/p
root 12 0.8 0.9 246320 35192 ? S 17:49 0:06 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/log_monitor.py --redis-address=10.0.30.2:6379 --logs-dir=/tmp/ray/session_2019-04-14_17-49-35_1/logs
root 15 2.7 0.9 385800 35368 ? Sl 17:49 0:19 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.5 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root 16 2.7 0.9 385800 35360 ? Sl 17:49 0:19 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.5 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root 39 4.5 0.0 19944 2236 pts/0 Ss 18:01 0:00 bash
root 44 0.0 0.0 38376 1648 pts/0 R+ 18:01 0:00 ps aux
EDIT 2019-04-17
I now know why it doesn't work, but not how to fix it.
If I log into the head container and check the IP the ray processes are running on:
ray/monitor.py --redis-address=10.0.30.5:6379
This matches
/# ping ray-head
PING ray-head (10.0.30.5) 56(84) bytes of data.
64 bytes from 10.0.30.5 (10.0.30.5): icmp_seq=1 ttl=64 time=0.105 ms
But it doesn't match
/# hostname -i
10.0.30.6
If I start the ray processes with --redis-address=10.0.30.6:6379 instead, it works.
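The mismatch can be checked directly by resolving the names involved. A minimal sketch (the `resolve_ipv4` helper is illustrative; the example addresses in the comments are the ones from this cluster):

```python
import socket

def resolve_ipv4(name):
    """Return the set of IPv4 addresses a DNS name resolves to."""
    try:
        return {info[4][0]
                for info in socket.getaddrinfo(name, None, socket.AF_INET)}
    except socket.gaierror:
        return set()

# Inside the head container one would compare, e.g.:
#   resolve_ipv4('ray-head')            # what the workers dial, e.g. {'10.0.30.5'}
#   resolve_ipv4(socket.gethostname())  # the container's own IP, e.g. {'10.0.30.6'}
# When these two differ, Ray has registered under an address other than
# the one that the service name resolves to, and drivers cannot connect.
```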
Upvotes: 2
Views: 3711
I found out how to fix it:
Inside the overlay network, the name that resolves to the ray-head container's own IP is NOT 'ray-head' (the service name, which resolves to the service's virtual IP), but 'tasks.ray-head'.
To make it work I needed to change the hostnames inside the docker-compose file like this:
For ray-head:
command: ['start', '--head', '--redis-port', '6379', '--redis-shard-ports','6380,6381', '--object-manager-port','12345', '--node-manager-port','12346', '--node-ip-address', 'tasks.ray-head', '--block']
For ray-worker:
command: ['start', '--redis-address', 'tasks.ray-head:6379', '--object-manager-port', '12345', '--node-manager-port', '12346', '--block']
Now I can run this on any host:
ray.init(redis_address='tasks.ray-head:6379')
I hope this helps someone else in the same situation.
Upvotes: 1