Anders Levin

Reputation: 31

How to use Ray in a Docker swarm

I'm trying to set up a cluster with one ray-head and two ray-workers using Docker Swarm. I have three machines for this: one running the ray-head and two running one ray-worker each. The cluster comes up fine, but whenever I exec into a container and run:

import ray
ray.init(redis_address='ray-head:6379')

I get:

WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?

The logs of the containers look OK.

I have also tried IP addresses, both the machine's IP and the IP of the ray-head container:

ray.init(redis_address='192.168.30.193:6379')

When running:

telnet 192.168.30.193 6379

there is an answer.
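The telnet check only proves that the Redis port accepts TCP connections; it says nothing about whether Ray's processes registered under the address the driver looks up. A minimal Python sketch of the same probe (host and port taken from the question above):

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# e.g. can_connect("192.168.30.193", 6379) mirrors the telnet check above
```

A `True` here while `ray.init` still fails points at an address mismatch rather than a firewall problem.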

Dockerfile for the containers:

FROM python:2.7-slim

RUN apt-get -y update
RUN apt-get install -y --fix-missing \
    libxml2 \
    gcc \
    vim \
    iputils-ping \
    telnet \
    procps \
    && apt-get clean && rm -rf /tmp/* /var/tmp/*

RUN pip install ray

CMD ["echo", "Base Image Ready"]

docker-compose.yml

version: "3.5"

services:
  ray-head:
    image: simpled:0.1
    shm_size: '2gb'
    entrypoint: [ '/usr/local/bin/ray']
    command: ['start', '--head', '--redis-port', '6379', '--redis-shard-ports','6380,6381', '--object-manager-port','12345', '--node-manager-port','12346', '--node-ip-address', 'ray-head', '--block']
    ports:
      - target: 6379
        published: 6379
        protocol: tcp
        mode: host
      - target: 6380
        published: 6380
        protocol: tcp
        mode: host
      - target: 6381
        published: 6381
        protocol: tcp
        mode: host
      - target: 12345
        published: 12345
        protocol: tcp
        mode: host
      - target: 12346
        published: 12346
        protocol: tcp
        mode: host
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.Head == true ]
  ray-worker:
    image: simpled:0.1
    shm_size: '2gb'
    entrypoint: [ '/usr/local/bin/ray']
    command: ['start', '--node-ip-address', 'ray-worker', '--redis-address', 'ray-head:6379', '--object-manager-port', '12345', '--node-manager-port', '12346', '--block']
    ports:
      - target: 12345
        published: 12345
        protocol: tcp
        mode: host
      - target: 12346
        published: 12346
        protocol: tcp
        mode: host
    depends_on:
      - "ray-head"
    deploy:
      mode: global
      placement:
        constraints: [node.labels.Head != true]

Am I doing something wrong? Has anyone gotten this to work in swarm mode?

EDIT 2019-04-14

Log of head:

[root@ray-node-1 bd-migratie-core]# docker service logs qaudt0j3clfv
2019-04-14 17:49:34,187  INFO scripts.py:288 -- Using IP address 10.0.30.2 for this node.
2019-04-14 17:49:34,190  INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-14_17-49-34_1/logs.
2019-04-14 17:49:34,323  INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6379 to respond...
2019-04-14 17:49:34,529  INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6380 to respond...
2019-04-14 17:49:34,538  INFO services.py:760 -- Starting Redis shard with 0.74 GB max memory.
2019-04-14 17:49:34,704  INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6381 to respond...
2019-04-14 17:49:34,714  INFO services.py:760 -- Starting Redis shard with 0.74 GB max memory.
2019-04-14 17:49:34,859  WARNING services.py:1261 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2019-04-14 17:49:34,862  INFO services.py:1384 -- Starting the Plasma object store with 1.11 GB memory using /tmp.
2019-04-14 17:49:34,997  INFO scripts.py:319 --
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address 10.0.30.2:6379

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(redis_address="10.0.30.2:6379")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

ps aux inside the head container:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  1.9 289800 70860 ?        Ss   17:49   0:01 /usr/local/bin/python /usr/local/bin/ray start --head --redis-port 6379 --redis-shard-ports 6380,6381 --object-manager-port 12345 --node-manager-port 12346 --node-ip-addres
root         9  0.9  1.4 182352 50920 ?        Rl   17:49   0:05 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
root        14  0.8  1.3 182352 48828 ?        Rl   17:49   0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6380
root        18  0.5  1.4 188496 52320 ?        Sl   17:49   0:03 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6381
root        22  3.1  1.9 283144 70132 ?        S    17:49   0:17 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/monitor.py --redis-address=10.0.30.2:6379
root        23  0.7  0.0  15736  1852 ?        S    17:49   0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet_monitor 10.0.30.2 6379
root        25  0.0  0.0 1098804 1528 ?        S    17:49   0:00 /usr/local/lib/python2.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2019-04-14_17-49-34_1/sockets/plasma_store -m 1111605657 -d /tmp
root        26  0.5  0.0  32944  2524 ?        Sl   17:49   0:03 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-04-14_17-49-34_1/sockets/raylet /tmp/ray/session_2019-04-14_17-49-34_1/sockets/p
root        27  1.1  0.9 246340 35192 ?        S    17:49   0:06 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/log_monitor.py --redis-address=10.0.30.2:6379 --logs-dir=/tmp/ray/session_2019-04-14_17-49-34_1/logs
root        31  2.7  0.9 385800 35368 ?        Sl   17:49   0:15 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.2 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root        32  2.7  0.9 385800 35364 ?        Sl   17:49   0:15 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.2 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root        48  2.2  0.0  19944  2232 pts/0    Ss   17:59   0:00 bash
root        53  0.0  0.0  38376  1644 pts/0    R+   17:59   0:00 ps aux

Log of worker:

2019-04-14 17:49:35,716        INFO services.py:363 -- Waiting for redis server at 10.0.30.2:6379 to respond...
2019-04-14 17:49:35,733        INFO scripts.py:363 -- Using IP address 10.0.30.5 for this node.
2019-04-14 17:49:35,748        INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-14_17-49-35_1/logs.
2019-04-14 17:49:35,794        WARNING services.py:1261 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2019-04-14 17:49:35,796        INFO services.py:1384 -- Starting the Plasma object store with 1.11 GB memory using /tmp.
2019-04-14 17:49:35,894        INFO scripts.py:371 --
Started Ray on this node. If you wish to terminate the processes that have been started, run

    ray stop

ps aux of worker:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  1.9 292524 70900 ?        Ss   17:49   0:01 /usr/local/bin/python /usr/local/bin/ray start --node-ip-address ray-worker --redis-address ray-head:6379 --object-manager-port 12345 --node-manager-port 12346 --block
root        10  0.0  0.0 1098804 1532 ?        S    17:49   0:00 /usr/local/lib/python2.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2019-04-14_17-49-35_1/sockets/plasma_store -m 1111605657 -d /tmp
root        11  0.5  0.0  32944  2520 ?        Sl   17:49   0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-04-14_17-49-35_1/sockets/raylet /tmp/ray/session_2019-04-14_17-49-35_1/sockets/p
root        12  0.8  0.9 246320 35192 ?        S    17:49   0:06 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/log_monitor.py --redis-address=10.0.30.2:6379 --logs-dir=/tmp/ray/session_2019-04-14_17-49-35_1/logs
root        15  2.7  0.9 385800 35368 ?        Sl   17:49   0:19 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.5 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root        16  2.7  0.9 385800 35360 ?        Sl   17:49   0:19 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.5 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root        39  4.5  0.0  19944  2236 pts/0    Ss   18:01   0:00 bash
root        44  0.0  0.0  38376  1648 pts/0    R+   18:01   0:00 ps aux

EDIT 2019-04-17

I now know why it doesn't work, but not how to fix it.

If I log into the head container and check which IP the Ray processes were started with:

ray/monitor.py --redis-address=10.0.30.5:6379

This matches what the service name resolves to:

/# ping ray-head
PING ray-head (10.0.30.5) 56(84) bytes of data.
64 bytes from 10.0.30.5 (10.0.30.5): icmp_seq=1 ttl=64 time=0.105 ms

But it doesn't match the container's own IP:

/# hostname -i
10.0.30.6

If I start the Ray processes with --redis-address=10.0.30.6:6379 instead, it works.
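The mismatch above can be reproduced with a small resolver check. The commented values below are illustrative, taken from this stack's logs; inside a different deployment the addresses will differ:

```python
import socket

def resolved_ip(name):
    """Resolve a DNS name to an IPv4 address, or None if it does not resolve."""
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        return None

# Inside the head container of this stack (values illustrative):
#   resolved_ip("ray-head")            -> 10.0.30.5  (what the service name gives)
#   resolved_ip(socket.gethostname())  -> 10.0.30.6  (this container's own IP)
# Ray binds to the container's own IP, so a driver dialing the service
# name's address never finds the registered processes.
```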

Upvotes: 2

Views: 3711

Answers (1)

Anders Levin

Reputation: 31

I found out how to fix it:

The hostname that resolves to the ray-head container is NOT 'ray-head', but 'tasks.ray-head'.

To make it work I needed to change the hostnames in the docker-compose file like this:

For ray-head:

command: ['start', '--head', '--redis-port', '6379', '--redis-shard-ports','6380,6381', '--object-manager-port','12345', '--node-manager-port','12346', '--node-ip-address', 'tasks.ray-head', '--block']

For ray-worker:

command: ['start', '--redis-address', 'tasks.ray-head:6379', '--object-manager-port', '12345', '--node-manager-port', '12346', '--block']

Now I can run this on any host:

ray.init(redis_address='tasks.ray-head:6379')

I hope this helps someone else in the same situation.
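This works because Swarm publishes two DNS entries per service: the service name resolves to a virtual IP in front of the tasks, while `tasks.<service>` resolves directly to the task (container) IPs that Ray actually binds to. A sketch for listing those task IPs from inside any container on the overlay network (the service name is this stack's; elsewhere substitute your own):

```python
import socket

def task_ips(service):
    """Return the IPs behind Swarm's 'tasks.<service>' DNS round-robin entry,
    or an empty list if the name does not resolve (e.g. outside the overlay)."""
    try:
        infos = socket.getaddrinfo("tasks." + service, None, socket.AF_INET)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []

# e.g. task_ips("ray-head") inside this stack should list the head task's IP
```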

Upvotes: 1
