Gowtam Chandrahasa

Reputation: 21

Unable to Connect to Ray Cluster with local provider from machines other than the one the cluster was started from


We are using the cluster launcher to start Ray on a 3-machine cluster.

The machine I run ray up from can connect to the cluster via ray attach and submit jobs with ray submit. But when we try to connect from other machines using the same cluster_launcher.yaml, we see errors.
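
For context, this is roughly how we start the cluster and how the other machines try to connect (the job script name is just a placeholder):

    # On the machine the cluster was launched from (this works):
    ray up cluster_launcher.yaml
    ray attach cluster_launcher.yaml
    ray submit cluster_launcher.yaml my_job.py   # placeholder script name

    # On any other machine that can SSH into the nodes (this now fails):
    ray attach cluster_launcher.yaml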

I checked, and ray attach basically tries to read /tmp/cluster-default.state, which is supposed to list the non-terminated nodes of the Ray cluster. On the machine where ray up was run, this file has the correct data. On the other machines I'm trying to connect from, if the file doesn't exist, running ray attach creates it with all nodes marked as terminated; if it already exists, it keeps using it as is. When I manually edited cluster-default.state to mark all the nodes as running, ray attach started working on the new machine.
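
For reference, this is how I've been inspecting that file; the JSON shape in the comment is just what I see with the local node provider on our Ray version, so treat it as illustrative:

    # Inspect the local provider's cluster state
    cat /tmp/cluster-default.state

    # Roughly what it contains (illustrative; exact keys depend on the Ray version):
    # {"10.xyz.abc.1": {"tags": {...}, "state": "running"},
    #  "10.xyz.abc.2": {"tags": {...}, "state": "terminated"}, ...}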

But that is a bad hack, and we never had this problem before: we used to be able to connect to this Ray cluster with ray attach from any machine that could SSH into the cluster's nodes. I'm not sure why this error started appearing now.



Error:

Traceback (most recent call last):
  File "/airflow/airflowenv/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/airflow/airflowenv/lib64/python3.6/site-packages/ray/scripts/scripts.py", line 1923, in main
    return cli()
  File "/airflow/airflowenv/lib64/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/airflow/airflowenv/lib64/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/airflow/airflowenv/lib64/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/airflow/airflowenv/lib64/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/airflow/airflowenv/lib64/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/airflow/airflowenv/lib64/python3.6/site-packages/ray/scripts/scripts.py", line 1078, in attach
    port_forward=port_forward)
  File "/airflow/airflowenv/lib64/python3.6/site-packages/ray/autoscaler/_private/commands.py", line 901, in attach_cluster
    port_forward=port_forward,
  File "/airflow/airflowenv/lib64/python3.6/site-packages/ray/autoscaler/_private/commands.py", line 945, in exec_cluster
    config, config_file, override_cluster_name, create_if_needed=start)
  File "/airflow/airflowenv/lib64/python3.6/site-packages/ray/autoscaler/_private/commands.py", line 1219, in _get_running_head_node
    config["cluster_name"]))
RuntimeError: Head node of cluster (default) not found!

cluster_launcher.yaml

# A unique identifier for the head node and workers of this cluster.
cluster_name: default
## NOTE: Typically for local clusters, min_workers == max_workers == len(worker_ips).
# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 2
# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 2
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes, then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
# upscaling_speed: 1.0
idle_timeout_minutes: 5
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "rayproject/ray@sha256:c3b15b82825d978fd068a1619e486020c7211545c80666804b08a95ef7665371" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: []  # Extra options to pass into "docker run"
# Local specific configuration.
provider:
    type: local
    head_ip: 10.xyz.abc.1
    worker_ips: [10.xyz.abc.2, 10.xyz.abc.3]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"
# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: myntra
    # Optional if an ssh private key is necessary to ssh to the cluster.
    ssh_private_key: ~/.ssh/id_rsa
# Leave this empty.
head_node: {}
# Leave this empty.
worker_nodes: {}
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
   "/home/ray/cdcs_and_competitor_brands_jobs": "/home/myntra/airflow/dags/cdcs_and_competitor_brands_jobs"
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
# cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: True
# Patterns for files to exclude when running rsync up or rsync down
# rsync_exclude:
#     - "**/.git"
#     - "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
# rsync_filter:
#     - ".gitignore"
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# List of shell commands to run to set up each node.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # Uncomment the following line if you want to run the nightly version of ray (as opposed to the latest)
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-1.1.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
# Custom commands that will be run on the head node after common setup.
# Later, use custom Docker images with these dependencies pre-installed and pull them directly
head_setup_commands:
  - sudo apt update && sudo apt -y upgrade
  - sudo apt -y install curl
  - curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
  - sudo su -c 'curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list > /etc/apt/sources.list.d/mssql-release.list'
  - sudo apt-get -y update
  - sudo apt-get -y install vim
  - export DEBIAN_FRONTEND=noninteractive
  - sudo su -c 'echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections'
  - sudo ACCEPT_EULA=Y apt-get -y install msodbcsql17 mssql-tools
  - echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bash_profile
  - echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
  - source ~/.bashrc
  - sudo apt-get -y install unixodbc-dev unixodbc libpq-dev
  - conda install -y pyodbc=4.0.30
  - pip install azure-storage-blob==12.1.0
  - pip install azure-core==1.2.2
  - pip install statsmodels
  - pip install pmdarima
  - pip install python-dateutil
  - conda install -y -c conda-forge pystan
  - conda install -y -c conda-forge fbprophet
  - pip uninstall -y pandas
  - conda install -y pandas==1.3.1
  - sudo chown ray /home/ray/ray_bootstrap_key.pem
  - sudo chown ray /home/ray/ray_bootstrap_config.yaml
  - sudo touch /tmp/cluster-default.state
  - sudo chown ray:users /tmp/cluster-default.state
# Custom commands that will be run on worker nodes after common setup.
# Later, use custom Docker images with these dependencies pre-installed and pull them directly
worker_setup_commands:
  - sudo apt update && sudo apt -y upgrade
  - sudo apt -y install curl
  - curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
  - sudo su -c 'curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list > /etc/apt/sources.list.d/mssql-release.list'
  - sudo apt-get -y update
  - sudo apt-get -y install vim
  - export DEBIAN_FRONTEND=noninteractive
  - sudo su -c 'echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections'
  - sudo ACCEPT_EULA=Y apt-get -y install msodbcsql17 mssql-tools
  - echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bash_profile
  - echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
  - source ~/.bashrc
  - sudo apt-get -y install unixodbc-dev unixodbc libpq-dev
  - conda install -y pyodbc=4.0.30
  - pip install azure-storage-blob==12.1.0
  - pip install azure-core==1.2.2
  - pip install statsmodels
  - pip install pmdarima
  - pip install python-dateutil
  - conda install -y -c conda-forge pystan
  - conda install -y -c conda-forge fbprophet
  - pip uninstall -y pandas
  - conda install -y pandas==1.3.1
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host="0.0.0.0" --dashboard-port=8896
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379

Upvotes: 2

Views: 3551

Answers (1)

Flavio

Reputation: 131

I had the same problem a while ago: an out-of-memory problem on the head node destroyed the configuration. This was because the /tmp/ folder on my machine has limited space (it's on an SSD), so I moved Ray's temporary folder to a path on the regular disk using the --temp-dir flag.

I followed these steps:

  1. Stop Ray on all nodes.
  2. Delete all Ray temp/configuration files in /tmp/.
  3. Restart the head node with the .yaml file, passing the --temp-dir flag.
  4. Reconnect all the worker nodes to the head.

After deleting temporary files and using a different temporary folder, I had no more problems.
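
In case it's useful, the commands looked roughly like this; the temp path is just an example, and with the cluster launcher the --temp-dir flag goes into head_start_ray_commands in the .yaml:

    # 1. Stop Ray on every node
    ray stop

    # 2. Clear the old temp/state files on every node (example paths;
    #    /tmp/ray is Ray's default temp dir, the .state file is the
    #    local provider's cluster state)
    rm -rf /tmp/ray /tmp/cluster-default.state

    # 3. In head_start_ray_commands, add --temp-dir to the ray start line, e.g.:
    #    ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir=/data/ray_tmp
    # then relaunch from the machine you run the launcher on:
    ray up cluster_launcher.yaml

    # 4. Workers reconnect via worker_start_ray_commands
    #    (ray start --address=$RAY_HEAD_IP:6379)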

I hope this helps you somehow.

Upvotes: 1
