Reputation: 11
I'm trying to set up a Ray cluster for parallel processing. I have 3 on-premise machines, each with 12 CPUs, and each actor is assigned 1 CPU. I'm deploying the head manually with:
ray start --head --port=... --redis-shard-ports=... --node-manager-port=... --object-manager-port=... --min-worker-port=... --max-worker-port=... --ray-client-server-port=... --gcs-server-port=... --num-cpus=12
and each worker with:
ray start --address='<HEAD_IP>' --redis-password='...' --node-manager-port=... --object-manager-port=... --min-worker-port=... --max-worker-port=... --dashboard-port=... --gcs-server-port=... --num-cpus=12
Each worker uses a hefty amount of memory. The issue is that Ray keeps assigning workers to the head node until it runs out of memory and crashes, while the other worker nodes sit idle.
Upvotes: 1
Views: 1005
Reputation: 1448
It would help to know what workload you're trying to run, but in general you can encourage tasks to be spread out more evenly via a scheduling strategy.
import time
import ray

ray.init(address="auto")  # connect to the existing cluster

@ray.remote(scheduling_strategy="SPREAD")
def my_task():
    print("Running")
    time.sleep(10)

ray.get([my_task.remote() for _ in range(6)])
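Since your workload uses actors with 1 CPU each, the same option can be passed on an actor class. Here's a minimal sketch, assuming a recent Ray version that supports scheduling_strategy on actors; the Worker class and its run method are just illustrative:

import socket
import ray

ray.init(address="auto")

# SPREAD works on actor classes the same way it does on tasks.
@ray.remote(num_cpus=1, scheduling_strategy="SPREAD")
class Worker:
    def run(self):
        # return the hostname so you can see which node each actor landed on
        return socket.gethostname()

actors = [Worker.remote() for _ in range(6)]
print(ray.get([a.run.remote() for a in actors]))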
Note that spread scheduling has its tradeoffs. In some cases it may be good to spread your tasks to avoid noisy neighbors or to increase fault tolerance. This comes at the cost that you're more likely to have to transfer data between machines if other tasks depend on the output of these tasks.
Upvotes: 2