emmanuelsa

Reputation: 687

Pool in a Ray cluster is sending the same number of jobs to different nodes even though the nodes have different sizes/different numbers of CPUs

I am using Pool in a Ray cluster. I want to be able to scale the number of jobs sent to different nodes proportionately to the compute capability (e.g., the number of CPUs) that each node has. Unfortunately, the Ray cluster pool I set up is sending the same number of jobs to different nodes even though the nodes have different sizes/different numbers of CPUs.

For example, I have four nodes in the cluster

Node 1: 64 CPUs [Head]
Node 2: 64 CPUs
Node 3: 48 CPUs
Node 4: 16 CPUs

While initializing the cluster/adding each of the nodes, I specified to Ray the number of CPUs that each node has. However, if I try to run highly parallelizable/completely independent processes, say 192 of them, Ray sends 192/4 = 48 to each node. This is not desirable because the processes on Node 4 finish last (taking about 3 times the duration the other nodes take), while the processes on Nodes 1 to 3 finish almost at the same time (roughly three times faster than Node 4). So, the entire job is slowed down by Node 4.

What I intend to achieve is (64:64:48:16) as follows.

Node 1: 64 processes
Node 2: 64 processes
Node 3: 48 processes
Node 4: 16 processes

What Ray is currently doing is (48:48:48:48) as follows.

Node 1: 48 processes
Node 2: 48 processes
Node 3: 48 processes
Node 4: 48 processes

I would appreciate any suggestions that can help me achieve something close to 64:64:48:16 (rather than 48:48:48:48).

The following is a sample code I am using to test/debug the cluster.

import ray
from ray.util.multiprocessing import Pool

ray.init(address='auto')

pool_1 = Pool()

# A slow enough CPU-intensive function just to test the idea
def fibonacci(n):
    if n <= 0:
        raise ValueError("n must be a positive integer")
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

results = pool_1.map(fibonacci, [38]*192)

Upvotes: 3

Views: 1235

Answers (1)

Alex

Reputation: 1448

You can use Pool(ray_remote_args={"num_cpus": 1}) to make each worker require a CPU. Then Ray won't be able to place more than 16 workers on the 16-CPU node, so the remaining workers will be forced onto the other nodes.
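A minimal sketch of how that could look, adapted from the question's test script. The `ray_remote_args` keyword of `ray.util.multiprocessing.Pool` passes resource requirements through to the worker actors; the `make_pool` helper name and the guarded import are my own additions so the sketch stays importable without a running cluster:

```python
def fibonacci(n):
    # Same CPU-intensive test function as in the question, with the
    # invalid-input branch raising instead of silently returning None.
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if n <= 2:
        return n - 1
    return fibonacci(n - 1) + fibonacci(n - 2)


def make_pool():
    # Requires a running Ray cluster; imported lazily so this sketch
    # can be loaded on a machine without ray installed.
    import ray
    from ray.util.multiprocessing import Pool

    ray.init(address="auto")
    # Each worker actor now reserves 1 CPU, so a 16-CPU node can host
    # at most 16 workers; the rest are scheduled onto the larger nodes,
    # giving a split close to 64:64:48:16.
    return Pool(ray_remote_args={"num_cpus": 1})


# Usage on the cluster (not executed here):
#   pool = make_pool()
#   results = pool.map(fibonacci, [38] * 192)
```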

Upvotes: 1
