Labo

Reputation: 2772

Dask: many small workers vs a big worker

I am trying to understand this simple example from the dask-jobqueue documentation:

from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     walltime='02:00:00')

cluster.start_workers(100)  # Start 100 jobs that match the description above

from dask.distributed import Client
client = Client(cluster)    # Connect to that cluster

I think it means that there will be 100 jobs each using 36 cores.

Let's say I can use 48 cores on a cluster.

Should I use 1 worker with 48 cores or 48 workers of 1 core each?

Upvotes: 1

Views: 609

Answers (1)

MRocklin

Reputation: 57261

If your computations mostly release the GIL, then you'll probably want several threads per process. This is true if you're doing mostly NumPy, Pandas, Scikit-Learn, or Numba/Cython programming on numeric data. I might do something like six processes with eight threads each.
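A minimal sketch of that layout with dask-jobqueue: the `processes` keyword splits each job into that many worker processes, and the threads are divided evenly among them. The memory, queue, and walltime values are just reused from the question as placeholders.

from dask_jobqueue import PBSCluster

# 6 worker processes per 48-core job, so 48 / 6 = 8 threads each;
# suits GIL-releasing workloads (NumPy, Pandas, etc.)
cluster = PBSCluster(cores=48,
                     processes=6,
                     memory='100GB',       # placeholder values from the question
                     queue='premium',
                     walltime='02:00:00')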

If your computations are mostly pure Python code, for example if you process text data or iterate heavily with Python for loops over dicts/lists/etc., then you'll want fewer threads per process, maybe two.
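For the pure-Python case, the same sketch with more, smaller processes (again with hypothetical resource values):

# 24 worker processes per 48-core job, so 48 / 24 = 2 threads each;
# suits pure-Python workloads that hold the GIL
cluster = PBSCluster(cores=48,
                     processes=24,
                     memory='100GB',
                     queue='premium',
                     walltime='02:00:00')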

Upvotes: 3
