nz_21
nz_21

Reputation: 7403

How does dask achieve parallelism?

I don't quite understand dask's parallelism model (https://docs.dask.org/en/latest/delayed-best-practices.html)

Given that python is single-threaded, what performance benefit can delayed actually offer? My understanding is it infers independent processes/functions as parts of a graph and then executes them in "parallel", but how is that possible?

I see how they might be "concurrent" processes, but even so - given that the function is sync, how can it perform any concurrent processes?

Upvotes: 0

Views: 54

Answers (1)

mdurant
mdurant

Reputation: 28684

Simple: python is not "single-threaded", it can run many threads simultaneously. You are maybe thinking of the global interpreter lock (GIL), which makes the interpreter run exactly one operation at a time from one of the threads. Many libraries do not need to hold the GIL, however, so thread-based parallelism is real and useful in many cases. This will generally be true for numerical libraries (pandas...) and other things that do most of their work in compiled C/C++ code.

In addition, Dask supports process-based parallelism, that bypasses the GIL issue, but at the cost of communication and memory overhead. Whether this is better or worse for you will depend on your workload.

Finally, the distributed scheduler is ideal even on a single machine, because it enables you to choose the threads/processes mix that is right for whatever you are doing.

Upvotes: 2

Related Questions