Reputation: 682
I have a function f that takes as input a variable x, which is a large np.ndarray (length 20000). A single execution of f is very fast (about 5 ms).
A for loop over a matrix M with many rows,

    for x in M:
        f(x)

takes about 5 times longer than parallelizing with multiprocessing:

    import multiprocessing

    with multiprocessing.Pool() as pool:
        pool.map(f, M)
I have tried to parallelize with dask, but it loses even against sequential execution. A related post is here, but the accepted answer doesn't work for me. I have tried many things, such as partitioning the data as the best practices suggest, or using dask.bag (something along the lines of the sketch below). I'm running Dask on a local machine with 4 physical cores.
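For reference, a dask.bag version of the loop above would look roughly like the following (the partition count and scheduler here are illustrative, not necessarily exactly what I ran):

    # Illustrative only: roughly the shape of the dask.bag attempt described above.
    import dask.bag as db

    bag = db.from_sequence(list(M), npartitions=4)       # one partition per core
    results = bag.map(f).compute(scheduler="processes")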
So the question is: how do I use dask with short tasks that take large data as input?
Upvotes: 0
Views: 164
Reputation: 28673
Firstly, the dask documentation makes clear some contraindications: every task carries scheduling overhead (somewhere on the order of a fraction of a millisecond per task), so a workload made of many very short tasks is exactly the kind of thing Dask does not speed up well.
Since we don't know much about what you are doing or about your system, here is a guess at why dask is slower than multiprocessing for you. When you use multiprocessing.Pool, the system probably created the worker processes via fork, and the array was copied (or copy-on-write duplicated) into each process, so the workers could access it directly. Dask requires threads and event loops to run, so it is not safe to use with fork. This means that when you want data living in the client to be processed in a worker, it must be serialised and sent over IPC. That is very likely the cause of your slowdown.
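If per-task serialisation is indeed the bottleneck, one common mitigation, shown here only as a sketch (it assumes dask.distributed, and f and M are placeholders for your function and matrix), is to batch many rows into each task so the scheduling and IPC costs are paid once per chunk rather than once per row:

    # Sketch only: batching rows to amortise per-task overhead and IPC cost.
    import numpy as np
    from dask.distributed import Client

    def f(x):
        return x.sum()  # stand-in for the real ~5 ms function

    def f_batch(rows):
        # Many rows per task: scheduling and serialisation happen per chunk.
        return [f(x) for x in rows]

    if __name__ == "__main__":
        M = np.random.rand(1000, 20000)

        with Client(n_workers=4, threads_per_worker=1) as client:
            chunks = np.array_split(M, 4)            # one chunk per worker
            futures = client.map(f_batch, chunks)    # each chunk is serialised once
            results = [r for batch in client.gather(futures) for r in batch]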
Upvotes: 2