Andrex

Reputation: 682

Dask parallelize a short task that uses a large np.ndarray

I have a function f that takes as input a variable x, which is a large np.ndarray (length 20000).
A single call to f is very fast (about 5 ms).

A for loop over a matrix M with many rows

for x in M:
    f(x)

takes about 5 times longer than parallelizing using multiprocessing

import multiprocessing

with multiprocessing.Pool() as pool:
    pool.map(f, M)

I have tried to parallelize with Dask, but it loses even against sequential execution. Related post is here, but the accepted answer doesn't work for me. I have tried many things, such as partitioning the data as the best practices recommend, or using dask.bag. I'm running Dask on a local machine with 4 physical cores.

So the question is: how do I use Dask with short tasks that take large data as input?

Upvotes: 0

Views: 164

Answers (1)

mdurant

Reputation: 28673

Firstly, the dask documentation makes clear the following contraindications:

  • it is a bad idea to create big data in the client and pass it to workers; you should have workers load the data they need
  • if the data you need fits into memory, the standard Python tool (in this case NumPy) probably works as well as or better than Dask
  • if you want to share memory and are running code, such as NumPy routines, that releases the GIL, then you should prefer threads over processes
  • the Dask multiprocessing scheduler should generally not be used if you can run the distributed scheduler (i.e., always)
  • don't use a Python loop over an array; you should vectorize
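As a rough illustration of the threads and batching points above, here is a minimal sketch that groups many rows into a few tasks and runs them on Dask's threaded scheduler, so the array is never copied between processes. The function f below is a hypothetical stand-in (the real f is not shown in the question), and the matrix size and batch count are arbitrary. Whether this actually beats the sequential loop depends on how much time f spends in code that releases the GIL.

```python
import numpy as np
import dask

def f(x):
    # Hypothetical stand-in for the asker's function: a cheap NumPy
    # computation on one row of length 20000.
    return float(np.sqrt(x @ x))

def f_batch(rows):
    # Process a whole block of rows in one task so that per-task
    # scheduling overhead is paid once per batch, not once per row.
    return [f(x) for x in rows]

# Assumed sizes, for illustration only.
M = np.random.default_rng(0).random((100, 20_000))
n_batches = 4  # e.g. one batch per physical core

# np.array_split returns views into M, so no data is copied here.
batches = np.array_split(M, n_batches)
tasks = [dask.delayed(f_batch)(b) for b in batches]

# The threaded scheduler shares memory with the client, so nothing
# has to be serialised and sent to another process.
results = dask.compute(*tasks, scheduler="threads")
flat = [r for batch in results for r in batch]
```

With only a handful of coarse tasks, the fixed per-task overhead becomes small relative to the work in each batch, which is the usual reason to batch very short tasks.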

Since we don't know much about what you are doing or about your system, I can only guess at why Dask is slower than multiprocessing. When you use multiprocessing.Pool, the system probably created the processes via fork and copied (or copy-on-write duplicated) the array into each process, so they can all access it. Dask requires threads and event loops to run, so it is not safe to use with fork. This means that when you want data in the client to be processed in a worker, it must be serialised and sent over IPC. This is very likely the cause of your slowdown.
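To get a feel for that cost, the sketch below times a pickle round trip of one row against one call of a cheap function. This is only a proxy: pickle stands in for whatever serialisation Dask actually uses, no real inter-process transfer is included, and f is again a hypothetical placeholder, so the numbers indicate an order of magnitude at best.

```python
import pickle
import time
import numpy as np

def f(x):
    # Hypothetical stand-in for the asker's function.
    return float(x.sum())

# One row of the size mentioned in the question (20000 float64 values).
x = np.random.default_rng(0).random(20_000)

# Time a serialise/deserialise round trip, roughly what has to happen
# for each array sent from the client to a worker process.
t0 = time.perf_counter()
payload = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
x_copy = pickle.loads(payload)
t_roundtrip = time.perf_counter() - t0

# Time the computation itself for comparison.
t0 = time.perf_counter()
f(x)
t_compute = time.perf_counter() - t0

print(f"payload size: {len(payload)} bytes")
print(f"pickle round trip: {t_roundtrip * 1e3:.3f} ms")
print(f"one call to f:     {t_compute * 1e3:.3f} ms")
```

If the round trip is comparable to, or larger than, the time of one call, shipping each row to a worker individually cannot pay off, which is consistent with the advice to let workers load their own data.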

Upvotes: 2
