How does Dask leverage multithreading in the threaded scheduler?

Question

I am interested in the local threaded scheduler of Dask. This scheduler can load data blocks from a multidimensional array in "parallel" using several threads. I am interested in I/O bound problems so I am not considering compute intensive applications for the moment.

This fact seems verified by some speed tests I did on loading and saving data from a random array using Dask's store method: As the block size augment the performance decreases (supposedly because smallest chunks increase the parallelism). In this experiment I am working with hdf5 files with no physical chunks: 1 dataset containing all the data from the array.

The problem I have is two fold: 1) How can Dask have parallelism in reading data when reading on HDD is sequential? 2) How can Dask have parallelism in reading when the python GIL should prevent the threads from saving data in memory at the same time?

Thank you for your time.

MRocklin · Accepted Answer

How can Dask have parallelism in reading data when reading on HDD is sequential?

You're correct that if you are bound by reading from the hard disk then using multiple threads should not have any performance benefit.

However, it may be that there is work to do here other than reading from the hard disk.

Your data may be compressed in HDF format, requiring some CPU work to decompress it
You may be doing something with your data other than just reading it, and these operations can be interleaved with IO tasks

How can Dask have parallelism in reading when the python GIL should prevent the threads from saving data in memory at the same time?

Python's GIL isn't that much of a problem for numeric workloads, which do most of their computation in linked C/Fortran libraries. In general if you aree using Numpy-like libraries on numeric data then the GIL is unlikely to affect you.

How does Dask leverage multithreading in the threaded scheduler?

Answers (1)

Related Questions