user670186

Reputation: 2840

dask dataframe optimal partition size for 70GB data join operations

I have a dask dataframe of around 70GB and 3 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM, running a local Dask cluster.

I have to take each of the 3 columns and join them to another even larger dataframe.

The documentation recommends partition sizes of around 100MB. However, given this amount of data, joining 700 partitions seems to be a lot more work than, for example, joining 70 partitions of 1000MB each.

Is there a reason to keep it at 700 x 100MB partitions? If not, which partition size should be used here? Does this also depend on the number of workers I use?
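For reference, this is roughly how I set the partition size at the moment (a minimal sketch; the file path is a placeholder for my actual data):

```python
import dask.dataframe as dd

# Read the ~70GB dataset; blocksize controls the on-disk chunk size,
# which roughly determines the in-memory partition size (~700 partitions).
df = dd.read_csv("big_table/*.csv", blocksize="100MB")

# Alternatively, coalesce into fewer, larger partitions before joining
# (~70 partitions of ~1000MB each).
df = df.repartition(partition_size="1000MB")
```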

Upvotes: 2

Views: 1616

Answers (1)

MRocklin

Reputation: 57261

Optimal partition size depends on many different things, including available RAM, the number of threads you're using, how large your dataset is, and in many cases the computation that you're doing.

For example, in the case of your join/merge code it could be that your data is highly repetitive, so your 100MB partitions may quickly expand 100x into 10GB partitions and fill up memory. Or they might not; it depends on your data. On the other hand, join/merge code does produce n*log(n) tasks, so reducing the number of tasks (and so increasing partition size) can be highly advantageous.
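As a rough sketch of that trade-off (the paths and the `key` column are assumptions, not from your question), repartitioning into larger chunks before the merge reduces the task count at the cost of holding more data per task:

```python
import dask.dataframe as dd

# Hypothetical setup: `left` is the ~70GB frame, `right` the even larger one.
left = dd.read_parquet("left/")
right = dd.read_parquet("right/")

# Fewer, larger partitions -> fewer merge tasks, but each task holds more
# data in memory, and may expand further if the join keys are repetitive.
left = left.repartition(partition_size="1000MB")

joined = left.merge(right, on="key", how="inner")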

Determining optimal partition size is challenging. Generally the best we can do is to provide insight into what is going on; Dask's diagnostic dashboard is the main tool for that.
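For example, running against a local distributed scheduler exposes the dashboard, where you can watch task progress and per-worker memory while the join runs (the worker counts and memory limits below are only illustrative for an 8-core/64GB machine, not a recommendation):

```python
from dask.distributed import Client, LocalCluster

# E.g. 4 worker processes with 2 threads each and a per-worker memory cap.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="14GB")
client = Client(cluster)

print(client.dashboard_link)  # open this URL to watch memory use and task progress
```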

Upvotes: 2
