Bill Huang

Reputation: 4648

Dask DataFrame groupby-size does not fit into memory

How to accomplish such a groupby-size task on a resource-limited machine?

My code looks like this:

import dask.dataframe as dd

ddf = dd.read_parquet(parquet_path)
sr = ddf.groupby(["col_1", "col_2"]).size()
sr.to_csv(csv_path)

My data:

The code worked correctly on a small sample but failed on a large one. I don't know what options I have for accomplishing such a task on an ordinary Win10 laptop with only 12 GB of memory installed.

Upvotes: 0

Views: 397

Answers (1)

SultanOrazbayev

Reputation: 16551

The number of unique combinations of col_1 and col_2 should fit into memory; ideally, it should be a small fraction of the available worker memory. If that is true for your data, you could try specifying the split_every option (see the docs):

sr = ddf.groupby(["col_1", "col_2"]).size(split_every=2)

On a local machine, check that each worker has enough memory; with 12 GB of memory, I would probably restrict it to 2 workers at most.
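
Below is a minimal sketch of how this could be wired together with dask.distributed, assuming a LocalCluster; the 4 GB per-worker memory limit is a placeholder to adjust for your data and machine, and parquet_path / csv_path are the same names as in your question:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Keep the cluster small on a 12 GB laptop: two single-threaded workers
# with an explicit per-worker memory cap, so Dask spills to disk before
# the OS starts paging.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

ddf = dd.read_parquet(parquet_path)  # parquet_path as in the question

# split_every=2 makes the reduction combine partial groupby results
# pairwise, so only a few intermediates are held in memory at a time.
sr = ddf.groupby(["col_1", "col_2"]).size(split_every=2)

sr.to_csv(csv_path)  # csv_path as in the question

The tree reduction trades a deeper task graph for a smaller memory footprint, which is usually the right trade-off on a resource-limited machine.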

Also, you might find this answer to a related question helpful.

Upvotes: 1
