Bill Huang

Reputation: 4648

Dask DataFrame groupby-size does not fit into memory

How to accomplish such a groupby-size task on a resource-limited machine?

My code looks like this:

import dask.dataframe as dd

ddf = dd.read_parquet(parquet_path)
sr = ddf.groupby(["col_1", "col_2"]).size()
sr.to_csv(csv_path)

My data:

The code worked correctly on a small sample but failed on a large one. I don't know what options I have for accomplishing such a task on an ordinary Win10 laptop with only 12 GB of memory installed.

Upvotes: 0

Views: 397

Answers (1)

SultanOrazbayev

Reputation: 16551

The number of unique combinations of col_1 and col_2 should fit into memory; ideally, it should be a small fraction of the available worker memory. If that is true for your data, you could try specifying the split_every option (see the docs):

sr = ddf.groupby(["col_1", "col_2"]).size(split_every=2)

On a local machine, check that each worker has enough memory; with 12 GB of memory, I would probably restrict it to 2 workers at most.
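
Below is a minimal sketch of how this could be wired together with dask.distributed, assuming a LocalCluster; the 4 GB per-worker memory limit is a placeholder to adjust for your data and machine, and parquet_path / csv_path are the same names as in your question:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Keep the cluster small on a 12 GB laptop: two single-threaded workers
# with an explicit per-worker memory cap, so Dask spills to disk before
# the OS starts paging.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

ddf = dd.read_parquet(parquet_path)  # parquet_path as in the question

# split_every=2 makes the reduction combine partial groupby results
# pairwise, so only a few intermediates are held in memory at a time.
sr = ddf.groupby(["col_1", "col_2"]).size(split_every=2)

sr.to_csv(csv_path)  # csv_path as in the question

The tree reduction trades a deeper task graph for a smaller memory footprint, which is usually the right trade-off on a resource-limited machine.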

Also, you might find this answer to a related question helpful.

Upvotes: 1
