Reputation: 4648
How to accomplish such a groupby-size task on a resource-limited machine?
My code looks like this:
import dask.dataframe as dd
ddf = dd.read_parquet(parquet_path)
sr = ddf.groupby(["col_1", "col_2"]).size()
sr.to_csv(csv_path)
My data:
The dataset was written incrementally with .to_parquet(append=True) on the same machine, so I didn't run into memory issues when generating it. col_1 and col_2 have data type uint64.
The code worked correctly on a small sample but failed on a large one. I don't know what options I have for accomplishing such a task on an ordinary Win10 laptop with only 12 GB of memory installed.
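For reference, the append-based write looked roughly like this (a minimal sketch; chunk_frames is a hypothetical placeholder for the pieces of data I appended, not my actual code):
import dask.dataframe as dd

# chunk_frames: hypothetical iterable of pandas DataFrames produced one at a time
for i, pdf in enumerate(chunk_frames):
    dd.from_pandas(pdf, npartitions=1).to_parquet(
        parquet_path,
        append=(i > 0),  # first call creates the dataset, later calls append to it
    )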
Upvotes: 0
Views: 397
Reputation: 16551
The number of unique combinations of col_1 and col_2 should fit into memory; ideally it should be a small fraction of the available worker memory. If that is true for your data, you could try specifying the split_every option (see the docs):
sr = ddf.groupby(["col_1", "col_2"]).size(split_every=2)
On a local machine, check that each worker has enough memory; with 12 GB of RAM, I would restrict the cluster to at most 2 workers.
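For example, a minimal sketch of such a restricted local cluster (the threads_per_worker and memory_limit values are assumptions you would tune to your machine, and parquet_path/csv_path are the same placeholders as in your code):
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# 2 workers at ~4 GB each leaves headroom on a 12 GB machine
cluster = LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)

ddf = dd.read_parquet(parquet_path)
# split_every=2 limits how many intermediate group counts are combined at once
sr = ddf.groupby(["col_1", "col_2"]).size(split_every=2)
sr.to_csv(csv_path)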
Also, you might find this answer to a related question helpful.
Upvotes: 1