pygabriel

Reputation: 10008

How to keep partitions after performing a group-by aggregation in dask

In my application I perform an aggregation on a dask dataframe using groupby, keyed by a certain id.

However, I would like the aggregation to maintain the partition divisions, since I intend to join the result with other dataframes that are partitioned identically.

import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame(np.arange(16), columns=['my_data'])
df.index.name = 'my_id'

ddf = dd.from_pandas(df, npartitions=4)
ddf.npartitions
# 4

ddf.divisions
# (0, 4, 8, 12, 15)

aggregated = ddf.groupby('my_id').agg({'my_data': 'count'})
aggregated.divisions
# (None, None)

Is there a way to accomplish this?

Upvotes: 4

Views: 1035

Answers (1)

MRocklin

Reputation: 57271

You probably can't maintain the same partitioning, because dask will need to aggregate counts between partitions. Your data will necessarily have to move around in ways that depend on the values of your data.

If you're looking to ensure that your output has many partitions, then you might use the split_out= keyword to agg.

Upvotes: 2
