Reputation: 10008
In my application I perform an aggregation on a dask dataframe using groupby, keyed by a certain id.
However, I would like the aggregation to maintain the partition divisions, as I intend to perform joins with other dataframes that are partitioned identically.
import pandas as pd
import numpy as np
import dask.dataframe as dd
df = pd.DataFrame(np.arange(16), columns=['my_data'])
df.index.name = 'my_id'
ddf = dd.from_pandas(df, npartitions=4)
ddf.npartitions
# 4
ddf.divisions
# (0, 4, 8, 12, 15)
aggregated = ddf.groupby('my_id').agg({'my_data': 'count'})
aggregated.divisions
# (None, None)
Is there a way to accomplish that?
Upvotes: 4
Views: 1035
Reputation: 57271
You probably can't maintain the same partitioning, because dask will need to aggregate counts across partitions: your data will necessarily have to move around in ways that depend on its values.
If you're looking to ensure that your output has many partitions, you might choose to use the split_out= keyword to agg.
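For illustration, here is a minimal sketch building on the question's setup; the exact divisions shown are an assumption, since the shuffle behind split_out generally leaves them unknown:
aggregated = ddf.groupby('my_id').agg({'my_data': 'count'}, split_out=4)
aggregated.npartitions
# 4
aggregated.divisions
# (None, None, None, None, None)
If a later join needs known divisions, one option (not part of the original answer) is to re-set the index afterwards, e.g. aggregated.reset_index().set_index('my_id'), which shuffles the data again but yields sorted partitions with known divisions.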
Upvotes: 2