bucky

Reputation: 392

Groupby multiple columns and aggregation with dask

My dask dataframe looks like this:

A     B     C     D
1     foo   xx    this
1     foo   xx    belongs
1     foo   xx    together
4     bar   xx    blubb

I want to group by columns A, B, C and join the strings from D with a space between them, to get

A     B     C     D
1     foo   xx    this belongs together
4     bar   xx    blubb

I see how to do this with pandas:

df_grouped = df.groupby(['A','B','C'])['D'].agg(' '.join).reset_index()

How can this be achieved with dask?

Upvotes: 4

Views: 4963

Answers (2)

KRKirov

Reputation: 4004

# meta tells dask the name and dtype of the result, which avoids the
# "meta is not specified" warning from groupby.apply
ddf = ddf.groupby(['A', 'B', 'C'])['D'].apply(' '.join, meta=('D', 'object')).reset_index()
ddf.compute()

Output:

Out[75]: 
   A    B   C                      D
0  1  foo  xx  this belongs together
0  4  bar  xx                  blubb

Upvotes: 3

MRocklin

Reputation: 57251

You could use a custom Aggregation (dask.dataframe.Aggregation), where both the per-chunk and the cross-chunk aggregation operations are your ' '.join method.

https://docs.dask.org/en/latest/dataframe-api.html#custom-aggregation
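
For reference, a minimal sketch of that approach using dask.dataframe.Aggregation; the sample data and the aggregation name join_strings are assumptions made for illustration:

import pandas as pd
import dask.dataframe as dd

# Sample data from the question
pdf = pd.DataFrame({
    'A': [1, 1, 1, 4],
    'B': ['foo', 'foo', 'foo', 'bar'],
    'C': ['xx', 'xx', 'xx', 'xx'],
    'D': ['this', 'belongs', 'together', 'blubb'],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# chunk joins the strings within each partition;
# agg joins the partial results across partitions
join_strings = dd.Aggregation(
    name='join_strings',
    chunk=lambda s: s.apply(' '.join),
    agg=lambda s: s.apply(' '.join),
)

result = ddf.groupby(['A', 'B', 'C'])['D'].agg(join_strings).reset_index()
print(result.compute())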

Upvotes: 1
