Reputation: 189
I have a dask.dataframe
df2 = dd.read_csv(path, dtype=dtypes, sep=',', error_bad_lines=False)
which is split into 220 partitions by dask itself:
print(df2.npartitions)
>>220
I'd like to use groupby twice and save two dataframes to files:
coccurrence_df = df2.groupby(['h1_h2', 'hashtag1','hashtag2','user_id']).count().reset_index()\
.groupby(['h1_h2', 'hashtag1','hashtag2']).message_id.count().reset_index()\
.rename(columns={"message_id":"coccurrence"})
strong_edges_df = coccurrence_df[coccurrence_df['coccurrence']>1].to_csv(path1, compute=False)
weak_edges_df = coccurrence_df[coccurrence_df['coccurrence']==1].to_csv(path2, compute=False)
dask.compute(strong_edges_df, weak_edges_df)
Why is coccurrence_df split into 1 partition when the dataframe it is created from is split into 220 partitions?
print(coccurrence_df.npartitions)
>>1
I believe I'm losing parallelism because of this; am I right? Thank you in advance.
Upvotes: 0
Views: 205
Reputation: 57251
Groupby aggregations do parallel computation but result in a single-partition output. If you have many groups and want a multi-partition output, consider using the split_out= parameter of the groupby aggregation.
I don't recommend doing this, though, if things work OK. I recommend just using the defaults until something is obviously performing poorly.
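For illustration, a minimal sketch of how split_out= might be added to the aggregations from your question (the value 8 is an arbitrary choice, and path/dtypes are the variables from your snippet; adjust to your data and dask version):
import dask.dataframe as dd

df2 = dd.read_csv(path, dtype=dtypes, sep=',', error_bad_lines=False)

# split_out spreads each groupby aggregation's result over several
# output partitions instead of the default single partition.
coccurrence_df = (
    df2.groupby(['h1_h2', 'hashtag1', 'hashtag2', 'user_id'])
       .count(split_out=8)
       .reset_index()
       .groupby(['h1_h2', 'hashtag1', 'hashtag2'])
       .message_id.count(split_out=8)
       .reset_index()
       .rename(columns={"message_id": "coccurrence"})
)

print(coccurrence_df.npartitions)  # 8 instead of 1
The rest of your code (the filtering, to_csv(..., compute=False), and dask.compute) stays the same; the downstream operations then work on 8 partitions rather than 1.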
Upvotes: 2