Reputation: 189
I have a dask.dataframe
df2 = dd.read_csv(path, dtype=dtypes, sep=',', error_bad_lines=False)
which is split into 220 partitions by dask itself:
print(df2.npartitions)
>>220
I'd like to use groupby twice and save two dataframes to files:
coccurrence_df = df2.groupby(['h1_h2', 'hashtag1','hashtag2','user_id']).count().reset_index()\
.groupby(['h1_h2', 'hashtag1','hashtag2']).message_id.count().reset_index()\
.rename(columns={"message_id":"coccurrence"})
strong_edges_df = coccurrence_df[coccurrence_df['coccurrence']>1].to_csv(path1, compute=False)
weak_edges_df = coccurrence_df[coccurrence_df['coccurrence']==1].to_csv(path2, compute=False)
dask.compute(strong_edges_df, weak_edges_df)
Why is coccurrence_df split into 1 partition when the dataframe it is created from is split into 220 partitions?
print(coccurrence_df.npartitions)
>>1
I believe I'm losing parallelism because of this; am I right? Thank you in advance.
Upvotes: 0
Views: 205
Reputation: 57251
Groupby aggregations do parallel computation but result in a single-partition output. If you have many groups and want a multi-partition output, consider using the split_out= parameter of the groupby aggregation.
I don't recommend doing this, though, if things work OK. I recommend just using the defaults until something is obviously performing poorly.
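For illustration, a minimal sketch of how split_out= might be added to the aggregations from your question (the value 8 is an arbitrary choice, and path/dtypes are the variables from your snippet; adjust to your data and dask version):
import dask.dataframe as dd

df2 = dd.read_csv(path, dtype=dtypes, sep=',', error_bad_lines=False)

# split_out spreads each groupby aggregation's result over several
# output partitions instead of the default single partition.
coccurrence_df = (
    df2.groupby(['h1_h2', 'hashtag1', 'hashtag2', 'user_id'])
       .count(split_out=8)
       .reset_index()
       .groupby(['h1_h2', 'hashtag1', 'hashtag2'])
       .message_id.count(split_out=8)
       .reset_index()
       .rename(columns={"message_id": "coccurrence"})
)

print(coccurrence_df.npartitions)  # 8 instead of 1
The rest of your code (the filtering, to_csv(..., compute=False), and dask.compute) stays the same; the downstream operations then work on 8 partitions rather than 1.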
Upvotes: 2