Reputation: 2273
I have a dataset:
id  category  description  status
11  A         Text_1       Finished
11  A         Text_2       Pause
11  A         Text_3       Started
22  A         Text_1       Pause
33  B         Text_1       Finished
33  B         Text_2       Finished
I want to group the data by id and concatenate description, but only for rows with status = 'Finished'.
So the desired output is:
id  category  description
11  A         Text_1
22  A
33  B         Text_1 Text_2
I can concatenate the descriptions using:
data.groupby(['id', 'category'])['description'].apply(' '.join).reset_index()
But how can I apply a condition inside this expression?
Upvotes: 0
Views: 1990
Reputation: 9379
Here's a way:
key = ['id', 'category']
df2 = data[key].drop_duplicates().join(
    data.query("status == 'Finished'").groupby(key).description.apply(' '.join),
    on=key).fillna('').reset_index(drop=True)
Explanation:
Use query() to keep only the rows whose status is "Finished", groupby() to group by the key columns [id, category], and your str.join() logic to concatenate the description values within each group.
Then use DataFrame.join() on the key columns to expand the filtered result so that it contains every key, and fillna() to replace NaN with an empty string in the description column for keys that were filtered out.
Output:
   id category    description
0  11        A         Text_1
1  22        A
2  33        B  Text_1 Text_2
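For anyone who wants to run this end to end, here is a minimal sketch that splits the one-liner into the three steps described above. The DataFrame is rebuilt by hand from the sample in the question, and the intermediate names finished and all_keys are only illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

key = ["id", "category"]

# Step 1: keep only the finished rows and join their descriptions per key.
finished = (
    data.query("status == 'Finished'")
    .groupby(key)["description"]
    .apply(" ".join)
)

# Step 2: collect every distinct key, including keys with no finished rows.
all_keys = data[key].drop_duplicates()

# Step 3: left-join the joined text onto the keys and blank out the gaps.
df2 = all_keys.join(finished, on=key).fillna("").reset_index(drop=True)

print(df2)
```

Keeping the key list in one variable means the groupby and the join cannot drift apart if the grouping columns change later.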
Upvotes: 1
Reputation: 261860
You can use groupby.apply with a conditional and a default value if the group is empty after filtering:
# the walrus operator (:=) needs Python 3.8+
out = (data
    .groupby(['id', 'category'])
    .apply(lambda g: ' '.join(d['description'])
           if len(d := g[g['status'].eq('Finished')])
           else '')
    .reset_index(name='description')
)
Output:
   id category    description
0  11        A         Text_1
1  22        A
2  33        B  Text_1 Text_2
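If the walrus operator inside the lambda is hard to read, the same idea can be sketched with a named helper. The helper name join_finished is hypothetical, and selecting only the status and description columns before apply is a choice made here, not something the answer above does. Since ' '.join() on an empty selection already returns an empty string, the explicit else branch is not needed in this variant.

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})


def join_finished(group: pd.DataFrame) -> str:
    """Join the descriptions of the finished rows; '' if there are none."""
    finished = group.loc[group["status"].eq("Finished"), "description"]
    # Joining an empty selection yields '', which is the wanted default.
    return " ".join(finished)


out = (
    data.groupby(["id", "category"])[["status", "description"]]
    .apply(join_finished)
    .reset_index(name="description")
)

print(out)
```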
Upvotes: 2
Reputation: 323326
You can filter before the groupby, then reindex with the missing groups:
out = (data.loc[data.status == 'Finished']
    .groupby(['id', 'category'])['description'].apply(' '.join)
    # fill groups with no finished rows with an empty string
    .reindex(pd.MultiIndex.from_frame(data[['id', 'category']].drop_duplicates()),
             fill_value='')
    .reset_index()
)
Output:
   id category    description
0  11        A         Text_1
1  22        A
2  33        B  Text_1 Text_2
Upvotes: 3