Reputation: 13
I have a dataset which I need to group by on single field and aggregate on multiple fields. As part of aggregate, I need to concatenate a string column values in a sorted order conditionally.
Input:
SYSTIME TT REC TABLE cat_a cat_b cat_c
01/11/2019 00:00 2 102 A 1 0 0
01/11/2019 00:00 2 103 B 1 0 0
01/11/2019 00:00 2 150 C 0 1 0
01/11/2019 00:01 3 200 B 1 0 0
01/11/2019 00:01 3 150 A 1 0 0
01/11/2019 00:01 3 104 D 0 0 1
01/11/2019 00:02 4 200 F 1 0 0
01/11/2019 00:02 5 250 A 1 0 0
01/11/2019 00:02 2 120 C 0 1 0
01/11/2019 00:02 3 150 E 1 0 0
Output expected:
SYSTIME TT REC TABLE cat_a cat_b cat_c
01/11/2019 00:00 2 355 A;B 2 1 0
01/11/2019 00:01 3 454 A;B 2 0 1
01/11/2019 00:02 5 520 A;E;F 3 1 0
I have the following code:
df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg({'TT' : 'max','REC' : 'sum','TABLE': ';'.join, 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
The issue is with string concatenation, I want to concatenate TABLE values where cat_a=1 and also sorted order. Currently I am getting A;B;C for minute 00:00 but expect only A;B where cat_a=1
Is there a way to add condition to the join function?
P.S: I am new to python, I did see similar questions but I want specifically to add condition inside an agg function
Upvotes: 1
Views: 405
Reputation: 54
you problem can be solved by filtering your data before applying groupby and agg
function.Just add df=df[df['cat_a']==1].sort_values('TABLE')
before using this code
df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg({'TT' : 'max','REC' : 'sum','TABLE': ';'.join, 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
Upvotes: 0
Reputation: 1500
I can't find a way to do this within agg
so if anyone does please do say.
However it's easily done outside of agg
, with:
df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg( #Remove TABLE from first agg
{'TT' : 'max','REC' : 'sum', 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc = pd.merge(df_table_acc, df[df['cat_a']>0].copy().groupby(['SYSTIME'],as_index=False).agg(
{'TABLE':';'.join}),how='left',on='SYSTIME')
This was edited for indexing issues. We are now using merge
on SYSTIME
to make sure the TABLE
matches the SYSTIME
Alternatively, by changing the data, with a bit of cleanup afterwards (EDIT: fixed this part and added better separation)
import re
df['TABLE'] = df.apply(lambda x: x['TABLE'] if x['cat_a']>0 else '', axis=1)
df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg(
{'TT' : 'max','REC' : 'sum','TABLE': ';'.join,
'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc.TABLE = df_table_acc.TABLE.apply(lambda x: re.sub(';+',';',x).strip(';'))
#Quick explanation: the re part avoids having repeat ";" eg: "A;;C;D;;G" -> "A;C;D;G"
#The strip removes outside strings eg: ";A;B;" -> "A;B"
Make sure you don't need the TABLE
column for anything else before using the second method, or use a dummy column like TABLE2
or something.
Upvotes: 1