Conditionally concatenate strings within a groupby aggregate function

Question

I have a dataset which I need to group by on single field and aggregate on multiple fields. As part of aggregate, I need to concatenate a string column values in a sorted order conditionally.

Input:

SYSTIME             TT  REC TABLE   cat_a   cat_b   cat_c
01/11/2019 00:00    2   102 A       1       0       0
01/11/2019 00:00    2   103 B       1       0       0
01/11/2019 00:00    2   150 C       0       1       0
01/11/2019 00:01    3   200 B       1       0       0
01/11/2019 00:01    3   150 A       1       0       0
01/11/2019 00:01    3   104 D       0       0       1
01/11/2019 00:02    4   200 F       1       0       0
01/11/2019 00:02    5   250 A       1       0       0
01/11/2019 00:02    2   120 C       0       1       0
01/11/2019 00:02    3   150 E       1       0       0

Output expected:

SYSTIME             TT  REC TABLE   cat_a   cat_b   cat_c
01/11/2019 00:00    2   355 A;B     2       1       0
01/11/2019 00:01    3   454 A;B     2       0       1
01/11/2019 00:02    5   520 A;E;F   3       1       0

I have the following code:

df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg({'TT' : 'max','REC' : 'sum','TABLE': ';'.join, 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})

The issue is with string concatenation, I want to concatenate TABLE values where cat_a=1 and also sorted order. Currently I am getting A;B;C for minute 00:00 but expect only A;B where cat_a=1

Is there a way to add condition to the join function?

P.S: I am new to python, I did see similar questions but I want specifically to add condition inside an agg function

Jim Eisenberg · Accepted Answer

I can't find a way to do this within agg so if anyone does please do say.

However it's easily done outside of agg, with:

df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg(    #Remove TABLE from first agg
            {'TT' : 'max','REC' : 'sum', 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc = pd.merge(df_table_acc, df[df['cat_a']>0].copy().groupby(['SYSTIME'],as_index=False).agg(
            {'TABLE':';'.join}),how='left',on='SYSTIME')

This was edited for indexing issues. We are now using merge on SYSTIME to make sure the TABLE matches the SYSTIME

Alternatively, by changing the data, with a bit of cleanup afterwards (EDIT: fixed this part and added better separation)

import re
df['TABLE'] = df.apply(lambda x: x['TABLE'] if x['cat_a']>0 else '', axis=1)
df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg(
            {'TT' : 'max','REC' : 'sum','TABLE': ';'.join, 
             'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc.TABLE = df_table_acc.TABLE.apply(lambda x: re.sub(';+',';',x).strip(';'))
#Quick explanation: the re part avoids having repeat ";" eg: "A;;C;D;;G" -> "A;C;D;G"
#The strip removes outside strings eg: ";A;B;" -> "A;B"

Make sure you don't need the TABLE column for anything else before using the second method, or use a dummy column like TABLE2 or something.

Conditionally concatenate strings within a groupby aggregate function

Answers (2)

Related Questions