Amol
Amol

Reputation: 386

Pandas group by and sort by columns and need to add comma separated entries

we have below Pandas Dataframe

enter image description here

Desired output:

enter image description here

I want to do comma separated columns group by data column and order by order column

I tried below code but it orders on its own.

data= (df['section'].str.split(', ')
                    .groupby(df['data'])
                    .agg(lambda x: ', '.join(set(y for z in x for y in z)))
                    .reset_index())

Upvotes: 2

Views: 620

Answers (1)

jezrael
jezrael

Reputation: 863166

Use DataFrame.sort_values before groupby and if need same order like original by data column add DataFrame.reindex by unique values created by Series.unique:

df1 = (df.sort_values(['data','order'])
         .groupby('data')['section']
         .agg(','.join)
         .reindex(df['data'].unique())
         .reset_index())
print (df1)
        data                              section
0     google  code,reason,allergy,advance,service
1  Microsoft                          ab,ef,cd,gf

If possible data joined by , first sort data and then for unique values is used dictionary, because sets has not defined order:

print (df)
        data     section  order
0     google  code, code      1
1     google     allergy      3
2     google      reason      2
3     google     service      5
4     google     advance      4
5  Microsoft  ab, ab, kl      1
6  Microsoft  eh, eh, cd      3
7  Microsoft          ef      2
8  Microsoft          gf      4


df1 = df.sort_values(['data','order'])

data= (df1['section'].str.split(', ')
                     .groupby(df['data'])
                     .agg(lambda x: ', '.join(dict.fromkeys(y for z in x for y in z).keys()))
                     .reindex(df['data'].unique())
                     .reset_index()
                    )
print (data)
        data                                  section
0     google  code, reason, allergy, advance, service
1  Microsoft                   ab, kl, ef, eh, cd, gf

Alternative for remove duplicates by DataFrame.explode with DataFrame.drop_duplicates:

df1 = df.sort_values(['data','order'])

data= (df1.assign(section= df1['section'].str.split(', '))
          .explode('section')
          .drop_duplicates(['data', 'section','order'])
          .groupby('data')['section']
          .agg(', '.join)
          .reindex(df['data'].unique())
          .reset_index()
                    )
print (data)
        data                                  section
0     google  code, reason, allergy, advance, service
1  Microsoft                   ab, kl, ef, eh, cd, gf

Upvotes: 2

Related Questions