Latent
Latent

Reputation: 596

panda group on set column and get corresponding max value from another column

My data is as follow :

 set_name |  cases | label

0 {a,b,c}    7        'a b c'
1 {j,i,a}    2        'j i a'
2 {a,c}      1        'a c'
3 {b,c,a}    2        'b c a'

i would like to groupby on set_name with sum on cases . but - i need the label to correspond to the row with max cases . so the output will look like :

 set_name |  cases | label

0 {a,b,c}    9        'a b c'
1 {j,i,a}    2        'j i a'
2 {a,c}      1        'a c'

Since both {a,b,c} and {b,c,a} are equal (they are set) i want to sum them up. i've tried using the solution suggested here but since groupby cant work on sets i cant proceed with it

Upvotes: 1

Views: 58

Answers (2)

jezrael
jezrael

Reputation: 862396

Becuase sets are unhashable one possible solution is use frozensets and aggregate by GroupBy.agg:

df1 = (df.sort_values(by='cases', ascending=False)
         .groupby(df['set_name'].apply(frozenset))
         .agg({'set_name': 'first',
               'cases':'sum',
               'label':'first'})
         .reset_index(drop=True))
print (df1)
    set_name  cases  label
0  {a, c, b}      9  a b c
1  {i, j, a}      2  j i a
2     {a, c}      1    a c

Upvotes: 2

Allen Qin
Allen Qin

Reputation: 19947

(
    df.sort_values(by='cases', ascending=False)
    .groupby(df.set_name.apply(sorted).apply(str))
    .agg(set_name=('set_name', 'first'),
         cases=('cases', 'sum'),
         lalbel=('label', 'first'),
        )
    .reset_index(drop=True)
)

Upvotes: 1

Related Questions