Alana
Alana

Reputation: 89

Pandas groupby of specific catergorical column

With reference to Pandas groupby with categories with redundant nan

import pandas as pd

df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TYPE"] = df["TYPE"].astype("category")
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()

              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
     4  A         0
        B         0
     5  A         0
        B         0
     8  A         0
        B         0
2    1  A         0
        B         0
     2  A         0
        B         0
     4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0

Expected output

              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0

I tried to use astype("category") for TYPE. However it seems to output every cartesian product of every item in every group.

Upvotes: 0

Views: 199

Answers (2)

Naveed
Naveed

Reputation: 11650

here is one way to do it, based on the data you shared

reset the index and then do the groupby to choose groups where sum is greater than 0, means either of the category A or B is non-zero. Finally set the index

df.reset_index(inplace=True)

(df[df.groupby(['TEAM','ID'])['VALUE']
    .transform(lambda x: x.sum()>0)]
    .set_index(['TEAM','ID','TYPE']))

                 VALUE
TEAM    ID  TYPE    
1     1     A      1
            B      1
      2     A      1
            B      1
2     4     A      1
            B      0
      5     A      1
            B      0
      8     A      1
            B      0

Upvotes: 1

BeRT2me
BeRT2me

Reputation: 13242

What you want is a little abnormal, but we can force it there from a pivot table:

out = df.pivot_table(index=['TEAM', 'ID'], 
                     columns=['TYPE'],  
                     values=['VALUE'], 
                     aggfunc='sum', 
                     observed=True, # This is the key when working with categoricals~
                                    # You should known to try this with your groupby from the post you linked...
                     fill_value=0).stack()
print(out)

Output:

              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0

Upvotes: 2

Related Questions