Alex P
Alex P

Reputation: 137

Pandas group by and make a summary

I have a dataframe with two columns ID and labels. Labels can only be 0 or 1.

The code below generates such a dataframe

data = [[10105, 1], [10105, 1], [10105, 0], [20205, 0], [20205, 0], [20205, 1], [20205, 1]] 

test=pd.DataFrame(data,columns=["ID","label"])

test
      ID  label
0  10105      1
1  10105      1
2  10105      0
3  20205      0
4  20205      0
5  20205      1
6  20205      1

I would like to get some statistics about the labels once data is grouped by ID.

The

test.groupby('ID')

will group the entries by ID but then I want to see how many entries with the ID 10105 have a label of 1 and how many have a label of a 0. Also I would like to calculate the percentage of 0s. That would be then the ideal output

ID 10105, label1: 2, label0: 1, Percantage (label0/(label1+label0)): 1/3
ID 20205, label1: 2, label0: 2, Percantage (label0/(label1+label0)): 2/4

I think python has a way to aggregate results but at the same time I need a way to make calculations between the labels of a specific ID.

Can you please help me?

I would like to thank you in advance for your reply.

Regards Alex

Upvotes: 3

Views: 5283

Answers (2)

jezrael
jezrael

Reputation: 862761

You can use GroupBy.agg with tuples for new column name with aggregate function - for label0 and Percantage compare values by 0 and count by sum, mean for percentage. It working, because Trues are processing like 1.

df = test.groupby('ID')['label'].agg([('label1','sum'),
                                      ('label0',lambda x: x.eq(0).sum()),
                                      ('Percantage',lambda x: x.eq(0).mean())])
print (df)
       label1  label0  Percantage
ID                               
10105       2       1    0.333333
20205       2       2    0.500000

If use pandas 0.25+:

def label0(x):
    return x.eq(0).sum()

def Percantage(x):
    return x.eq(0).mean()


df = test.groupby('ID').agg(label1=pd.NamedAgg(column='label', aggfunc='sum'),
                            label0=pd.NamedAgg(column='label', aggfunc=label0),
                            Percantage=pd.NamedAgg(column='label', aggfunc=Percantage))

Last if need column from index:

df = df.reset_index()
print (df)

      ID  label1  label0  Percantage
0  10105       2       1    0.333333
1  20205       2       2    0.500000

Upvotes: 4

maow
maow

Reputation: 2887

Use:

>>> test['label'].groupby(test['ID']).value_counts(normalize=True)
ID     label
10105  1        0.666667
       0        0.333333
20205  0        0.500000
       1        0.500000
Name: label, dtype: float64

@jezrael I did not know you can group a Series by another column. Neat

Upvotes: 1

Related Questions