pmonasterio
pmonasterio

Reputation: 17

Count the value of a column if is greater than 0 in a groupby result

I have a groupby array in which I need to group by A, then show a count of instances of B separated by B1 and B2 and finally the percentage of those instances that are > 0.1 so I did this to get the first 2:

    A   B       C   
id                                      
118 a1  B1      0
119 a1  B1      0
120 a1  B1      101.1   
121 a1  B1      106.67  
122 a1  B2      103.33  
237 a1  B2      100 

df = pd.DataFrame(df.groupby(
    ['A', 'B'])['B'].aggregate('count')).unstack(level=1)

to which I get the first part right:

                   B
B                  B1   B2
A
a1                  4   2
a2                  7   9
a3                  9   17
a4                  8   8
a5                  7   8

But then when I need to get the percentage of the count that is > 0

prcnt_complete = df[['A', 'B', 'C']]
prcnt_complete['passed'] = prcnt_complete['C'].apply(lambda x: (float(x) > 1))
prcnt_complete = prcnt_complete.groupby(['A', 'B', 'passed']).count()

I get weird values that make no sense, sometimes the sum between True and False doesn't even add up. I'm trying to understand what in the order of things I'm doing wrong so that I can make sense of it.

The result I'm looking for is something like this:

                   B         passed
B                  B1   B2   B1   B2
A
a1                  4   2    2    2
a2                  7   9    7    6
a3                  9   17   9    5

Thanks,

Upvotes: 0

Views: 962

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150745

You can do:

(df['C'].gt(1).groupby([df['A'],df['B']])
    .agg(['size','sum'])
    .rename(columns={'size':'B','sum':'passed'})
    .unstack('B')
)

Output (from sample data):

       B    passed   
B  B1 B2     B1 B2
A                 
a1  4  2      2  2

Upvotes: 1

Joe Ferndz
Joe Ferndz

Reputation: 8508

While working on your problem, I also wanted to see if I can get the average percentage for B (while ignoring 0s). I was able to accomplish this as well while getting the counts.

DataFrame for this exercise:

     A   B       C
0   a1  B1    0.00
1   a1  B1    0.00
2   a1  B1   98.87
3   a1  B1  101.10
4   a1  B2  106.67
5   a1  B2  103.00
6   a2  B1    0.00
7   a2  B1    0.00
8   a2  B1   33.00
9   a2  B1  100.00
10  a2  B2   80.00
11  a3  B1   90.00
12  a3  B2   99.00

Average while excluding the zeros

for this I had to add .replace(0, np.nan) before the groupby function.

A = ['a1','a1','a1','a1','a1','a1','a2','a2','a2','a2','a2','a3','a3']
B = ['B1','B1','B1','B1','B2','B2','B1','B1','B1','B1','B2','B1','B2']
C = [0,0,98.87,101.1,106.67,103,0,0,33,100,80,90,99]
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':A,'B':B,'C':C})

df = pd.DataFrame(df.replace(0, np.nan)
                    .groupby(['A', 'B'])
                    .agg({'B':'size','C':['count','mean']})
                    .rename(columns={'size':'Count','count':'Passed','mean':'Avg Score'})).unstack(level=1)
df.columns = df.columns.droplevel(0)

   Count    Passed    Avg Score         
B     B1 B2     B1 B2        B1       B2
A                                       
a1     4  2      2  2    99.985  104.835
a2     4  1      2  1    66.500   80.000
a3     1  1      1  1    90.000   99.000

Upvotes: 0

Related Questions