Reputation: 17
I have a groupby array in which I need to group by A, then show a count of instances of B separated by B1 and B2 and finally the percentage of those instances that are > 0.1 so I did this to get the first 2:
A B C
id
118 a1 B1 0
119 a1 B1 0
120 a1 B1 101.1
121 a1 B1 106.67
122 a1 B2 103.33
237 a1 B2 100
df = pd.DataFrame(df.groupby(
['A', 'B'])['B'].aggregate('count')).unstack(level=1)
to which I get the first part right:
B
B B1 B2
A
a1 4 2
a2 7 9
a3 9 17
a4 8 8
a5 7 8
But then when I need to get the percentage of the count that is > 0
prcnt_complete = df[['A', 'B', 'C']]
prcnt_complete['passed'] = prcnt_complete['C'].apply(lambda x: (float(x) > 1))
prcnt_complete = prcnt_complete.groupby(['A', 'B', 'passed']).count()
I get weird values that make no sense, sometimes the sum between True and False doesn't even add up. I'm trying to understand what in the order of things I'm doing wrong so that I can make sense of it.
The result I'm looking for is something like this:
B passed
B B1 B2 B1 B2
A
a1 4 2 2 2
a2 7 9 7 6
a3 9 17 9 5
Thanks,
Upvotes: 0
Views: 962
Reputation: 150745
You can do:
(df['C'].gt(1).groupby([df['A'],df['B']])
.agg(['size','sum'])
.rename(columns={'size':'B','sum':'passed'})
.unstack('B')
)
Output (from sample data):
B passed
B B1 B2 B1 B2
A
a1 4 2 2 2
Upvotes: 1
Reputation: 8508
While working on your problem, I also wanted to see if I can get the average percentage for B (while ignoring 0s). I was able to accomplish this as well while getting the counts.
DataFrame for this exercise:
A B C
0 a1 B1 0.00
1 a1 B1 0.00
2 a1 B1 98.87
3 a1 B1 101.10
4 a1 B2 106.67
5 a1 B2 103.00
6 a2 B1 0.00
7 a2 B1 0.00
8 a2 B1 33.00
9 a2 B1 100.00
10 a2 B2 80.00
11 a3 B1 90.00
12 a3 B2 99.00
Average while excluding the zeros
for this I had to add .replace(0, np.nan)
before the groupby function.
A = ['a1','a1','a1','a1','a1','a1','a2','a2','a2','a2','a2','a3','a3']
B = ['B1','B1','B1','B1','B2','B2','B1','B1','B1','B1','B2','B1','B2']
C = [0,0,98.87,101.1,106.67,103,0,0,33,100,80,90,99]
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':A,'B':B,'C':C})
df = pd.DataFrame(df.replace(0, np.nan)
.groupby(['A', 'B'])
.agg({'B':'size','C':['count','mean']})
.rename(columns={'size':'Count','count':'Passed','mean':'Avg Score'})).unstack(level=1)
df.columns = df.columns.droplevel(0)
Count Passed Avg Score
B B1 B2 B1 B2 B1 B2
A
a1 4 2 2 2 99.985 104.835
a2 4 1 2 1 66.500 80.000
a3 1 1 1 1 90.000 99.000
Upvotes: 0