Reputation: 191
I have a pandas df like
df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y','N','Y','Y','N']})
and my desired output is
df_test2 = pd.DataFrame({'A': 'a b'.split(), 'B': [2/3,1/2]})
How would you do a groupby().apply by column A to get the percentage of 'Y' in column B?
I have been searching for groupby.apply() examples, but nothing has worked so far. Thank you!
Upvotes: 5
Views: 16481
Reputation: 878
This is a generalized solution that doesn't alter the table or do any filtering or transformation before using groupby.
> s = df_test.groupby(['A'])['B'].value_counts(normalize=True)
> print(s)
A  B
a  Y    0.666667
   N    0.333333
b  N    0.500000
   Y    0.500000
Name: B, dtype: float64
The variable s above is a multi-index Series, and you can access any of its rows using .loc
> s.loc[:,'Y']
A
a 0.666667
b 0.500000
Name: B, dtype: float64
Similarly, you can access the details about 'N' using the same series.
> s.loc[:,'N']
A
a 0.333333
b 0.500000
Name: B, dtype: float64
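If you want the exact DataFrame layout from the question rather than a multi-index Series, one possible way is to unstack the inner level and keep only the 'Y' column. This is a minimal sketch, not the only option; the names wide and df_test2 are just illustrative, and fill_value=0 is my assumption for how a group with no 'Y' at all should be reported.

```python
import pandas as pd

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y', 'N', 'Y', 'Y', 'N']})

# Per-group proportions of each value in B (multi-index Series: A, B).
s = df_test.groupby('A')['B'].value_counts(normalize=True)

# Pivot the inner level (B) into columns; a group lacking a value gets 0.
wide = s.unstack(fill_value=0)

# Keep only the 'Y' share and name it B to match the desired output.
df_test2 = wide['Y'].rename('B').reset_index()
print(df_test2)
```

Unlike s.loc[:,'Y'], this keeps a row for every group in A, even one where 'Y' never occurs.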
PS: If you want to understand groupby better, try to decode this code, which is the same as the above but swaps the column names and so gives different results.
> r = df_test.groupby(['B'])['A'].value_counts(normalize=True)
> print(r)
B  A
N  a    0.500000
   b    0.500000
Y  a    0.666667
   b    0.333333
Name: A, dtype: float64
and
> r.loc['Y',:]
B  A
Y  a    0.666667
   b    0.333333
Name: A, dtype: float64
Upvotes: 8
Reputation: 1556
Personal favorite way:
df.column_name.value_counts() / len(df)
Gives a series with the column's values as the index and the proportions as the values.
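Applied to the frame from the question, the one-liner might look like the sketch below. Note that it works on the whole column, so it gives the overall share of 'Y' and 'N' rather than one share per group in A. The variable names overall and same are only illustrative.

```python
import pandas as pd

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y', 'N', 'Y', 'Y', 'N']})

# Share of each value in B over the whole frame (no grouping by A).
overall = df_test.B.value_counts() / len(df_test)
print(overall)

# value_counts(normalize=True) yields the same numbers in one call.
same = df_test.B.value_counts(normalize=True)
print(same)
```

For a per-group share, the same idea has to be combined with groupby, as in the other answers.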
Upvotes: 14
Reputation: 862691
Use GroupBy.mean with a boolean mask, where True values are treated as 1. No new column is necessary, because you can also pass the Series df_test["A"] directly to groupby.
Note: eq is used instead of == for cleaner syntax.
df = df_test["B"].eq('Y').groupby(df_test["A"]).mean().reset_index()
print(df)
A B
0 a 0.666667
1 b 0.500000
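To see why the mean of a boolean mask is the share of 'Y', it may help to compute the pieces separately. This is only an illustrative sketch; the names mask, counts and share are mine, not part of the answer's code.

```python
import pandas as pd

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y', 'N', 'Y', 'Y', 'N']})

# Boolean Series: True where B is 'Y', False otherwise.
mask = df_test['B'].eq('Y')

# Per group: number of True values (sum) and number of rows (size).
counts = mask.groupby(df_test['A']).agg(['sum', 'size'])

# Dividing the two gives the same value that .mean() returns.
counts['share'] = counts['sum'] / counts['size']
print(counts)
```

Here group a has 2 'Y' values out of 3 rows and group b has 1 out of 2, which matches the 0.666667 and 0.500000 shown above.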
Upvotes: 5
Reputation: 18201
One approach could be
In [10]: df_test.groupby('A').B.apply(lambda x: (x == 'Y').mean())
Out[10]:
A
a 0.666667
b 0.500000
or, if you don't mind changing df_test in the process,
In [15]: df_test['C'] = df_test.B == 'Y'
In [17]: df_test.groupby('A').C.mean()
Out[17]:
A
a 0.666667
b 0.500000
Name: C, dtype: float64
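If you like the helper-column version but would rather leave df_test as it is, a possible variant is to build the column with assign, which returns a new frame. This is a sketch under that assumption; the name result is only illustrative.

```python
import pandas as pd

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y', 'N', 'Y', 'Y', 'N']})

# assign() returns a new frame with the helper column C; df_test is untouched.
result = df_test.assign(C=df_test.B == 'Y').groupby('A').C.mean()
print(result)
```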
Upvotes: 4