Tung

Reputation: 191

pandas: percentage count of a categorical variable

I have a pandas df like

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y','N','Y','Y','N']})

and I want my output to be

df_test2 = pd.DataFrame({'A': 'a b'.split(), 'B': [2/3, 1/2]})

How would you do a groupby().apply on column A to get the percentage of 'Y' in column B?

I have been searching for groupby.apply() examples, but nothing has worked so far. Thank you!

Upvotes: 5

Views: 16481

Answers (4)

Rohit Nandi

Reputation: 878

This is a generalized solution that neither alters the table nor does any filtering or transformation before using groupby.

> s = df_test.groupby(['A'])['B'].value_counts(normalize=True)
> print(s)

A  B
a  Y    0.666667
   N    0.333333
b  N    0.500000
   Y    0.500000
Name: B, dtype: float64

The variable s above is a multi-index Series, and you can access any of its rows using .loc

> s.loc[:,'Y']
A
a    0.666667
b    0.500000
Name: B, dtype: float64

Similarly, you can access the details about 'N' using the same series.

> s.loc[:,'N']
A
a    0.333333
b    0.500000
Name: B, dtype: float64
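If you want both shares side by side instead of slicing with .loc twice, one option is to unstack the inner index level into columns. This is a minimal sketch and not part of the original answer; the names wide and y_share are only illustrative.

```python
import pandas as pd

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y', 'N', 'Y', 'Y', 'N']})

# Per-group proportions of each value in B (multi-index Series: A, then B)
s = df_test.groupby('A')['B'].value_counts(normalize=True)

# Move the inner level (B) into columns; a value missing from a group becomes 0.0
wide = s.unstack(fill_value=0.0)

# The share of 'Y' per group is then just one column of the wide table
y_share = wide['Y']
print(wide)
```

The fill_value matters when some group has no 'Y' (or no 'N') at all; without it that cell would be NaN.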

PS: If you want to understand groupby better, try to work through this code, which is the same as the above except that the column names are swapped, so the results differ.

> r = df_test.groupby(['B'])['A'].value_counts(normalize=True)
> print(r)
B  A
N  a    0.500000
   b    0.500000
Y  a    0.666667
   b    0.333333
Name: A, dtype: float64

and

> r.loc['Y',:]
B  A
Y  a    0.666667
   b    0.333333
Name: A, dtype: float64

Upvotes: 8

Freestyle076

Reputation: 1556

My personal favorite way:

df.column_name.value_counts() / len(df)

Gives a series with the column's values as the index and the proportions as the values.
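Applied to the question's data, this could look like the sketch below (the variable names are only illustrative). Note that it gives proportions over the whole column, not per group of A, and that it should match value_counts(normalize=True) as long as the column has no missing values.

```python
import pandas as pd

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y', 'N', 'Y', 'Y', 'N']})

# Counts of each value in B divided by the number of rows
overall = df_test['B'].value_counts() / len(df_test)

# Same proportions, computed by pandas directly
normalized = df_test['B'].value_counts(normalize=True)

print(overall)
```

With NaN values in the column the two differ: value_counts drops NaN by default, while len(df) still counts those rows.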

Upvotes: 14

jezrael

Reputation: 862691

Use GroupBy.mean with a boolean mask, where True values are treated as 1. No new column is necessary, because the Series df_test["A"] can be passed directly to groupby:

Note: eq is used instead of == for cleaner syntax.

df = df_test["B"].eq('Y').groupby(df_test["A"]).mean().reset_index()
print(df)
   A         B
0  a  0.666667
1  b  0.500000
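To see why the mean of a boolean mask is the share of 'Y', here is a small sketch (my own check, with illustrative variable names) that compares it against an explicit "number of matches divided by group size" calculation.

```python
import pandas as pd

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y', 'N', 'Y', 'Y', 'N']})

# Boolean mask: True where B is 'Y'
mask = df_test['B'].eq('Y')

# Mean of the mask per group of A (True counts as 1, False as 0)
by_mean = mask.groupby(df_test['A']).mean()

# The same thing spelled out: matches per group / rows per group
ones = mask.astype(int)
by_ratio = ones.groupby(df_test['A']).sum() / ones.groupby(df_test['A']).count()

print(by_mean)
print(by_ratio)
```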

Upvotes: 5

fuglede

Reputation: 18201

One approach could be

In [10]: df_test.groupby('A').B.apply(lambda x: (x == 'Y').mean())
Out[10]:
A
a    0.666667
b    0.500000

or, if you don't mind changing df_test in the process,

In [15]: df_test['C'] = df_test.B == 'Y'
In [17]: df_test.groupby('A').C.mean()
Out[17]:
A
a    0.666667
b    0.500000
Name: C, dtype: float64
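If you do mind changing df_test, a possible variant of the second approach is to create the helper column with assign, which returns a new DataFrame and leaves the original untouched. This is only a sketch; the column name C is kept from the example above.

```python
import pandas as pd

df_test = pd.DataFrame({'A': 'a a a b b'.split(), 'B': ['Y', 'N', 'Y', 'Y', 'N']})

# assign() returns a copy with the extra boolean column C; df_test is not modified
result = df_test.assign(C=df_test['B'] == 'Y').groupby('A')['C'].mean()

print(result)
print(df_test.columns.tolist())  # still only the original columns
```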

Upvotes: 4
