Reputation: 133
import pandas as pd
group = ['A']*5 + ['B']*4 + ['C']*3
color = ['red', 'red', 'green', 'blue', 'green', 'red'] * 2
df = pd.DataFrame({'group':group, 'color':color})
df
group color
0 A red
1 A red
2 A green
3 A blue
4 A green
5 B red
6 B red
7 B red
8 B green
9 C blue
10 C green
11 C red
What I want is to add new column to this dataframe which will show frequency of each color in a group (as a part of whole). Values for this column can be obtained with following code:
df.groupby('group').color.value_counts(normalize=True)
group color
A green 0.400000
red 0.400000
blue 0.200000
B red 0.750000
green 0.250000
C blue 0.333333
green 0.333333
red 0.333333
Name: color, dtype: float64
However I cannot figure out how to turn it into a column corresponding to original dataframe. I've tried
df['freq'] = df.groupby('group').color.transform('value_counts(normalize=True)') # error
which doesn't work. Here is the output I am looking for:
group color freq
0 A red 0.4
1 A red 0.4
2 A green 0.4
3 A blue 0.2
4 A green 0.4
5 B red 0.75
6 B red 0.75
7 B red 0.75
8 B green 0.25
9 C blue 0.33
10 C green 0.33
11 C red 0.33
Upvotes: 2
Views: 721
Reputation: 35626
Alternatively join
counts on group
and color
:
counts = df.groupby('group')['color'].value_counts(normalize=True)
df = df.join(counts.rename('freq'), on=['group', 'color'])
group color freq
0 A red 0.400000
1 A red 0.400000
2 A green 0.400000
3 A blue 0.200000
4 A green 0.400000
5 B red 0.750000
6 B red 0.750000
7 B red 0.750000
8 B green 0.250000
9 C blue 0.333333
10 C green 0.333333
11 C red 0.333333
Or calculate normalized value counts manually with counting group
+ color
counts vs group
counts via groupby transform
:
df['freq'] = (
df.groupby(['group', 'color'])['color'].transform('count') /
df.groupby('group')['group'].transform('count')
)
group color freq
0 A red 0.400000
1 A red 0.400000
2 A green 0.400000
3 A blue 0.200000
4 A green 0.400000
5 B red 0.750000
6 B red 0.750000
7 B red 0.750000
8 B green 0.250000
9 C blue 0.333333
10 C green 0.333333
11 C red 0.333333
Upvotes: 3
Reputation: 323226
Fix it by MultiIndex
and reindex
idx = pd.MultiIndex.from_frame(df[['group' , 'color']])
df['new'] = df.groupby('group').color.value_counts(normalize=True).reindex(idx).values
df
Out[115]:
group color new
0 A red 0.400000
1 A red 0.400000
2 A green 0.400000
3 A blue 0.200000
4 A green 0.400000
5 B red 0.750000
6 B red 0.750000
7 B red 0.750000
8 B green 0.250000
9 C blue 0.333333
10 C green 0.333333
11 C red 0.333333
Upvotes: 3