Reputation: 97
I have 2 columns - _a, _b.
import numpy as np
import pandas as pd
df = pd.DataFrame({'_a':[1,1,1,2,2,3,3],'_b':[3,4,5,3,3,3,9], 'a_b_3':[3,3,3,1,1,2,2]})
df
_a _b a_b_3
0 1 3 3
1 1 4 3
2 1 5 3
3 2 3 1
4 2 3 1
5 3 3 2
6 3 9 2
I need create column a_b_3 (unique count from column '_b') use groupby from pandas. Thank you in advance.
Upvotes: 0
Views: 589
Reputation: 291
If I understand you correctly what you want is to group by column _a, count the number of unique values in column _b within each group and then append this count to the original dataframe using _a as the key. The following code should achieve that.
df.merge(pd.DataFrame(df.groupby('_a')._b.nunique()), left_on='_a', right_index=True)
Breaking it down, the first thing is to group by _a and then count the uniques in column _b. That's what df.groupby('_a')._b.nunique()
does. Then it's merged with the original dataframe using _a as the key. The groupby returns a series so we need to convert it to a dataframe before merging, hence the pd.DataFrame
.
EDIT
@COLDSPEED's answer above is much more efficient than this one. To give an idea of the speed difference I ran a timeit which shows a speed up of 2x on this small dataframe, on larger dataframes it would probably be even more.
Using merge:
%timeit df.merge(pd.DataFrame(df.groupby('_a')._b.nunique()), left_on='_a', right_index=True)
1.43 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using transform:
%timeit df.groupby('_a')['_b'].transform('nunique')
750 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 1
Reputation: 402263
Looks like you want transform
+ nunique
;
df['a_b_3'] = df.groupby('_a')['_b'].transform('nunique')
df
_a _b a_b_3
0 1 3 3
1 1 4 3
2 1 5 3
3 2 3 1
4 2 3 1
5 3 3 2
6 3 9 2
This is effectively groupby
+ nunique
+ map
:
v = df.groupby('_a')['_b'].nunique()
df['a_b_3'] = df['_a'].map(v)
df
_a _b a_b_3
0 1 3 3
1 1 4 3
2 1 5 3
3 2 3 1
4 2 3 1
5 3 3 2
6 3 9 2
Upvotes: 5
Reputation: 9081
Use -
df2=df.groupby(['_a'])['_b'].nunique().reset_index()
df['a_b_3'] = df.merge(df2, how='left', on='_a')[['_b_y']]
Output
_a _b a_b_3
0 1 3 3
1 1 4 3
2 1 5 3
3 2 3 1
4 2 3 1
5 3 3 2
6 3 9 2
Upvotes: 3