spectrum
spectrum

Reputation: 97

Count the number of unique values per group

I have 2 columns - _a, _b.

import numpy as np 
import pandas as pd
df = pd.DataFrame({'_a':[1,1,1,2,2,3,3],'_b':[3,4,5,3,3,3,9], 'a_b_3':[3,3,3,1,1,2,2]})
df

    _a  _b  a_b_3   
0   1   3   3
1   1   4   3
2   1   5   3
3   2   3   1
4   2   3   1
5   3   3   2
6   3   9   2

I need create column a_b_3 (unique count from column '_b') use groupby from pandas. Thank you in advance.

Upvotes: 0

Views: 589

Answers (3)

the_martian
the_martian

Reputation: 291

If I understand you correctly what you want is to group by column _a, count the number of unique values in column _b within each group and then append this count to the original dataframe using _a as the key. The following code should achieve that.

df.merge(pd.DataFrame(df.groupby('_a')._b.nunique()), left_on='_a', right_index=True)

Breaking it down, the first thing is to group by _a and then count the uniques in column _b. That's what df.groupby('_a')._b.nunique() does. Then it's merged with the original dataframe using _a as the key. The groupby returns a series so we need to convert it to a dataframe before merging, hence the pd.DataFrame.

EDIT

@COLDSPEED's answer above is much more efficient than this one. To give an idea of the speed difference I ran a timeit which shows a speed up of 2x on this small dataframe, on larger dataframes it would probably be even more.

Using merge:

%timeit df.merge(pd.DataFrame(df.groupby('_a')._b.nunique()), left_on='_a', right_index=True)
1.43 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using transform:

%timeit df.groupby('_a')['_b'].transform('nunique')
750 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 1

cs95
cs95

Reputation: 402263

Looks like you want transform + nunique;

df['a_b_3'] = df.groupby('_a')['_b'].transform('nunique')        
df
   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2

This is effectively groupby + nunique + map:

v = df.groupby('_a')['_b'].nunique()
df['a_b_3'] = df['_a'].map(v)

df
   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2

Upvotes: 5

Vivek Kalyanarangan
Vivek Kalyanarangan

Reputation: 9081

Use -

df2=df.groupby(['_a'])['_b'].nunique().reset_index()
df['a_b_3'] = df.merge(df2, how='left', on='_a')[['_b_y']]

Output

   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2

Upvotes: 3

Related Questions