melatonin15
melatonin15

Reputation: 2269

pandas: groupby with conditional formula and unique values

I have the following dataframe

import pandas as pd
  
country = ['US', 'US', 'US', 'UK', 'UK', 'Canada', 'Canada', "Mexico"]
feature =  [2, 2, 2, 1, 1, 2, 2, 1]
ID = [1, 2, 1, 3, 4, 1, 2, 1]  

df = pd.DataFrame(list(zip(country,feature, ID)),
               columns =['country', 'feature', 'ID'])

which is

   country  feature ID
0   US        2     1
1   US        2     2
2   US        2     1
3   UK        1     3
4   UK        1     4
5   Canada    2     1
6   Canada    2     2
7   Mexico    1     1

Given the above dataframe I want to perform groupby on country. I want to create a value count for each country with the following rule: If feature ==2, count = number of unique entries under ID for that country. If feature !=2, count = number of total entries for that country. So, the resulting new dataframe will look like this:

  country   count
0   US        2
1   UK        2
2   Canada    2
3   Mexico    1

Upvotes: 0

Views: 1204

Answers (4)

Pierre D
Pierre D

Reputation: 26211

You can do it in two groupby operations, which is actually faster than a single groupby containing a lambda to test feature. Also, to solve the case where some countries may accidentally have a mix of feature == 2 and feature != 2, we'll take the max:

a = df.groupby(['country', 'ID']).size().groupby('country').size()
b = df.groupby('country').agg({'feature': max, 'ID': 'size'})
out = b['ID'].where(b['feature'] != 2, a).to_frame('count')

>>> out
         count
country       
Canada       2
Mexico       1
UK           2
US           2

Speed comparison

# 1. setup
n = 100_000
ci = np.random.randint(0, 80, n)
df = pd.DataFrame({
    'country': [f'c_{i}' for i in ci],
    'feature': [(i % 2) + 1 for i in ci],
    'ID': np.random.randint(0, 60, n),
})
# test
%timeit ours(df)
# 17.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit with_lambda(df)
# 34.7 ms ± 49.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

BENY
BENY

Reputation: 323226

Just do a simple condition within groupby

out = df.groupby('country').\
           apply(lambda x :x['ID'].nunique() if x['feature'].eq(2).all() else len(x))
Out[649]: 
country
Canada    2
Mexico    1
UK        2
US        2
dtype: int64

Upvotes: 2

SeaBean
SeaBean

Reputation: 23217

Assuming for each country, there is only a single value of feature, you can use .nunique() to get the unique count for feature==2 and use .size() to get the total entries for feature==1. Then, combine the results and sort by country:

df_eq = df.loc[df['feature'] == 2].groupby('country')['ID'].nunique().reset_index(name='count')
df_ne = df.loc[df['feature'] != 2].groupby('country')['ID'].size().reset_index(name='count')

df_out = df_eq.append(df_ne).sort_values('country').reset_index(drop=True)

Result:

print(df_out)

  country  count
0  Canada      2
1  Mexico      1
2      UK      2
3      US      2

Upvotes: 1

ThePyGuy
ThePyGuy

Reputation: 18416

Count for Canada should be 2 in the expected output.

You can group the dataframe by country, then apply a function to get either unique count, or just the count based on the value of feature, then call to_frame, to create frame out of it.

(df.groupby('country', sort=False)
.apply(lambda x:pd.Series.nunique(x['ID']) 
                    if x['feature'].eq(2).any() 
                    else pd.Series.count(x['ID'])).to_frame('count')
 )

         count
country       
US           2
UK           2
Canada       2
Mexico       1

Upvotes: 2

Related Questions