Reputation: 2269
I have the following dataframe:
import pandas as pd
country = ['US', 'US', 'US', 'UK', 'UK', 'Canada', 'Canada', "Mexico"]
feature = [2, 2, 2, 1, 1, 2, 2, 1]
ID = [1, 2, 1, 3, 4, 1, 2, 1]
df = pd.DataFrame(list(zip(country, feature, ID)),
                  columns=['country', 'feature', 'ID'])
which is
country feature ID
0 US 2 1
1 US 2 2
2 US 2 1
3 UK 1 3
4 UK 1 4
5 Canada 2 1
6 Canada 2 2
7 Mexico 1 1
Given the above dataframe, I want to perform a groupby on country and create a count for each country with the following rule: if feature == 2, count = the number of unique entries under ID for that country; if feature != 2, count = the total number of entries for that country. The resulting new dataframe will look like this:
country count
0 US 2
1 UK 2
2 Canada 2
3 Mexico 1
Upvotes: 0
Views: 1204
Reputation: 26211
You can do it in two groupby operations, which is actually faster than a single groupby containing a lambda to test feature. Also, to handle the case where a country accidentally has a mix of feature == 2 and feature != 2, we take the max:
a = df.groupby(['country', 'ID']).size().groupby('country').size()
b = df.groupby('country').agg({'feature': max, 'ID': 'size'})
out = b['ID'].where(b['feature'] != 2, a).to_frame('count')
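For clarity, here is the same approach run end-to-end on the question's data, with the two intermediate objects commented (`a` holds the per-country unique-ID counts, `b` the max feature value and total row count):

```python
import pandas as pd

country = ['US', 'US', 'US', 'UK', 'UK', 'Canada', 'Canada', 'Mexico']
feature = [2, 2, 2, 1, 1, 2, 2, 1]
ID = [1, 2, 1, 3, 4, 1, 2, 1]
df = pd.DataFrame({'country': country, 'feature': feature, 'ID': ID})

# a: number of unique IDs per country (used when feature == 2)
a = df.groupby(['country', 'ID']).size().groupby('country').size()

# b: max feature value and total row count per country
b = df.groupby('country').agg({'feature': 'max', 'ID': 'size'})

# keep the total size where feature != 2, otherwise fall back to the unique count
out = b['ID'].where(b['feature'] != 2, a).to_frame('count')
print(out)
```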
>>> out
count
country
Canada 2
Mexico 1
UK 2
US 2
Speed comparison
# 1. setup
import numpy as np

n = 100_000
ci = np.random.randint(0, 80, n)
df = pd.DataFrame({
    'country': [f'c_{i}' for i in ci],
    'feature': [(i % 2) + 1 for i in ci],
    'ID': np.random.randint(0, 60, n),
})
# test
%timeit ours(df)
# 17.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit with_lambda(df)
# 34.7 ms ± 49.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
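The benchmark calls `ours` and `with_lambda`, which the post does not define. A plausible reconstruction (an assumption, not the author's exact code) is that `ours` wraps the two-groupby approach above and `with_lambda` does the test inside a single `groupby.apply`:

```python
import pandas as pd

def ours(df):
    # two vectorized groupby operations (this answer's approach)
    a = df.groupby(['country', 'ID']).size().groupby('country').size()
    b = df.groupby('country').agg({'feature': 'max', 'ID': 'size'})
    return b['ID'].where(b['feature'] != 2, a).to_frame('count')

def with_lambda(df):
    # one groupby with a Python-level lambda evaluated per group
    return (df.groupby('country')
              .apply(lambda g: g['ID'].nunique()
                     if g['feature'].eq(2).all() else len(g))
              .to_frame('count'))
```

On the question's data (where each country has a single feature value) both return the same counts; the speed gap comes from `with_lambda` executing Python code once per group.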
Upvotes: 1
Reputation: 323226
Just do a simple condition within groupby:
out = df.groupby('country').apply(
    lambda x: x['ID'].nunique() if x['feature'].eq(2).all() else len(x))
Out[649]:
country
Canada 2
Mexico 1
UK 2
US 2
dtype: int64
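If you need the exact country/count frame from the question rather than a Series, the result above can be converted with `to_frame` and `reset_index` (a minor follow-up, not part of the original answer):

```python
import pandas as pd

country = ['US', 'US', 'US', 'UK', 'UK', 'Canada', 'Canada', 'Mexico']
feature = [2, 2, 2, 1, 1, 2, 2, 1]
ID = [1, 2, 1, 3, 4, 1, 2, 1]
df = pd.DataFrame({'country': country, 'feature': feature, 'ID': ID})

out = df.groupby('country').apply(
    lambda x: x['ID'].nunique() if x['feature'].eq(2).all() else len(x))

# name the Series and promote the index back to a column
result = out.to_frame('count').reset_index()
print(result)
```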
Upvotes: 2
Reputation: 23217
Assuming each country has only a single value of feature, you can use .nunique() to get the unique count where feature == 2, and .size() to get the total number of entries where feature != 2. Then combine the two results and sort by country:
df_eq = df.loc[df['feature'] == 2].groupby('country')['ID'].nunique().reset_index(name='count')
df_ne = df.loc[df['feature'] != 2].groupby('country')['ID'].size().reset_index(name='count')
df_out = pd.concat([df_eq, df_ne]).sort_values('country').reset_index(drop=True)
Result:
print(df_out)
country count
0 Canada 2
1 Mexico 1
2 UK 2
3 US 2
Upvotes: 1
Reputation: 18416
Count for Canada should be 2 in the expected output.
You can group the dataframe by country, then apply a function that returns either the unique count or the total count depending on the value of feature, and finally call to_frame to turn the result into a dataframe.
(df.groupby('country', sort=False)
   .apply(lambda x: x['ID'].nunique()
          if x['feature'].eq(2).any()
          else x['ID'].count())
   .to_frame('count'))
count
country
US 2
UK 2
Canada 2
Mexico 1
Upvotes: 2