Reputation: 3739
Suppose I have a pandas dataframe such as:
id | name
1  | foo
1  | bar
2  | foo
3  | bar
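For reference, here is a reproducible version of the sample data (a minimal sketch; the int/str column dtypes are assumed):
import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'id':   [1, 1, 2, 3],
    'name': ['foo', 'bar', 'foo', 'bar'],
})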
My goal is to get the % of distinct ids that have both 'foo' and 'bar'
So far I've attempted df.groupby('id').name.nunique(), but I don't want a count of unique names; I want to check for 'foo' and 'bar' specifically.
Can someone help me out?
Upvotes: 0
Views: 269
Reputation: 8996
names_per_id = df.groupby('id')['name'].unique()
gets you a Series of arrays (the unique names for each id):
id
1 [foo, bar]
2 [foo]
3 [bar]
from there we can get the ids which have both foo and bar:
foo_and_bar = names_per_id.apply(lambda row: ('foo' in row) and ('bar' in row))
id
1 True
2 False
3 False
finally, calculating the percentage you asked for:
pct = foo_and_bar.mean()
0.33333333333333331
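Putting the three steps together as a self-contained sketch (the sample DataFrame is rebuilt here for convenience):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3],
                   'name': ['foo', 'bar', 'foo', 'bar']})

# Unique names per id, e.g. id 1 -> [foo, bar]
names_per_id = df.groupby('id')['name'].unique()

# True for ids whose names include both 'foo' and 'bar'
foo_and_bar = names_per_id.apply(lambda names: ('foo' in names) and ('bar' in names))

# The mean of a boolean Series is the fraction of True values
pct = foo_and_bar.mean()
print(pct)  # 0.3333333333333333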
Upvotes: 2
Reputation: 62007
Here is a general solution. Just replace the ['foo', 'bar'] list with a custom list of your choice.
import numpy as np

res = df.groupby('id')['name'].agg(lambda s: np.all(np.in1d(['foo', 'bar'], s)))
Which outputs this:
id
1 True
2 False
3 False
Then just take the mean of this Series:
res.mean()
Output:
0.3333333333333333
To keep it all in pandas, do:
df.groupby('id')['name'].agg(lambda s: pd.Series(['foo', 'bar']).isin(s).all()).mean()
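For completeness, a runnable sketch of both variants (imports and sample data added here; np.isin is the modern spelling of np.in1d, and the targets list is just a placeholder to customize):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3],
                   'name': ['foo', 'bar', 'foo', 'bar']})

targets = ['foo', 'bar']  # replace with your own list of names

# NumPy variant: True where every target name appears in the group
res = df.groupby('id')['name'].agg(lambda s: np.all(np.isin(targets, s)))
print(res.mean())  # 0.3333333333333333

# Pure-pandas variant: the same test via Series.isin
pct = df.groupby('id')['name'].agg(lambda s: pd.Series(targets).isin(s).all()).mean()
print(pct)  # 0.3333333333333333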
Upvotes: 1