Reputation: 647
A B C
0 2002-01-12 Sarah 39
1 2002-01-12 John 17
2 2002-01-12 Susan 30
3 2002-01-15 Danny 12
4 2002-01-15 Peter 25
5 2002-01-15 John 25
6 2002-01-20 John 16
7 2002-01-20 Hung 10
8 2002-02-20 John 20
9 2002-02-20 Susan 40
10 2002-02-24 Rebel 40
11 2002-02-24 Susan 15
12 2002-02-24 Mark 38
13 2002-02-24 Susan 30
I want to select full A
groups that contain John
and Susan
.
Output should be:
A B C
0 2002-01-12 Sarah 39
1 2002-01-12 John 17
2 2002-01-12 Susan 30
6 2002-01-20 John 16
7 2002-01-20 Hung 10
8 2002-02-20 John 20
9 2002-02-20 Susan 40
I've tried:
df.groupby('A').apply(lambda x: ((df.B == x.John) & (df.B == x.Susan)))
Upvotes: 0
Views: 72
Reputation: 59579
You can use groupby
+ transform
to create a flag for the groups that satisfy that condition. And then you can mask the original df
by that flag. If you don't want to modify the original df
you can create a separate Series
named flag
, otherwise you can also just assign it to a column in your original df
import pandas as pd
# As Haleemur Ali points out, use x.values to make it faster
flag = df.groupby('A').B.transform(lambda x: (x.values == 'Susan').any() & (x.values == 'John').any())
Then you can filter the df
df[flag]
# A B C
#0 2002-01-12 Sarah 39
#1 2002-01-12 John 17
#2 2002-01-12 Susan 30
#8 2002-02-20 John 20
#9 2002-02-20 Susan 40
Upvotes: 1
Reputation: 28303
create an array of dates as the intersection of dates that contain John
& dates that contain Susan
:
dates = np.intersect1d(
df.A.values[df.B.values == 'John'],
df.A.values[df.B.values == 'Susan']
)
then use the dates array to filter the dataframe
df[df.A.isin(dates)]
# outputs:
A B C
0 2002-01-12 Sarah 39
1 2002-01-12 John 17
2 2002-01-12 Susan 30
8 2002-02-20 John 20
9 2002-02-20 Susan 40
Comparing the solutions provided by jpp, ALollz and mine above:
The numpy based solution is several times more efficient than the others.
In [288]: def hal(df):
...: dates = np.intersect1d(
...: df.A.values[df.B.values == 'John'],
...: df.A.values[df.B.values == 'Susan']
...: )
...: return df[df.A.isin(dates)]
...:
In [289]: def jpp(df):
...: s = df.groupby('A')['B'].apply(set)
...: return df[df['A'].map(s) >= {'John', 'Susan'}]
...:
In [290]: def alollz(df):
...: flag = df.groupby('A').B.transform(lambda x: ((x=='Susan').any() & (x == 'John').any()).sum().astype('boo
...: l'))
...: return df[flag==True]
...:
In [291]: %timeit hal(df)
394 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [292]: %timeit jpp(df)
1.46 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [293]: %timeit alollz(df)
4.9 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
However, the solution proposed by ALollz can be sped up 2X by omitting some extra unneeded operations and going down to numpy arrays for comparison.
In [294]: def alollz_improved(df):
...: v = df.groupby('A').B.transform(lambda x: (x.values=='Susan').any() & (x.values=='John').any())
...: return df[v]
...:
In [295]: %timeit alollz_improved(df)
2.2 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 2
Reputation: 164773
Create a series mapping each date to a set
of names. Then use set.issuperset
via syntactic sugar >=
:
s = df.groupby('A')['B'].apply(set)
res = df[df['A'].map(s) >= {'John', 'Susan'}]
print(res)
A B C
0 2002-01-12 Sarah 39
1 2002-01-12 John 17
2 2002-01-12 Susan 30
8 2002-02-20 John 20
9 2002-02-20 Susan 40
Upvotes: 1