Tie_24

Reputation: 647

Condition based on strings contained in groups

             A      B   C
0   2002-01-12  Sarah  39
1   2002-01-12   John  17
2   2002-01-12  Susan  30
3   2002-01-15  Danny  12
4   2002-01-15  Peter  25
5   2002-01-15   John  25
6   2002-01-20   John  16
7   2002-01-20   Hung  10
8   2002-02-20   John  20
9   2002-02-20  Susan  40
10  2002-02-24  Rebel  40
11  2002-02-24  Susan  15
12  2002-02-24   Mark  38
13  2002-02-24  Susan  30

I want to select full A groups that contain John and Susan.

Output should be:

             A      B   C
0   2002-01-12  Sarah  39
1   2002-01-12   John  17
2   2002-01-12  Susan  30
8   2002-02-20   John  20
9   2002-02-20  Susan  40

I've tried:

df.groupby('A').apply(lambda x: ((df.B == x.John) & (df.B == x.Susan)))

Upvotes: 0

Views: 72

Answers (3)

ALollz

Reputation: 59579

You can use groupby + transform to create a flag for the groups that satisfy the condition, and then mask the original df with that flag. If you don't want to modify the original df, you can create a separate Series named flag; otherwise, you can just assign it to a column in your original df.

import pandas as pd
# As Haleemur Ali points out, use x.values to make it faster
flag = df.groupby('A').B.transform(lambda x: (x.values == 'Susan').any() & (x.values == 'John').any())

Then you can filter the df

df[flag]
#            A      B   C
#0  2002-01-12  Sarah  39
#1  2002-01-12   John  17
#2  2002-01-12  Susan  30
#8  2002-02-20   John  20
#9  2002-02-20  Susan  40
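As a sketch of the column variant mentioned above (rebuilding the sample data from the question; the `flag` column name is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['2002-01-12'] * 3 + ['2002-01-15'] * 3 + ['2002-01-20'] * 2
         + ['2002-02-20'] * 2 + ['2002-02-24'] * 4,
    'B': ['Sarah', 'John', 'Susan', 'Danny', 'Peter', 'John',
          'John', 'Hung', 'John', 'Susan', 'Rebel', 'Susan', 'Mark', 'Susan'],
    'C': [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})

# Store the group-level flag as a column instead of a separate Series
df['flag'] = df.groupby('A').B.transform(
    lambda x: (x.values == 'Susan').any() & (x.values == 'John').any()
)

# Filter on the flag, then drop it to recover the original columns
result = df[df['flag']].drop(columns='flag')
print(result)
```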

Upvotes: 1

Haleemur Ali

Reputation: 28303

Create an array of dates as the intersection of the dates that contain John and the dates that contain Susan:

dates = np.intersect1d(
    df.A.values[df.B.values == 'John'], 
    df.A.values[df.B.values == 'Susan']
)

Then use the dates array to filter the dataframe:

df[df.A.isin(dates)]

# outputs:
            A      B   C
0  2002-01-12  Sarah  39
1  2002-01-12   John  17
2  2002-01-12  Susan  30
8  2002-02-20   John  20
9  2002-02-20  Susan  40
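Note that `np.intersect1d` returns the sorted, unique values present in both arrays, so each qualifying date appears once even if a name occurs several times in a group. A minimal illustration with made-up date arrays:

```python
import numpy as np

# Hypothetical per-name date arrays (order and duplicates don't matter)
john_dates = np.array(['2002-02-20', '2002-01-12', '2002-01-15', '2002-01-20'])
susan_dates = np.array(['2002-01-12', '2002-02-20', '2002-02-24', '2002-02-24'])

# Sorted unique intersection of the two arrays
dates = np.intersect1d(john_dates, susan_dates)
print(dates)  # ['2002-01-12' '2002-02-20']
```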

Timings:

Comparing the solutions provided by jpp, ALollz and mine above:

The numpy-based solution is several times more efficient than the others.

In [288]: def hal(df):
     ...:     dates = np.intersect1d(
     ...:      df.A.values[df.B.values == 'John'], 
     ...:      df.A.values[df.B.values == 'Susan']
     ...:     )
     ...:     return df[df.A.isin(dates)]
     ...:

In [289]: def jpp(df):
     ...:     s = df.groupby('A')['B'].apply(set)
     ...:     return df[df['A'].map(s) >= {'John', 'Susan'}]
     ...:

In [290]: def alollz(df):
     ...:     flag = df.groupby('A').B.transform(lambda x: ((x == 'Susan').any() & (x == 'John').any()).sum().astype('bool'))
     ...:     return df[flag == True]
     ...:

In [291]: %timeit hal(df)
394 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [292]: %timeit jpp(df)
1.46 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [293]: %timeit alollz(df)
4.9 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, the solution proposed by ALollz can be sped up 2x by omitting some unneeded operations and dropping down to numpy arrays for the comparison.

In [294]: def alollz_improved(df):
     ...:     v = df.groupby('A').B.transform(lambda x: (x.values=='Susan').any() & (x.values=='John').any())
     ...:     return df[v]
     ...:

In [295]: %timeit alollz_improved(df)
2.2 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 2

jpp

Reputation: 164773

Create a Series mapping each date to its set of names. Then use set.issuperset via its syntactic sugar >=:

s = df.groupby('A')['B'].apply(set)

res = df[df['A'].map(s) >= {'John', 'Susan'}]

print(res)

            A      B   C
0  2002-01-12  Sarah  39
1  2002-01-12   John  17
2  2002-01-12  Susan  30
8  2002-02-20   John  20
9  2002-02-20  Susan  40
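The `>=` here is elementwise `set.issuperset`, so the mask keeps only rows whose date maps to a name set containing both targets. A small sketch with a trimmed-down frame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['2002-01-12', '2002-01-12', '2002-01-20'],
    'B': ['John', 'Susan', 'John'],
})

# Map each date to the set of names seen on that date
s = df.groupby('A')['B'].apply(set)

# Elementwise superset test: set >= other is set.issuperset(other)
mask = df['A'].map(s) >= {'John', 'Susan'}
print(mask.tolist())  # [True, True, False]
```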

Upvotes: 1
