Danny David Leybzon
Danny David Leybzon

Reputation: 680

Conditionally extracting Pandas rows based on another Pandas dataframe

I have two dataframes:

df1:

col1    col2
1       2
1       3
2       4

df2:

col1
2
3

I want to extract all the rows in df1 where df1's col2 not in df2's col1. So in this case it would be:

col1    col2
2       4

I first tried:

df1[df1['col2'] not in df2['col1']]

But it returned:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

I then tried:

df1[df1['col2'] not in df2['col1'].tolist]

But it returned:

TypeError: argument of type 'instancemethod' is not iterable

Upvotes: 1

Views: 285

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210842

using .query() method:

In [9]: df1.query('col2 not in @df2.col1')
Out[9]:
   col1  col2
2     2     4

Timing for bigger DFs:

In [44]: df1.shape
Out[44]: (30000000, 2)

In [45]: df2.shape
Out[45]: (20000000, 1)

In [46]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1 loop, best of 3: 5.56 s per loop

In [47]: %timeit (df1.query('col2 not in @df2.col1'))
1 loop, best of 3: 5.96 s per loop

Upvotes: 1

jezrael
jezrael

Reputation: 862641

You can use isin with ~ for inverting boolean mask:

print (df1['col2'].isin(df2['col1']))
0     True
1     True
2    False
Name: col2, dtype: bool

print (~df1['col2'].isin(df2['col1']))
0    False
1    False
2     True
Name: col2, dtype: bool

print (df1[~df1['col2'].isin(df2['col1'])])
   col1  col2
2     2     4

Timings:

In [8]: %timeit (df1.query('col2 not in @df2.col1'))
1000 loops, best of 3: 1.57 ms per loop

In [9]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1000 loops, best of 3: 466 µs per loop

Upvotes: 1

Related Questions