Reputation: 680
I have two dataframes:
df1:
col1 col2
1 2
1 3
2 4
df2:
col1
2
3
I want to extract all the rows in df1 where df1's col2 is not in df2's col1. So in this case it would be:
col1 col2
2 4
I first tried:
df1[df1['col2'] not in df2['col1']]
But it returned:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I then tried:
df1[df1['col2'] not in df2['col1'].tolist]
But it returned:
TypeError: argument of type 'instancemethod' is not iterable
Upvotes: 1
Views: 285
Reputation: 210842
Using the .query() method:
In [9]: df1.query('col2 not in @df2.col1')
Out[9]:
col1 col2
2 2 4
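For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes pandas is imported as pd and rebuilds df1 and df2 from the sample data in the question; the variable name result is only illustrative.

```python
import pandas as pd

# Sample frames rebuilt from the question
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# The @ prefix lets the query string refer to the local variable df2
result = df1.query("col2 not in @df2.col1")
print(result)
```

The original index label is kept, so the surviving row still carries label 2.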
Timing for bigger DFs:
In [44]: df1.shape
Out[44]: (30000000, 2)
In [45]: df2.shape
Out[45]: (20000000, 1)
In [46]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1 loop, best of 3: 5.56 s per loop
In [47]: %timeit (df1.query('col2 not in @df2.col1'))
1 loop, best of 3: 5.96 s per loop
Upvotes: 1
Reputation: 862641
You can use isin with ~ to invert the boolean mask:
print(df1['col2'].isin(df2['col1']))
0 True
1 True
2 False
Name: col2, dtype: bool
print(~df1['col2'].isin(df2['col1']))
0 False
1 False
2 True
Name: col2, dtype: bool
print(df1[~df1['col2'].isin(df2['col1'])])
col1 col2
2 2 4
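Put together as one runnable sketch (assuming pandas is imported as pd and the frames are rebuilt from the question's sample data; mask and result are just illustrative names):

```python
import pandas as pd

# Sample frames rebuilt from the question
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# True where the df1.col2 value appears anywhere in df2.col1
mask = df1["col2"].isin(df2["col1"])

# ~ flips the mask, keeping only the rows whose col2 is absent from df2.col1
result = df1[~mask]
print(result)
```

Keeping the mask in its own variable makes it easy to inspect before filtering.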
Timings:
In [8]: %timeit (df1.query('col2 not in @df2.col1'))
1000 loops, best of 3: 1.57 ms per loop
In [9]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1000 loops, best of 3: 466 µs per loop
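As for why the attempts in the question fail: Python's in / not in on a Series tests a single key against the index labels, not the values, and the key has to be hashable, so passing a whole Series raises the TypeError shown. The second attempt passes the tolist method itself because the parentheses are missing. A small sketch of that behaviour, using a hypothetical Series s:

```python
import pandas as pd

s = pd.Series([2, 3])  # values 2 and 3, default index labels 0 and 1

label_hit = 0 in s              # True: 0 is an index label
value_miss = 2 in s             # False: 2 is a value, not an index label
value_hit = 2 in s.values       # True: membership against the underlying values
list_hit = 2 in s.tolist()      # True: tolist() must be called, with parentheses

print(label_hit, value_miss, value_hit, list_hit)
```

Even with those fixes, in only answers the question for one scalar at a time, which is why the element-wise isin (or query) is what you want for filtering rows.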
Upvotes: 1