Reputation: 147
I have two Pandas Data Frames of different sizes (at least 500,000 rows in both of them). For simplicity, you can call them df1
and df2
. I'm interested in finding the rows of df1
which are not present in df2
. It is not necessary that any of the data frames would be the subset of the other. Also, the order of the rows does not matter.
For example, i
th observation in df1
may be j
th observation in df2
and I need to consider it as being present (order won't matter). Another important thing is that both data frames may contain null values (so the operation has to work also for that).
A simple example of both data frame would be
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, NaN, 50})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, NaN, 13, 14, 50]})
in this case the solution would be
df3 = pandas.DataFrame(data = {'col1' : [1, 2 ], 'col2' : [10, 11]})
Please note that in reality, both data frames have 15 columns (exactly same columns names, exact same data type). Also, I'm using Python 2.7 on Jupyter Notebook on windows 7. I have used Pandas built in function df1.isin(df2)
but it does not provide the accurate results that I want.
Moreover, I have also seen this question
but this assumes that one data frame is the subset of another which is not necessarily true in my case.
Upvotes: 1
Views: 118
Reputation: 164623
Here's one way:
import pandas as pd, numpy as np
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, np.nan, 50]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, np.nan, 13, 14, 50]})
x = set(map(tuple, df1.fillna(-1).values)) - set(map(tuple, df2.fillna(-1).values))
# {(1.0, 10.0), (2.0, 11.0)}
pd.DataFrame(list(x), columns=['col1', 'col2'])
If you have np.nan
data in your result, it'll come through as -1, but you can easily convert back. Assumes you won't have negative numbers in your underlying data [if so, replace by some impossible value].
The reason for the complication is that np.nan
== np.nan
is considered False
.
Upvotes: 2
Reputation: 323226
Here is on solution
pd.concat([df1,df2.loc[df2.col1.isin(df1.col1)]],keys=[1,2]).drop_duplicates(keep=False).loc[1]
Out[892]:
col1 col2
0 1 10.0
1 2 11.0
Upvotes: 1