Sibghat Khan

Reputation: 147

Check which rows of pandas exist in another

I have two pandas DataFrames of different sizes (at least 500,000 rows in each). For simplicity, call them df1 and df2 . I'm interested in finding the rows of df1 which are not present in df2. Neither data frame is necessarily a subset of the other. Also, the order of the rows does not matter.

For example, the ith row of df1 may be the jth row of df2, and I need to consider it as present (order won't matter). Another important thing is that both data frames may contain null values, so the comparison has to handle those as well.

A simple example of both data frames would be:

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, numpy.nan, 50]})

df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, numpy.nan, 13, 14, 50]})

in this case the solution would be:

df3 = pandas.DataFrame(data = {'col1' : [1, 2], 'col2' : [10, 11]})

Please note that in reality, both data frames have 15 columns (exactly the same column names and exactly the same data types). Also, I'm using Python 2.7 in a Jupyter Notebook on Windows 7. I have tried pandas' built-in df1.isin(df2), but it does not give the results I want.
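To illustrate why isin falls short here (a minimal sketch): when passed a DataFrame, isin compares element by element at matching index and column labels, rather than testing whole-row membership.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 100], 'col2': [10, 11, np.nan, 50]})
df2 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 100], 'col2': [20, 21, np.nan, 13, 14, 50]})

# isin with a DataFrame argument matches element-by-element at the same
# index/column label -- it does not test whole-row membership.
res = df1.isin(df2)
print(res)
# e.g. the 100 at index 3 of df1's col1 is reported False, even though
# 100 does appear in df2's col1 -- just at a different index.
```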

Moreover, I have also seen this question, but it assumes that one data frame is a subset of the other, which is not necessarily true in my case.

Upvotes: 1

Views: 118

Answers (2)

jpp

Reputation: 164623

Here's one way:

import pandas as pd, numpy as np

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, np.nan, 50]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, np.nan, 13, 14, 50]})

x = set(map(tuple, df1.fillna(-1).values)) - set(map(tuple, df2.fillna(-1).values))
# {(1.0, 10.0), (2.0, 11.0)}

pd.DataFrame(list(x), columns=['col1', 'col2'])

If you have np.nan data in your result, it'll come through as -1, but you can easily convert it back. This assumes you won't have -1 in your underlying data; if you might, replace it with some impossible value.

The reason for the complication is that np.nan == np.nan is considered False.
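An alternative sketch that avoids the set conversion: a left merge on all columns with indicator=True, again filling NaN with a sentinel so null rows can match (this assumes -1 never occurs in the real data).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 100], 'col2': [10, 11, np.nan, 50]})
df2 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 100], 'col2': [20, 21, np.nan, 13, 14, 50]})

sentinel = -1  # assumption: -1 does not appear in the underlying data

# Left-merge on all shared columns; '_merge' marks where each row came from.
merged = df1.fillna(sentinel).merge(
    df2.fillna(sentinel).drop_duplicates(),
    how='left', indicator=True
)

# Keep only rows found in df1 but not df2, then restore the NaNs.
df3 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
df3 = df3.replace(sentinel, np.nan)
print(df3)
```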

Upvotes: 2

BENY

Reputation: 323226

Here is one solution:

pd.concat([df1,df2.loc[df2.col1.isin(df1.col1)]],keys=[1,2]).drop_duplicates(keep=False).loc[1]
Out[892]: 
   col1  col2
0     1  10.0
1     2  11.0
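For readers following along, the same one-liner spelled out with comments. Note that drop_duplicates treats NaN values in the same position as equal, which is why this approach handles nulls without a sentinel fill.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 100], 'col2': [10, 11, np.nan, 50]})
df2 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 100], 'col2': [20, 21, np.nan, 13, 14, 50]})

# Tag each frame with a key (1 for df1, 2 for df2), drop every row that
# appears in both frames (keep=False removes all copies of a duplicate),
# then keep only the surviving rows that came from df1.
out = (pd.concat([df1, df2.loc[df2.col1.isin(df1.col1)]], keys=[1, 2])
         .drop_duplicates(keep=False)
         .loc[1])
print(out)
```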

Upvotes: 1
