Kquinn86
Kquinn86

Reputation: 3

How to do a full outer join excluding the intersection between two pandas dataframes?

I have two datasets with identical column headers and I would like to remove ALL data that is 100% identical, and just have what they do not have exactly in common remaining. How could I go about doing that?

enter image description here

Thank you for your time!

Upvotes: 0

Views: 2151

Answers (1)

pbsb
pbsb

Reputation: 105

To get everything BUT the intersection of two pandas datasets, try this:

# Everything from the first except what is on second
r1 = df1[~df1.isin(df2)]

# Everything from the second except what is on first
r2 = df2[~df2.isin(df1)]

# concatenate and drop NANs
result = pd.concat(
    [r1, r2]
).dropna().reset_index(drop=True)

There is one caveat though, when filtering with boolean masks, your int values might turn into floats. By default, pandas replaces unwanted (False) values with the float version of NAN and converts the entire column to float. You can see this happening in the example below.

To circumvent this, explicitly declare the datatype when creating the dataframe.

Example

import pandas as pd

df1 = pd.read_csv("./csv1.csv") #, dtype='Int64')
print(f"csv1\n{df1}\n")

df2 = pd.read_csv("./csv2.csv") #, dtype='Int64')
print(f"csv2\n{df2}\n")

# Everything from first except what is on second
r1 = df1[~df1.isin(df2)]
# Everything from second except what is on first
r2 = df2[~df2.isin(df1)]

# concatenate and drop NANs
result = pd.concat(
    [r1, r2]
).dropna().reset_index(drop=True)

print(f"result\n{result}\n")

Input

csv1
   A   B   C
0  1   2   3
1  4   5   6
2  7   8   9

csv2
    A   B   C
0   1   2   3
1   4   5   6
2  10  11  12

Output

result
      A     B     C
0   7.0   8.0   9.0
1  10.0  11.0  12.0

Upvotes: 2

Related Questions