Reputation: 3
I have two datasets with identical column headers and I would like to remove ALL data that is 100% identical, and just have what they do not have exactly in common remaining. How could I go about doing that?
Thank you for your time!
Upvotes: 0
Views: 2151
Reputation: 105
To get everything BUT the intersection of two pandas datasets, try this:
# Everything from the first except what is on second
r1 = df1[~df1.isin(df2)]
# Everything from the second except what is on first
r2 = df2[~df2.isin(df1)]
# concatenate and drop NANs
result = pd.concat(
[r1, r2]
).dropna().reset_index(drop=True)
There is one caveat though, when filtering with boolean masks, your int values might turn into floats. By default, pandas replaces unwanted (False) values with the float version of NAN and converts the entire column to float. You can see this happening in the example below.
To circumvent this, explicitly declare the datatype when creating the dataframe.
import pandas as pd
df1 = pd.read_csv("./csv1.csv") #, dtype='Int64')
print(f"csv1\n{df1}\n")
df2 = pd.read_csv("./csv2.csv") #, dtype='Int64')
print(f"csv2\n{df2}\n")
# Everything from first except what is on second
r1 = df1[~df1.isin(df2)]
# Everything from second except what is on first
r2 = df2[~df2.isin(df1)]
# concatenate and drop NANs
result = pd.concat(
[r1, r2]
).dropna().reset_index(drop=True)
print(f"result\n{result}\n")
csv1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
csv2
A B C
0 1 2 3
1 4 5 6
2 10 11 12
result
A B C
0 7.0 8.0 9.0
1 10.0 11.0 12.0
Upvotes: 2