Pandas - Remove duplicates from two dataframes with different columns

Question

I have two dataframes: a main one (df) and one containing the rows I want to delete from the main one (dfmatch). The main df has more columns than dfmatch.

I only want to delete a row from the main df if column1, column2 AND column3 all equal the values in the corresponding columns of a row in dfmatch.

The columns extra1 and extra2 should be kept in dfnew as well.

My current script only shows the column headers instead of the remaining rows:

import pandas as pd

file = 'testdf.csv'
colnames=['column1', 'column2', 'column3', 'extra1', 'extra2'] 
df = pd.read_csv(file, names=colnames, header=None)

file = 'testdfmatch.csv'
colnames=['column1', 'column2', 'column3'] 
dfmatch = pd.read_csv(file, names=colnames, header=None)

dfnew = pd.concat([dfmatch,df,df], sort=False).drop_duplicates(['column1', 'column2', 'column3'], keep=False)

wwnde · Accepted Answer

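Sample data would have been useful. Let's try pd.merge with indicator=, which adds a column recording whether each row was found only in df ('left_only') or in both dataframes ('both').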

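# Left merge on the shared columns (column1, column2, column3); 'Exist' marks where each row was found
dfnew = pd.merge(df, dfmatch, how='left', indicator='Exist')
# Keep only the rows of df that have no match in dfmatch
dfnew = dfnew.loc[dfnew['Exist'] != 'both']
# Drop the helper column (reassigning avoids modifying a slice in place)
dfnew = dfnew.drop(columns=['Exist'])
print(dfnew)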
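
Since no sample data was posted, here is a minimal sketch of the same approach on made-up data. The values below are hypothetical stand-ins for testdf.csv and testdfmatch.csv, and the `keys` and `merged` names are introduced only for illustration. It also runs the concat expression from the question, which seems to explain the empty result: df is listed twice, so with keep=False every row of df counts as a duplicate of itself and is dropped.

```python
import pandas as pd

# Hypothetical sample data (assumption: stands in for testdf.csv / testdfmatch.csv)
df = pd.DataFrame({
    "column1": [1, 2, 3, 4],
    "column2": ["a", "b", "c", "d"],
    "column3": [10, 20, 30, 40],
    "extra1": ["w", "x", "y", "z"],
    "extra2": [0.1, 0.2, 0.3, 0.4],
})
dfmatch = pd.DataFrame({
    "column1": [2, 4],
    "column2": ["b", "d"],
    "column3": [20, 40],
})

keys = ["column1", "column2", "column3"]

# Left merge on the key columns; the indicator column 'Exist' is
# 'both' when a df row has a match in dfmatch, 'left_only' otherwise.
# drop_duplicates guards against repeated rows in dfmatch multiplying df rows.
merged = pd.merge(df, dfmatch.drop_duplicates(keys), how="left", on=keys, indicator="Exist")

# Keep only the unmatched rows, then remove the helper column.
dfnew = merged.loc[merged["Exist"] == "left_only"].drop(columns="Exist")
print(dfnew)  # the rows with column1 == 1 and column1 == 3 remain, extras included

# The expression from the question: every df row appears twice in the concat,
# so keep=False removes all of them, leaving only the headers on this data.
broken = pd.concat([dfmatch, df, df], sort=False).drop_duplicates(keys, keep=False)
print(len(broken))
```

Passing on= explicitly is a design choice rather than a requirement: without it, merge joins on all column names the two dataframes share, which here happen to be the same three columns.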
