Finding duplicates in two dataframes and removing the duplicates from one dataframe

Question

Working in Python / pandas / dataframes

I have these two dataframes:

Dataframe one:

          1          2    3 
 1   Stockholm     100    250
 2   Stockholm     150    376
 3   Stockholm     105    235
 4   Stockholm     109    104
 5   Burnley       145    234
 6   Burnley       100    250

Dataframe two:

          1          2    3 
 1   Stockholm     100    250
 2   Stockholm     117    128
 3   Stockholm     105    235
 4   Stockholm     100    250
 5   Burnley       145    234
 6   Burnley       100    953

And I would like to find the duplicate rows found in Dataframe one and Dataframe two and remove the duplicates from Dataframe one. As in data frame two, you can find row 1, 3, 5 in data frame one, which would remove them from data frame on and create the below:

     1           2       3 
1    Stockholm   150     376
2    Stockholm   109     104
3    Burnley     100     250

Ji Wei · Accepted Answer

Use:

df_merge = pd.merge(df1, df2, on=[1,2,3], how='inner')
df1 = df1.append(df_merge) 

df1['Duplicated'] = df1.duplicated(keep=False) # keep=False marks the duplicated row with a True
df_final = df1[~df1['Duplicated']] # selects only rows which are not duplicated.
del df_final['Duplicated'] # delete the indicator column

The idea is as follows:

do a inner join on all the columns
append the output of the inner join to df1
identify the duplicated rows in df1
select the not duplicated rows in df1

Each number corresponds to each line of code.

Finding duplicates in two dataframes and removing the duplicates from one dataframe

Answers (2)

Related Questions