Tom Benson
Tom Benson

Reputation: 125

Finding duplicates in two dataframes and removing the duplicates from one dataframe

Working in Python / pandas / dataframes

I have these two dataframes:

Dataframe one:

          1          2    3 
 1   Stockholm     100    250
 2   Stockholm     150    376
 3   Stockholm     105    235
 4   Stockholm     109    104
 5   Burnley       145    234
 6   Burnley       100    250

Dataframe two:

          1          2    3 
 1   Stockholm     100    250
 2   Stockholm     117    128
 3   Stockholm     105    235
 4   Stockholm     100    250
 5   Burnley       145    234
 6   Burnley       100    953

And I would like to find the duplicate rows found in Dataframe one and Dataframe two and remove the duplicates from Dataframe one. As in data frame two, you can find row 1, 3, 5 in data frame one, which would remove them from data frame on and create the below:

     1           2       3 
1    Stockholm   150     376
2    Stockholm   109     104
3    Burnley     100     250

Upvotes: 11

Views: 22298

Answers (2)

Dimitris Paraschakis
Dimitris Paraschakis

Reputation: 651

A one-liner:

df1.merge(df2, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)

Output:

           1    2    3
2  Stockholm  150  376
4  Stockholm  109  104
6    Burnley  100  250

Upvotes: 5

Ji Wei
Ji Wei

Reputation: 881

Use:

df_merge = pd.merge(df1, df2, on=[1,2,3], how='inner')
df1 = df1.append(df_merge) 

df1['Duplicated'] = df1.duplicated(keep=False) # keep=False marks the duplicated row with a True
df_final = df1[~df1['Duplicated']] # selects only rows which are not duplicated.
del df_final['Duplicated'] # delete the indicator column

The idea is as follows:

  1. do a inner join on all the columns
  2. append the output of the inner join to df1
  3. identify the duplicated rows in df1
  4. select the not duplicated rows in df1

Each number corresponds to each line of code.

Upvotes: 11

Related Questions