How to merge two dfs which have duplicates in both

Question

I have two dataframes, df1 and df2, which both contain duplicate rows. I want to merge them. What I have tried so far is removing the duplicates from df2 only, since I need all the rows from df1.

This question might be a duplicate, but I didn't find any solution or hints for this particular scenario.

import pandas as pd

data = {'Name':['ABC', 'DEF', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
        'Age':[1,2,3,4,2,1,2,4]}
data2 = {'Name':['XYZ', 'NOP', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
        'Sex':['M','F','M','M','M','M','F','M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

dfn = df1.merge(df2.drop_duplicates('Name'),on='Name')
print(dfn)

Result of the above snippet:

  Name  Age Sex
0  ABC    1   M
1  ABC    3   M
2  ABC    4   M
3  MNO    4   M
4  XYZ    2   M
5  XYZ    1   M
6  PQR    2   F

This works perfectly well for the above data, but my real dataset is large, and there this method behaves differently: I get many more rows than expected in dfn.

I suspect the extra rows come from the larger data containing more duplicates, but I cannot afford to delete the duplicate rows from df1.

Apologies, I am not able to share the actual data as it is too large!

Edit: A sample result from the actual data: df2 after removing duplicates, and the resulting dfn. I have only one entry in df1 for both ABC and XYZ:

Thanks in advance!

Corralien · Accepted Answer

Try to drop_duplicates from df1 too:

dfn = pd.merge(df1, df2.drop_duplicates('Name'),
               on='Name', how='left')
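
As a rough sketch of how one might check that a merge like this does not add rows, the snippet below reuses the sample data from the question and passes two extra `merge` arguments: `validate='many_to_one'`, which makes pandas raise an error if the right-hand keys are not unique, and `indicator=True`, which adds a `_merge` column showing where each row came from. The names `df2_unique` and `unmatched` are only illustrative, and this is a diagnostic aid under the assumption that `Name` is the only join key, not a guaranteed fix for the real data.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Age': [1, 2, 3, 4, 2, 1, 2, 4]})
df2 = pd.DataFrame({'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']})

# Keep one row per Name in df2 so each df1 row can match at most once.
# (df2_unique is an illustrative name, not from the original post.)
df2_unique = df2.drop_duplicates('Name')

# validate='many_to_one' raises pandas.errors.MergeError if the keys
# on the right-hand side are not unique; indicator=True adds '_merge'.
dfn = pd.merge(df1, df2_unique, on='Name', how='left',
               validate='many_to_one', indicator=True)

# A left join against unique keys keeps exactly the rows of df1.
assert len(dfn) == len(df1)

# Names from df1 that have no match in df2 (their Sex value is NaN).
unmatched = dfn.loc[dfn['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)
```

If the row-count assertion or the `validate` check fails on the real data, that would suggest the key column still contains repeated values after deduplication, for example because the merge is done on a different set of columns than the one passed to `drop_duplicates`.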
