Avoid Duplicates Pandas Merge Between a Dataframe and its Copy

I need to find matches in a column of the same dataframe, that I'm doing is copying the dataframe and making a merge between de dataframe and its copy, but there is a way to avoid duplicates when 2 columns are equal and when the same result is shown previously. For Example:

df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']

df2 = df1

df3 = pd.merge(df1,df2,on='Tel',how='inner')

The result is the following:

    Id_x Tel Id_y
0   001 123 001
1   001 123 004
2   001 123 006
3   004 123 001
4   004 123 004
5   004 123 006
6   006 123 001
7   006 123 004
8   006 123 006
9   002 456 002
10  003 789 003
11  005 852 005

But I want the following result:

    Id_x Tel Id_y
0   001 123 004
1   001 123 006
2   004 123 006

As you can see I need to ignore the results when Id_x == Id_y, but also I need to ignore when the same result is shown previously in different order, for example in the first result index 1 is the same result of the index 3, index 2 is the same result of index 6 and index 5 is the same result of index 7. So in the final result I just want index 1, index 2 and index 5.

Is there a way to do that?

Thank you so much!

Upvotes: 1

Views: 441

Answers (2)

Corralien
Corralien

Reputation: 120509

Create tuples from Id_x and Id_y then sorted them and drop duplicates:

>>> df3[df3[['Id_x', 'Id_y']].apply(lambda x: sorted(tuple(x)), axis=1) 
                             .duplicated(keep='last')]

  Id_x  Tel Id_y
1  001  123  004
2  001  123  006
5  004  123  006

Update

Can you help me in the cases where a Id is not in both columns? For example suppose that df2 has an aditional row with Id: 007 and Tel: 852, this Id merge with Id: 005, but when I apply the tuple in df3 the Id: 007 is deleted

df3 = pd.merge(df1,df2,on='Tel',how='inner').query('Id_x != Id_y')
df3 = df3[~df3[['Id_x', 'Id_y']].apply(lambda x: sorted(tuple(x)), axis=1)
                                .duplicated(keep='first')]
print(df3)

# Output:
   Id_x  Tel Id_y
1   001  123  004
2   001  123  006
5   004  123  006
12  005  852  007

Setup:

df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']

df2 = df1.copy()
df2 = df2.append(pd.DataFrame({'Id': ['007'], 'Tel': ['852']}))

Upvotes: 1

stutterjones
stutterjones

Reputation: 1

Sort of a roundabout solution but this should get you the answer you are looking for

df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']

df1 = df1.drop_duplicates()
df2 = df1
df3 = pd.merge(df1,df2,on='Tel',how='inner')
df3 = df3[df3['Id_x'] != df3['Id_y']]

Upvotes: 0

Related Questions