Reputation: 59
I need to find matches in a column of the same dataframe, that I'm doing is copying the dataframe and making a merge between de dataframe and its copy, but there is a way to avoid duplicates when 2 columns are equal and when the same result is shown previously. For Example:
df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']
df2 = df1
df3 = pd.merge(df1,df2,on='Tel',how='inner')
The result is the following:
Id_x Tel Id_y
0 001 123 001
1 001 123 004
2 001 123 006
3 004 123 001
4 004 123 004
5 004 123 006
6 006 123 001
7 006 123 004
8 006 123 006
9 002 456 002
10 003 789 003
11 005 852 005
But I want the following result:
Id_x Tel Id_y
0 001 123 004
1 001 123 006
2 004 123 006
As you can see I need to ignore the results when Id_x == Id_y, but also I need to ignore when the same result is shown previously in different order, for example in the first result index 1 is the same result of the index 3, index 2 is the same result of index 6 and index 5 is the same result of index 7. So in the final result I just want index 1, index 2 and index 5.
Is there a way to do that?
Thank you so much!
Upvotes: 1
Views: 441
Reputation: 120509
Create tuples from Id_x
and Id_y
then sorted them and drop duplicates:
>>> df3[df3[['Id_x', 'Id_y']].apply(lambda x: sorted(tuple(x)), axis=1)
.duplicated(keep='last')]
Id_x Tel Id_y
1 001 123 004
2 001 123 006
5 004 123 006
Update
Can you help me in the cases where a Id is not in both columns? For example suppose that df2 has an aditional row with Id: 007 and Tel: 852, this Id merge with Id: 005, but when I apply the tuple in df3 the Id: 007 is deleted
df3 = pd.merge(df1,df2,on='Tel',how='inner').query('Id_x != Id_y')
df3 = df3[~df3[['Id_x', 'Id_y']].apply(lambda x: sorted(tuple(x)), axis=1)
.duplicated(keep='first')]
print(df3)
# Output:
Id_x Tel Id_y
1 001 123 004
2 001 123 006
5 004 123 006
12 005 852 007
Setup:
df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']
df2 = df1.copy()
df2 = df2.append(pd.DataFrame({'Id': ['007'], 'Tel': ['852']}))
Upvotes: 1
Reputation: 1
Sort of a roundabout solution but this should get you the answer you are looking for
df1 = pd.DataFrame()
df1['Id'] = ['001','002','003','004','005','006']
df1['Tel'] = ['123','456','789','123','852','123']
df1 = df1.drop_duplicates()
df2 = df1
df3 = pd.merge(df1,df2,on='Tel',how='inner')
df3 = df3[df3['Id_x'] != df3['Id_y']]
Upvotes: 0