Reputation: 253
I have a Pandas data frame where I am seeing duplicate rows, although they are not technically duplicates: the values are just arranged in a different order. I am trying to figure out how to remove these duplicate rows without considering the order of the values.
See below for my example:
ID1 Name1  ID2 Name2
1   Matt   2   John
2   John   1   Matt
3   Jeff   1   Matt
Expected Output
ID1 Name1  ID2 Name2
1   Matt   2   John
1   Matt   3   Jeff
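For reference, the input frame can be rebuilt like this (a minimal sketch; integer IDs are assumed from the example):
import pandas as pd

df = pd.DataFrame({"ID1": [1, 2, 3],
                   "Name1": ["Matt", "John", "Jeff"],
                   "ID2": [2, 1, 1],
                   "Name2": ["John", "Matt", "Matt"]})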
Upvotes: 2
Views: 601
Reputation: 28709
Working on the premise that the order of the data does not matter:
Convert the dataframe to string -> move into numpy land -> sort each row (this puts numbers before letters) -> return to pandas and drop duplicates.
import numpy as np

# sort each row's string values, then drop duplicate (ID1, ID2) pairs
res = (pd.DataFrame(np.sort(df.astype(str).to_numpy()),
                    columns=["ID1", "ID2", "Name1", "Name2"])
         .drop_duplicates(["ID1", "ID2"]))
print(res)
  ID1 ID2 Name1 Name2
0   1   2  John  Matt
2   1   3  Jeff  Matt
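One caveat: the sort is lexicographic on strings, so multi-digit IDs order as text ("10" sorts before "2"). Deduplication still works, because both orderings of a pair sort to the same row, but the ID1/ID2 columns may not end up numerically ordered. A quick check of that assumption:
import numpy as np

# "10" < "2" lexicographically, so 10 lands in the ID1 column
print(np.sort(np.array([["2", "Bob", "10", "Ann"]])))
# [['10' '2' 'Ann' 'Bob']]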
Upvotes: 0
Reputation: 28303
Switch the IDs and names wherever ID1 > ID2, then drop duplicates as usual.
# where ID1 > ID2, swap the (ID, Name) pairs so the smaller ID always comes first
df.loc[df.ID1 > df.ID2, df.columns] = df.loc[df.ID1 > df.ID2, df.columns[[2, 3, 0, 1]]].values
df.drop_duplicates()
   ID1 Name1  ID2 Name2
0    1  Matt    2  John
2    1  Matt    3  Jeff
Upvotes: 1
Reputation: 5768
This works, albeit it is a bit ugly: map both IDs into a single "uid" that is identical when ID1 of row A equals ID2 of row B and vice versa, and do likewise for the names. Then group by that "uid" (in quotes because it isn't unique, but is desired to be unique). For each group of length > 1, take the first row, then concatenate those first rows with the groups of length 1.
# order-insensitive keys: the sorted (ID, ID) and (Name, Name) pairs
df['multID'] = df.apply(lambda r: sorted([r['ID1'], r['ID2']]), axis=1)
df['multName'] = df.apply(lambda r: sorted([r['Name1'], r['Name2']]), axis=1)
df['uid'] = df.apply(lambda r: str([r['multName'], r['multID']]), axis=1)
g = df.groupby('uid')
# first row of each duplicated group, plus all the singleton groups
df2 = pd.concat([g.filter(lambda x: len(x) > 1).groupby('uid').head(1),
                 g.filter(lambda x: len(x) == 1)], axis=0)
The list has to be converted to a string, otherwise the filter raises TypeError: unhashable type: 'list'.
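A hashable tuple would serve just as well as the key and avoids the string conversion; a minimal sketch of that variant:
# tuples are hashable, so they can be grouped on directly
df['uid'] = df.apply(lambda r: (tuple(r['multName']), tuple(r['multID'])), axis=1)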
Upvotes: 0