Matthew Metros

Reputation: 253

How to drop a duplicate row in Pandas when the duplicate is in a different order?

I have a Pandas data frame where I am seeing duplicate rows, although they are not technically duplicated. The values are just arranged in a different order. I am trying to figure out how to remove the duplicate row without considering the order of the data.

See my example below:

ID1   Name1      ID2      Name2
  1    Matt        2       John
  2    John        1       Matt
  3    Jeff        1       Matt

Expected Output

ID1    Name1      ID2     Name2
  1     Matt        2      John
  1     Matt        3      Jeff
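
For reference, the sample frame above can be built like this (column names taken from the tables; the pandas import is assumed):

```python
import pandas as pd

# Sample data from the question: rows 0 and 1 are the same pair
# (Matt/John) with the IDs and names in swapped column order.
df = pd.DataFrame({
    "ID1":   [1, 2, 3],
    "Name1": ["Matt", "John", "Jeff"],
    "ID2":   [2, 1, 1],
    "Name2": ["John", "Matt", "Matt"],
})
print(df)
```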

Upvotes: 2

Views: 601

Answers (3)

sammywemmy

Reputation: 28709

Working on the premise that the order of the data does not matter:

Convert the dataframe to strings -> move into numpy land -> sort each row (digits sort before letters, so the IDs land in the first two columns) -> return to pandas and drop duplicates on the ID columns

import numpy as np
import pandas as pd

res = (pd.DataFrame(np.sort(df.astype(str).to_numpy()),
                    columns=["ID1", "ID2", "Name1", "Name2"])
         .drop_duplicates(["ID1", "ID2"]))

print(res)


   ID1  ID2 Name1   Name2
0   1   2   John    Matt
2   1   3   Jeff    Matt

Upvotes: 0

Haleemur Ali

Reputation: 28303

Switch the ids & names if ID1 > ID2. Then drop duplicates as usual.

df.loc[df.ID1 > df.ID2, df.columns] = df.loc[df.ID1 > df.ID2, df.columns[[2,3,0,1]]].values
df.drop_duplicates()
   ID1 Name1  ID2 Name2
0    1  Matt    2  John
2    1  Matt    3  Jeff
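
Note that the swap mutates `df` in place while `drop_duplicates` returns a new frame, so assign its result. A minimal end-to-end sketch of this answer, using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID1": [1, 2, 3], "Name1": ["Matt", "John", "Jeff"],
    "ID2": [2, 1, 1], "Name2": ["John", "Matt", "Matt"],
})

# Rows where the pair is stored "backwards": swap (ID1, Name1) with (ID2, Name2).
swap = df.ID1 > df.ID2
df.loc[swap, df.columns] = df.loc[swap, df.columns[[2, 3, 0, 1]]].values

# Every pair is now in canonical order, so exact duplicates can be dropped.
res = df.drop_duplicates()
print(res)
```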

Upvotes: 1

jeremy_rutman

Reputation: 5768

This works, albeit it is a bit ugly: map both IDs into a single "uid" that is the same when the ID1 of row A equals the ID2 of row B and vice versa, and do likewise for the names. Then group by that "uid" (in quotes because it isn't unique, but is desired to be unique). For groups of length > 1, keep only the first row, then concat those rows with the groups of length 1.

df['multID'] = df.apply(lambda r: sorted([r['ID1'], r['ID2']]), axis=1)
df['multName'] = df.apply(lambda r: sorted([r['Name1'], r['Name2']]), axis=1)
df['uid'] = df.apply(lambda r: str([r['multName'], r['multID']]), axis=1)
g = df.groupby('uid')
df2 = pd.concat([g.filter(lambda x: len(x) > 1).drop_duplicates('uid'),
                 g.filter(lambda x: len(x) == 1)], axis=0)

The list has to be converted to a string, otherwise the filter throws 'unhashable type'.
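
The `str()` step can be avoided by using tuples, which are hashable. A shorter sketch of the same idea, assuming the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "ID1": [1, 2, 3], "Name1": ["Matt", "John", "Jeff"],
    "ID2": [2, 1, 1], "Name2": ["John", "Matt", "Matt"],
})

# Order-insensitive key per row: tuples are hashable, so no str() needed.
key = df.apply(lambda r: tuple(sorted([r["ID1"], r["ID2"]])), axis=1)

# Keep the first row seen for each key.
res = df.loc[~key.duplicated()]
print(res)
```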

Upvotes: 0
