Reputation: 71
I'm trying to find a way to use pandas drop_duplicates() to recognize rows as duplicates when their values are in reverse order.
For example, I am trying to find transactions where customers purchase both apples and bananas, but the data collection may have recorded the items in either order. In other words, taken as a full order, the transaction is a duplicate because it is made up of the same items.
I want the following to be recognized as duplicates:
Item1 Item2
Apple Banana
Banana Apple
Upvotes: 3
Views: 1733
Reputation: 863206
First sort each row with apply and sorted, then call drop_duplicates:
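# result_type='broadcast' keeps the original shape; newer pandas versions otherwise return a Series of lists
df = df.apply(sorted, axis=1, result_type='broadcast').drop_duplicates()
print(df)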
Item1 Item2
0 Apple Banana
# if you need to specify the columns
cols = ['Item1','Item2']
df[cols] = df[cols].apply(sorted, axis=1, result_type='broadcast')
df = df.drop_duplicates(subset=cols)
print(df)
Item1 Item2
0 Apple Banana
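To illustrate the column-specific variant on a frame that has other columns too, here is a minimal sketch. The Qty column and the third row are hypothetical additions, not part of the question, and it sorts with numpy.sort rather than apply, which should behave the same way for this purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the question's two rows plus an extra row and a Qty column.
orders = pd.DataFrame({
    "Item1": ["Apple", "Banana", "Cherry"],
    "Item2": ["Banana", "Apple", "Apple"],
    "Qty": [1, 2, 3],
})

cols = ["Item1", "Item2"]

# Sort only the item columns within each row; Qty is left untouched.
orders[cols] = np.sort(orders[cols].to_numpy(), axis=1)

# Rows whose item pair matches an earlier row are dropped (first occurrence kept).
orders = orders.drop_duplicates(subset=cols)
print(orders)
```

Note that the surviving rows hold the sorted item order, so the Cherry/Apple row comes out as Apple/Cherry.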
Another solution with numpy.sort and the DataFrame constructor:
# assumes: import numpy as np, import pandas as pd
df = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).drop_duplicates()
print(df)
Item1 Item2
0 Apple Banana
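Both solutions overwrite the rows with their sorted values. If the original item order of the surviving rows should be preserved, one possible variation (a sketch, not part of the answer as posted) is to sort a copy, flag repeats with duplicated(), and filter the original frame with that mask. The sample data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the question's two rows plus one extra row.
df = pd.DataFrame({
    "Item1": ["Apple", "Banana", "Cherry"],
    "Item2": ["Banana", "Apple", "Apple"],
})

# Sorted copy: reversed pairs become identical rows.
sorted_rows = pd.DataFrame(
    np.sort(df.to_numpy(), axis=1), index=df.index, columns=df.columns
)

# Flag repeats on the sorted copy, then filter the original frame,
# so the kept rows retain the order in which the items were recorded.
deduped = df[~sorted_rows.duplicated()]
print(deduped)
```

Here the Cherry/Apple row is kept as recorded, and only the Banana/Apple row is dropped as a repeat of Apple/Banana.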
Upvotes: 5