Sos
Sos

Reputation: 1949

Pandas drop duplicates based on 2 columns sometimes reversed

I have a DF that looks something like

    c1    c2    c3
1   A     B     x
2   A     C     y
3   B     A     x
4   B     D     z
5   A     B     y

As you can see, lines 1 and 3 are repeated if we disregard that c1 and c2 are different columns (or if they become reversed). However, line 5 is not. How can I drop rows based on columns c1 and c2, regardless of where the repeated values are?

Thanks in advance

Upvotes: 2

Views: 679

Answers (3)

Loochie
Loochie

Reputation: 2472

Also it could be done by sorting the row values using sorted():

df[~df[['c1','c2']].apply(lambda row: sorted(row), axis = 1).duplicated()]

Upvotes: 0

jezrael
jezrael

Reputation: 863216

You can select columns by subset, sorting by numpy.sort,create new DataFrame from array and use DataFrame.duplicated with filtering by inverse condition by boolean indexing:

df = df[~pd.DataFrame(np.sort(df[['c1','c2']], axis=1), index=df.index).duplicated()]
print (df)
  c1 c2 c3
1  A  B  x
2  A  C  y
4  B  D  z

Or:

df = df[~pd.DataFrame(np.sort(df[['c1','c2']], axis=1)).duplicated().values]

Upvotes: 3

BENY
BENY

Reputation: 323326

Ok let us try something new frozenset, will order your column in to sorted tuple , then using the duplicated

df[~df[['c1','c2']].apply(frozenset,axis=1).duplicated()]
Out[666]: 
  c1 c2 c3
1  A  B  x
2  A  C  y
4  B  D  z

Upvotes: 4

Related Questions