Reputation: 2022

Remove duplicates based on the content of two columns not the order

I have a correlation matrix that I melted into a dataframe, so now I have, for example, the following:

First      Second       Value
A          B            0.5
B          A            0.5
A          C            0.2

I want to delete only one of the first two rows, since they hold the same pair in a different order. How can I do that?

Upvotes: 0

Answers (3)

rnso

Reputation: 24535

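One can also use the following approach:

# create a new column holding the two labels of each row, sorted and joined
# (sort the labels themselves, not the characters of the concatenated string):
df['newcol'] = df.apply(lambda x: "".join(sorted([x['First'], x['Second']])), axis=1)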
print(df)

  First Second  Value newcol
0     A      B    0.5     AB
1     B      A    0.5     AB
2     A      C    0.2     AC

# get its non-duplicated indexes and remove the new column: 
df = df[~df.newcol.duplicated()].iloc[:,:3]
print(df)

  First Second  Value
0     A      B    0.5
2     A      C    0.2
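
One caveat with a joined string key: with labels longer than one character, different pairs can produce the same string (for example "AB" + "C" and "A" + "BC"). A minimal sketch of a variant that uses a sorted tuple as the key instead; the sample data and the names `key` and `result` are made up here for illustration:

```python
import pandas as pd

# Hypothetical data with multi-character labels
df = pd.DataFrame({
    "First": ["AB", "C", "A"],
    "Second": ["C", "AB", "BC"],
    "Value": [0.5, 0.5, 0.2],
})

# Build an order-independent key per row: a tuple of the two labels, sorted
key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df["First"], df["Second"])],
    index=df.index,
)

# Keep only the first occurrence of each unordered pair
result = df[~key.duplicated()]
print(result)
```

Here rows 0 and 1 are treated as the same pair, while row 2 is kept as a distinct pair.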

Upvotes: 0

cs95

Reputation: 402383

You could sort the two columns row-wise with np.sort and call duplicated on the result:

df = df.loc[~pd.DataFrame(np.sort(df.iloc[:, :2], axis=1)).duplicated()]
df

  First Second  Value
0     A      B    0.5
2     A      C    0.2

Details

np.sort(df.iloc[:, :2], axis=1)

array([['A', 'B'],
       ['A', 'B'],
       ['A', 'C']], dtype=object)

~pd.DataFrame(np.sort(df.iloc[:, :2], axis=1)).duplicated()

0     True
1    False
2     True
dtype: bool

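Sort the values within each row across the two columns, so that (A, B) and (B, A) become identical, then find which rows are duplicates. The negated mask is then used to filter the dataframe via boolean indexing, keeping the first occurrence of each pair.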
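
As a self-contained sketch of the same idea (the frame is rebuilt from the question; `pairs`, `mask` and `deduped` are names chosen here for illustration, and passing `index=df.index` is an added precaution so the mask also aligns when the frame does not have a default index):

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "First": ["A", "B", "A"],
    "Second": ["B", "A", "C"],
    "Value": [0.5, 0.5, 0.2],
})

# Sort each row's pair so that (A, B) and (B, A) become identical
pairs = np.sort(df[["First", "Second"]].to_numpy(), axis=1)

# duplicated() flags second and later occurrences; negate it to keep the first ones
mask = ~pd.DataFrame(pairs, index=df.index).duplicated()

deduped = df[mask].reset_index(drop=True)
print(deduped)
```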

To reset the index, use reset_index:

df.reset_index(drop=True)

  First Second  Value
0     A      B    0.5
1     A      C    0.2

Upvotes: 1

jezrael

Reputation: 862581

Use:

# to select the columns by name
m = ~pd.DataFrame(np.sort(df[['First','Second']], axis=1)).duplicated()
# to select the columns by position
#m = ~pd.DataFrame(np.sort(df.iloc[:,:2], axis=1)).duplicated()
print(m)

0     True
1    False
2     True
dtype: bool

df = df[m]
print(df)

  First Second  Value
0     A      B    0.5
2     A      C    0.2

Upvotes: 1
