Reputation: 3190

Pandas drop duplicates if reverse is present between two columns

I have a dataset with 2 columns like the following...

InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414

I'm using Pandas and I want to drop rows which are present twice but simply reversed like the following... from this...

InteractorA InteractorB
AGAP002741  AGAP008026
AGAP008026  AGAP002741

To this...

InteractorA InteractorB
AGAP002741  AGAP008026

As they are for all intents and purposes the same thing.

Is there a built in method to handle this?

Upvotes: 9

Answers (4)

ilyavs

Reputation: 1

Was looking to solve a similar problem today. The answer by A.Kot put me in the right direction. Below is a working example. Copied the data preparation from the answer by EdChum.

import io
temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep='\s+')

# One liner to drop the duplicates
df.loc[df.apply(lambda x: set(x[['InteractorA', 'InteractorB']]), axis=1).drop_duplicates().index]```

Upvotes: 0

A.Kot

Reputation: 7913

This is the cleanest solution I've managed to make work for my own purposes.

Create a column that has each row combined in a sorted list

df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)]

Can't drop duplicates on a list so that column should be a string

df['sorted_row'] = df['sorted_row'].astype(str)

Drop Duplicates

df.drop_duplicates(subset=['sorted_row'], inplace=True)

Upvotes: 2

BML91

Reputation: 3190

I ended up making a hacky script which iterates over the rows and the necessary pieces of data and checks whether the concatenate appears or if its reverse appears and drops row indexes as appropriate.

import pandas as pd

checklist = []
indexes_to_drop = []

interactions = pd.read_csv('original_interactions.txt', delimiter = '\t')

for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    if (check_string or check_string_rev) in checklist:
        indexes_to_drop.append(index)
    else:
        pass
    checklist.append(check_string)
    checklist.append(check_string_rev)

no_dups = interactions.drop(interactions.index[indexes_to_drop])

print no_dups.shape

no_dups.to_csv('no_duplicates.txt',sep='\t',index = False)

2017 EDIT: a few years on, with a bit more experience, this is a much more elegant solution for anyone looking for something similar:

In [8]: df
Out[8]:
  InteractorA InteractorB
0  AGAP028204  AGAP005846
1  AGAP028204  AGAP003428
2  AGAP028200  AGAP011124
3  AGAP028200  AGAP004335
4  AGAP028200  AGAP011356
5  AGAP028194  AGAP008414
6  AGAP002741  AGAP008026
7  AGAP008026  AGAP002741

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)

In [19]: df
Out[19]:
  InteractorA InteractorB          check_string
0  AGAP028204  AGAP005846  AGAP005846AGAP028204
1  AGAP028204  AGAP003428  AGAP003428AGAP028204
2  AGAP028200  AGAP011124  AGAP011124AGAP028200
3  AGAP028200  AGAP004335  AGAP004335AGAP028200
4  AGAP028200  AGAP011356  AGAP011356AGAP028200
5  AGAP028194  AGAP008414  AGAP008414AGAP028194
6  AGAP002741  AGAP008026  AGAP002741AGAP008026
7  AGAP008026  AGAP002741  AGAP002741AGAP008026

In [20]: df.drop_duplicates('check_string')
Out[20]:
  InteractorA InteractorB          check_string
0  AGAP028204  AGAP005846  AGAP005846AGAP028204
1  AGAP028204  AGAP003428  AGAP003428AGAP028204
2  AGAP028200  AGAP011124  AGAP011124AGAP028200
3  AGAP028200  AGAP004335  AGAP004335AGAP028200
4  AGAP028200  AGAP011356  AGAP011356AGAP028200
5  AGAP028194  AGAP008414  AGAP008414AGAP028194
6  AGAP002741  AGAP008026  AGAP002741AGAP008026

Upvotes: 14

EdChum

Reputation: 394179

I think the following would work:

In [37]:
import pandas as pd
import io
temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep='\s+')
df
Out[37]:
  InteractorA InteractorB
0  AGAP028204  AGAP005846
1  AGAP028204  AGAP003428
2  AGAP028200  AGAP011124
3  AGAP028200  AGAP004335
4  AGAP028200  AGAP011356
5  AGAP028194  AGAP008414
6  AGAP002741  AGAP008026
7  AGAP008026  AGAP002741

So I downloaded your data and misunderstood what you wanted so the following will now work:

# first get the values that are unique
In [72]:
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df1.shape
Out[72]:
(2386, 2)

Now we want to get the duplicated rows but take the first value:

In [74]:

df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby('InteractorA').first().reset_index()
df2.shape
Out[74]:
(3074, 2)

now concat the 2 dataframes:

In [75]:

merged = pd.concat([df1, df2], ignore_index=True)
merged.shape
Out[75]:
(5460, 2)

I think this is now correct.

Upvotes: 0

Pandas drop duplicates if reverse is present between two columns

Answers (4)

Related Questions