Reputation: 3190
I have a dataset with two columns, like the following:
InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
I'm using pandas and want to drop rows that appear twice with the two columns simply swapped. For example, I want to go from this:
InteractorA InteractorB
AGAP002741 AGAP008026
AGAP008026 AGAP002741
To this:
InteractorA InteractorB
AGAP002741 AGAP008026
The two rows are, for all intents and purposes, the same thing.
Is there a built-in method to handle this?
Upvotes: 9
Views: 5285
Reputation: 1
Was looking to solve a similar problem today. The answer by A.Kot put me in the right direction. Below is a working example. Copied the data preparation from the answer by EdChum.
import io
import pandas as pd
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
# One-liner to drop the duplicates: a frozenset ignores order and, unlike a set, is hashable
df.loc[df.apply(lambda x: frozenset(x[['InteractorA', 'InteractorB']]), axis=1).drop_duplicates().index]
Upvotes: 0
Reputation: 7913
This is the cleanest solution I've managed to make work for my own purposes.
Create a column that has each row combined in a sorted list
df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)]
Lists are unhashable, so drop_duplicates cannot work on them; convert that column to a string first
df['sorted_row'] = df['sorted_row'].astype(str)
Drop Duplicates
df.drop_duplicates(subset=['sorted_row'], inplace=True)
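The steps above can also be sketched without the string conversion or the extra column. This is a minimal sketch, assuming the same column names as in the question; the names `raw`, `key`, and `deduped` are made up for illustration. It uses sorted tuples as keys, since tuples are hashable and lists are not.

```python
import io
import pandas as pd

# A few rows from the question, including the reversed pair.
raw = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# One order-independent key per row: the two IDs as a sorted tuple.
key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df["InteractorA"], df["InteractorB"])],
    index=df.index,
)

# Keep the first row seen for each key; df itself gets no helper column.
deduped = df[~key.duplicated()]
print(deduped)
```

Because the key lives in a separate Series, there is nothing to drop from the DataFrame afterwards.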
Upvotes: 2
Reputation: 3190
I ended up writing a hacky script that iterates over the rows, concatenates the two IDs, checks whether that string or its reverse has already been seen, and drops the row indexes as appropriate.
import pandas as pd

checklist = []        # concatenated pairs seen so far, in both orders
indexes_to_drop = []  # index labels of rows whose pair was already seen
interactions = pd.read_csv('original_interactions.txt', delimiter='\t')
for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # test each orientation explicitly
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    else:
        checklist.append(check_string)
        checklist.append(check_string_rev)
no_dups = interactions.drop(indexes_to_drop)
print(no_dups.shape)
no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)
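The script above looks each string up in a list, which gets slower as the list grows. Here is a rough sketch of the same seen-before check using a set of frozenset keys instead. It is plain Python, and the `pairs` list is a hypothetical stand-in for the rows of the interactions file.

```python
# Hypothetical in-memory pairs standing in for the rows of the file.
pairs = [
    ("AGAP028204", "AGAP005846"),
    ("AGAP002741", "AGAP008026"),
    ("AGAP008026", "AGAP002741"),
]

seen = set()  # order-independent keys encountered so far
kept = []     # first occurrence of each unordered pair

for a, b in pairs:
    key = frozenset((a, b))  # identical for (a, b) and (b, a)
    if key in seen:
        continue  # reversed or repeated pair: skip it
    seen.add(key)
    kept.append((a, b))

print(kept)
```

A frozenset key also removes the need to store both concatenation orders.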
2017 EDIT: a few years on, with a bit more experience, this is a much more elegant solution for anyone looking for something similar:
In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)
In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026
In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
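For larger tables, a vectorised variant may be worth trying. The following is only a sketch under assumptions, not part of the session above: it sorts the two IDs within each row with NumPy and then uses `duplicated`, so no joined helper column is needed. The small `df` and the names `ordered`, `is_repeat`, and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one reversed pair (rows 1 and 2).
df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008026", "AGAP002741"],
})

# Sort the two IDs within each row so (a, b) and (b, a) become identical.
ordered = np.sort(df[["InteractorA", "InteractorB"]].to_numpy(), axis=1)

# Flag every row whose sorted pair has already appeared earlier.
is_repeat = pd.DataFrame(ordered, index=df.index).duplicated()

result = df[~is_repeat]
print(result)
```

The original columns keep their values and order; only the later copy of each pair is removed.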
Upvotes: 14
Reputation: 394179
I think the following would work:
In [37]:
import pandas as pd
import io
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df
Out[37]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
After downloading your data I realised I had misunderstood what you wanted, so the following should work instead:
# first get the rows whose InteractorA value never appears in InteractorB
In [72]:
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df1.shape
Out[72]:
(2386, 2)
Now get the remaining rows, whose InteractorA value does appear in InteractorB, and keep only the first row for each InteractorA:
In [74]:
df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby('InteractorA').first().reset_index()
df2.shape
Out[74]:
(3074, 2)
Now concatenate the two DataFrames:
In [75]:
merged = pd.concat([df1, df2], ignore_index=True)
merged.shape
Out[75]:
(5460, 2)
I think this is now correct.
Upvotes: 0