Reputation: 23
Can anyone suggest a good solution to remove reversed duplicates from a data frame?
My data looks like this, where first and second columns are reversed duplicates.
TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.049999999999999e-104 377
TRINITY_DN16813_c0_g1_i4 TRINITY_DN16813_c0_g1_i3 96.104 231 9 0 190 420 429 199 2.979999999999999e-104 377
I need to keep only one row, where third column has the higher value
TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.049999999999999e-104 377
This the results when I use series.isin().
TRINITY_DN28139_c0_g1_i2 TRINITY_DN28139_c0_g1_i5 99.971 3465 1 0 1 3465 1 3465 0.0 6394
TRINITY_DN28139_c0_g1_i5 TRINITY_DN28139_c0_g1_i2 99.971 3465 1 0 1 3465 1 3465 0.0 6394
TRINITY_DN25313_c0_g1_i6 TRINITY_DN25313_c0_g1_i5 99.97 3315 1 0 1 3315 1 3315 0.0 6117
TRINITY_DN25313_c0_g1_i5 TRINITY_DN25313_c0_g1_i6 99.97 3315 1 0 1 3315 1 3315 0.0 6117
TRINITY_DN25502_c0_g1_i3 TRINITY_DN25502_c0_g1_i4 99.96799999999999 3078 1 0 1 3078 1 3078 0.0 5679
TRINITY_DN25502_c0_g1_i4 TRINITY_DN25502_c0_g1_i3 99.96799999999999 3078 1 0 1 3078 1 3078 0.0 5679
TRINITY_DN28726_c0_g1_i2 TRINITY_DN28726_c0_g1_i1 99.96600000000001 5805 2 0 1 5805 1 5805 0.0 10709
TRINITY_DN28726_c0_g1_i1 TRINITY_DN28726_c0_g1_i2 99.96600000000001 5805 2 0 1 5805 1 5805 0.0 10709
TRINITY_DN27942_c0_g1_i7 TRINITY_DN27942_c0_g1_i6 99.964 2760 1 0 1 2760 1 2760 0.0 5092
TRINITY_DN25118_c0_g1_i1 TRINITY_DN25118_c0_g1_i2 99.964 2770 1 0 81 2850 204 2973 0.0 5110
TRINITY_DN27942_c0_g1_i6 TRINITY_DN27942_c0_g1_i7 99.964 2760 1 0 1 2760 1 2760 0.0 5092
TRINITY_DN25118_c0_g1_i2 TRINITY_DN25118_c0_g1_i1 99.964 2770 1 0 204 2973 81 2850 0.0 5110
TRINITY_DN28502_c1_g1_i9 TRINITY_DN28502_c1_g1_i7 99.963 2678 1 0 1928 4605 2021 4698 0.0 4940
TRINITY_DN28502_c1_g1_i7 TRINITY_DN28502_c1_g1_i9 99.963 2678 1 0 2021 4698 1928 4605 0.0 4940
TRINITY_DN25619_c0_g1_i1 TRINITY_DN25619_c0_g1_i8 99.963 2715 1 0 1 2715 1 2715 0.0 5009
TRINITY_DN25619_c0_g1_i8 TRINITY_DN25619_c0_g1_i1 99.963 2715 1 0 1 2715 1 2715 0.0 5009
TRINITY_DN23022_c0_g1_i5 TRINITY_DN23022_c0_g1_i1 99.962 2622 1 0 1 2622 1 2622 0.0 4837
Upvotes: 2
Views: 363
Reputation: 2071
Try this one. It's completely in pandas (should be faster) This also corrects bugs in my previous answer but the concept of taking the labels as a pair remains the same.
In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)
Get only max values per duplicated result:
In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]
If you need the names to be in separate columns:
In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
Upvotes: 1
Reputation: 2071
The problem is that labels in column 0 and column 1 must be taken as a pair so an isin
alone would not work
First, a list of label pairs is needed to compare to (forward
in the code). Given that (a,b)
is the same as (b,a)
, all instances will just be replaced by (a,b)
Then all labels that are duplicated are renamed in the order a,b
even if the higher row is b,a
. This is necessary to do the grouping step later.
In [293]: df['pair'] = df[[0, 1]].apply(l, axis=1)
Then to account for the value of column 2 (third column from left), the original data is grouped and the min of the group is kept. This will be the rows to be removed.
In [297]: dfi = df.set_index(['pair',2])
In [298]: to_drop = df.groupby([0,1])[2].min().reset_index().set_index([0,1,2]).index
In [299]: dfi['drop'] = dfi.index.isin(to_drop)
In [300]: dfr = dfi.reset_index()
Rows are dropped by the index number where the 'drop' column is True. The temporary 'drop' column is also removed.
In [301]: df_dropped = dfr.drop(np.where(dfr['drop'])[0], axis=0).drop('drop', axis=1)
In [302]: df_dropped
Out[302]:
0 1 2 3 4 5 6 7 8 9 10 11
0 TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.050000e-104 377
Upvotes: 0
Reputation: 75100
Use series.isin()
to find same entries in both columns and drop duplicates:
df=df.sort_values('col3',ascending=False)
df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]
Where col1
is the first column and col2
is the second
Output:
0 TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.49 228 8 0 202 429 417 190 0.00 377
Upvotes: 1