aboozar soorni

Reputation: 23

Remove reversed duplicates from a data frame

Can anyone suggest a good solution to remove reversed duplicates from a data frame?

My data looks like this, where the first and second columns are reversed duplicates of each other.

TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.491  228 8   0   202 429 417 190 3.049999999999999e-104  377
TRINITY_DN16813_c0_g1_i4    TRINITY_DN16813_c0_g1_i3    96.104  231 9   0   190 420 429 199 2.979999999999999e-104  377

I need to keep only one row per pair: the one where the third column has the higher value.

TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.491  228 8   0   202 429 417 190 3.049999999999999e-104  377
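
For reference, a minimal sketch of loading this tab-separated data into pandas (the header-less read gives the columns integer labels 0-11, which is how the answers below refer to them):

import io
import pandas as pd

# The two sample rows above, tab-separated, no header row
data = (
    "TRINITY_DN16813_c0_g1_i3\tTRINITY_DN16813_c0_g1_i4\t96.491\t228\t8\t0\t202\t429\t417\t190\t3.049999999999999e-104\t377\n"
    "TRINITY_DN16813_c0_g1_i4\tTRINITY_DN16813_c0_g1_i3\t96.104\t231\t9\t0\t190\t420\t429\t199\t2.979999999999999e-104\t377\n"
)

df = pd.read_csv(io.StringIO(data), sep="\t", header=None)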

These are the results when I use series.isin(); the reversed duplicates are still present.

TRINITY_DN28139_c0_g1_i2    TRINITY_DN28139_c0_g1_i5    99.971  3465    1   0   1   3465    1   3465    0.0 6394
TRINITY_DN28139_c0_g1_i5    TRINITY_DN28139_c0_g1_i2    99.971  3465    1   0   1   3465    1   3465    0.0 6394
TRINITY_DN25313_c0_g1_i6    TRINITY_DN25313_c0_g1_i5    99.97   3315    1   0   1   3315    1   3315    0.0 6117
TRINITY_DN25313_c0_g1_i5    TRINITY_DN25313_c0_g1_i6    99.97   3315    1   0   1   3315    1   3315    0.0 6117
TRINITY_DN25502_c0_g1_i3    TRINITY_DN25502_c0_g1_i4    99.96799999999999   3078    1   0   1   3078    1   3078    0.0 5679
TRINITY_DN25502_c0_g1_i4    TRINITY_DN25502_c0_g1_i3    99.96799999999999   3078    1   0   1   3078    1   3078    0.0 5679
TRINITY_DN28726_c0_g1_i2    TRINITY_DN28726_c0_g1_i1    99.96600000000001   5805    2   0   1   5805    1   5805    0.0 10709
TRINITY_DN28726_c0_g1_i1    TRINITY_DN28726_c0_g1_i2    99.96600000000001   5805    2   0   1   5805    1   5805    0.0 10709
TRINITY_DN27942_c0_g1_i7    TRINITY_DN27942_c0_g1_i6    99.964  2760    1   0   1   2760    1   2760    0.0 5092
TRINITY_DN25118_c0_g1_i1    TRINITY_DN25118_c0_g1_i2    99.964  2770    1   0   81  2850    204 2973    0.0 5110
TRINITY_DN27942_c0_g1_i6    TRINITY_DN27942_c0_g1_i7    99.964  2760    1   0   1   2760    1   2760    0.0 5092
TRINITY_DN25118_c0_g1_i2    TRINITY_DN25118_c0_g1_i1    99.964  2770    1   0   204 2973    81  2850    0.0 5110
TRINITY_DN28502_c1_g1_i9    TRINITY_DN28502_c1_g1_i7    99.963  2678    1   0   1928    4605    2021    4698    0.0 4940
TRINITY_DN28502_c1_g1_i7    TRINITY_DN28502_c1_g1_i9    99.963  2678    1   0   2021    4698    1928    4605    0.0 4940
TRINITY_DN25619_c0_g1_i1    TRINITY_DN25619_c0_g1_i8    99.963  2715    1   0   1   2715    1   2715    0.0 5009
TRINITY_DN25619_c0_g1_i8    TRINITY_DN25619_c0_g1_i1    99.963  2715    1   0   1   2715    1   2715    0.0 5009
TRINITY_DN23022_c0_g1_i5    TRINITY_DN23022_c0_g1_i1    99.962  2622    1   0   1   2622    1   2622    0.0 4837
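
A likely reason a plain membership test cannot help here (a sketch, using df as loaded above): in reciprocal-hit data like this, every accession in the first column also appears somewhere in the second column, so the isin mask is True for every row and filters nothing on its own.

# Membership alone is not pair-aware: with reversed duplicates present,
# every value in column 0 also occurs in column 1, so this mask is all True
mask = df[0].isin(df[1])
print(mask.all())  # True for the sample data, so df[mask] drops nothing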

Upvotes: 2

Views: 363

Answers (3)

kentwait

Reputation: 2071

Try this one. It's completely in pandas (so it should be faster). It also corrects bugs in my previous answer, but the concept of taking the labels as a pair remains the same.

In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

Keep only the row with the max value in column 2 for each pair (idxmax returns the row label of each group's maximum, so .loc recovers the full rows):

In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]

If you need the names to be in separate columns:

In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
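
Put together, a self-contained sketch of the same approach (df with integer column labels as in the question; .str.split here is just an equivalent to the transform calls above):

# Canonical key: the two labels joined in sorted order, so that
# (a, b) and (b, a) map to the same string
df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

# Keep the row with the highest value in column 2 for each pair
dfd = df.loc[df.groupby('pair')[2].idxmax()].copy()  # copy avoids SettingWithCopyWarning

# Split the key back into two name columns (the labels contain no '-')
dfd[0] = dfd['pair'].str.split('-').str[0]
dfd[1] = dfd['pair'].str.split('-').str[1]

Note that because the key is sorted, the rebuilt columns 0 and 1 come out in alphabetical order, which may swap them relative to the surviving row.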

Upvotes: 1

kentwait

Reputation: 2071

The problem is that the labels in column 0 and column 1 must be taken as a pair, so an isin alone will not work.

First, a canonical key is needed for each label pair (l in the code below). Given that (a,b) is the same as (b,a), every pair is rewritten in sorted order as (a,b).

All duplicated pairs are thus renamed to the order a,b even if the higher-scoring row is b,a. This is necessary for the grouping step later.

In [292]: l = lambda x: '{}-{}'.format(*sorted((x[0], x[1])))  # canonical 'a-b' key

In [293]: df['pair'] = df[[0, 1]].apply(l, axis=1)

Then, to account for the value of column 2 (the third column from the left), the data is grouped by pair and the minimum of each group is taken. These are the rows to be removed.

In [296]: import numpy as np

In [297]: dfi = df.set_index(['pair', 2])

In [298]: to_drop = df.groupby('pair')[2].min().reset_index().set_index(['pair', 2]).index

In [299]: dfi['drop'] = dfi.index.isin(to_drop)

In [300]: dfr = dfi.reset_index()

Rows where the 'drop' column is True are dropped by index position, and the temporary 'drop' column is removed.

In [301]: df_dropped = dfr.drop(np.where(dfr['drop'])[0], axis=0).drop('drop', axis=1)

In [302]: df_dropped
Out[302]:
                         0                         1       2    3   4   5    6    7    8    9              10   11
0  TRINITY_DN16813_c0_g1_i3  TRINITY_DN16813_c0_g1_i4  96.491  228   8   0  202  429  417  190  3.050000e-104  377
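
The same drop can also be written as a boolean mask, which avoids indexing through np.where (a sketch; dfr as built in In [300]):

# Keep the rows where 'drop' is False, then discard the helper column
df_dropped = dfr[~dfr['drop']].drop('drop', axis=1)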

Upvotes: 0

anky

Reputation: 75100

Use series.isin() to find the entries that appear in both columns, then drop the duplicates:

# Sort best score first so drop_duplicates keeps the higher col3 value
df = df.sort_values('col3', ascending=False)
df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]

Here col1 is the first column and col2 is the second.

Output:

0   TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.49   228 8   0   202 429 417 190 0.00    377
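
A variant of the same idea that keys on the unordered pair explicitly (a sketch, not part of the original answer) is to hash each row's two labels as a frozenset, so (a, b) and (b, a) collide no matter the order:

# Best score first, so duplicated() keeps the higher-scoring orientation
df = df.sort_values('col3', ascending=False)
dedup = df[~df[['col1', 'col2']].apply(frozenset, axis=1).duplicated()]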

Upvotes: 1
