Remove duplicate rows

Question

Already this question is answered here ,but could not make it work.

I have a data frame here,interested to remove the duplicate rows based on SYMBOL. Checking the column CALL to remove the duplicates.The priority is P>A>M.If P,A,M keep P, if A,M, keep A, otherwise M.

      SYMBOL  INTENSITY  CALL
1     DDR1    596.95050    P
2     RFC2    420.28708    P
3     HSPA6   510.73254    P
4     DDR1   1717.99487    A
5     GUCA1A  121.53488    A
6     UBA7   1810.49780    P
7     UBA7    301.51944    M
8     GUCA1A   34.53987    A
9     CCL5   5966.24609    P
10    CYP2E1   95.15707    A
11    CYP2E1  164.95276    M
12    ESRRA  1024.88745    P
13    CYP2A6  502.48877    A
14    GAS6    921.70923    P
15    MMP14   524.96863    A
16    GAS6   3069.48462    P
17    FNTB    266.77686    A
18    PLD1    187.65569    A
19    PLD1   1891.04541    P
20    PLD1    258.79028    M

I tried the code which i found Here

library(data.table)
setDT(df)[, list(CALL=CALL[which.min(factor(CALL, levels=c('P', 'A', 'M')))]),
                                   .(SYMBOL)]

But i removes the second column INTENSITY. Any help, please make sure the code is fastest also. Thanks

Expected Output

          SYMBOL  INTENSITY  CALL
    1     DDR1    596.95050    P
    2     RFC2    420.28708    P
    3     HSPA6   510.73254    P
    5     GUCA1A  121.53488    A
    6     UBA7   1810.49780    P
    9     CCL5   5966.24609    P
    10    CYP2E1   95.15707    A
    12    ESRRA  1024.88745    P
    13    CYP2A6  502.48877    A
    14    GAS6    921.70923    P
    15    MMP14   524.96863    A
    17    FNTB    266.77686    A
    19    PLD1   1891.04541    P

akrun · Accepted Answer

You can either use order (in the ith position) to order the "CALL" column by converting to factor with the levels specified in the correct order, and subset the first observation (.SD[1L]), grouped by 'SYMBOL'

library(data.table)
setDT(df)[order(factor(CALL, levels=c('P', 'A', 'M'))),
                                 .SD[1L], by = SYMBOL]

Or just modifying your code, instead of list(CALL=.., we can use .SD to subset the rows.

setDT(df)[, .SD[which.min(factor(CALL, levels=c('P', 'A', 'M')))], .(SYMBOL)]

An option using dplyr is

library(dplyr)
df %>%
    group_by(SYMBOL) %>%
    arrange(factor(CALL, levels=c('P', 'A', 'M'))) %>%
    slice(1L)

Or use which.min within slice

df %>%
    group_by(SYMBOL) %>%
    slice(which.min(factor(CALL, levels=c('P', 'A', 'M'))))

Remove duplicate rows

Answers (1)

Related Questions