Reputation: 327
Already this question is answered here ,but could not make it work.
I have a data frame here,interested to remove the duplicate rows based on SYMBOL
. Checking the column CALL
to remove the duplicates.The priority is P>A>M
.If P,A,M
keep P
, if A,M
, keep A
, otherwise M
.
SYMBOL INTENSITY CALL
1 DDR1 596.95050 P
2 RFC2 420.28708 P
3 HSPA6 510.73254 P
4 DDR1 1717.99487 A
5 GUCA1A 121.53488 A
6 UBA7 1810.49780 P
7 UBA7 301.51944 M
8 GUCA1A 34.53987 A
9 CCL5 5966.24609 P
10 CYP2E1 95.15707 A
11 CYP2E1 164.95276 M
12 ESRRA 1024.88745 P
13 CYP2A6 502.48877 A
14 GAS6 921.70923 P
15 MMP14 524.96863 A
16 GAS6 3069.48462 P
17 FNTB 266.77686 A
18 PLD1 187.65569 A
19 PLD1 1891.04541 P
20 PLD1 258.79028 M
I tried the code which i found Here
library(data.table)
setDT(df)[, list(CALL=CALL[which.min(factor(CALL, levels=c('P', 'A', 'M')))]),
.(SYMBOL)]
But i removes the second column INTENSITY
. Any help, please make sure the code is fastest also. Thanks
Expected Output
SYMBOL INTENSITY CALL
1 DDR1 596.95050 P
2 RFC2 420.28708 P
3 HSPA6 510.73254 P
5 GUCA1A 121.53488 A
6 UBA7 1810.49780 P
9 CCL5 5966.24609 P
10 CYP2E1 95.15707 A
12 ESRRA 1024.88745 P
13 CYP2A6 502.48877 A
14 GAS6 921.70923 P
15 MMP14 524.96863 A
17 FNTB 266.77686 A
19 PLD1 1891.04541 P
Upvotes: 1
Views: 170
Reputation: 886938
You can either use order
(in the i
th position) to order the "CALL" column by converting to factor
with the levels
specified in the correct order, and subset the first observation (.SD[1L]
), grouped by 'SYMBOL'
library(data.table)
setDT(df)[order(factor(CALL, levels=c('P', 'A', 'M'))),
.SD[1L], by = SYMBOL]
Or just modifying your code, instead of list(CALL=..
, we can use .SD
to subset the rows.
setDT(df)[, .SD[which.min(factor(CALL, levels=c('P', 'A', 'M')))], .(SYMBOL)]
An option using dplyr
is
library(dplyr)
df %>%
group_by(SYMBOL) %>%
arrange(factor(CALL, levels=c('P', 'A', 'M'))) %>%
slice(1L)
Or use which.min
within slice
df %>%
group_by(SYMBOL) %>%
slice(which.min(factor(CALL, levels=c('P', 'A', 'M'))))
Upvotes: 4