R: Conditionally remove duplicated rows

Question

I have a large ordered data frame in R that contains two columns, "gene" and "rank". For each duplicated gene, I would like to remove the second occurrence when the rank is positive and the first occurrence when the rank is negative.

For example, in the following data set, I would like to remove rows 6262 and 12200, so that for each gene I keep the row with the most extreme rank value in either direction:

> head(a_ordered, n=10)
              gene     rank
10597      SLC37A1  2.827330
6262       SLC37A1  2.700882
12504         UBR4  2.698938
10934        SP110  2.690130
1563         CALB1  2.633179
6031  LOC100128239 -2.499959
6718        MBTPS1 -2.513134
1528          CA14 -2.549553
12200        NXPE3 -2.850011
7978         NXPE3 -3.153175

Becomes,

> head(a_ordered, n=10)
              gene     rank
10597      SLC37A1  2.827330
12504         UBR4  2.698938
10934        SP110  2.690130
1563         CALB1  2.633179
6031  LOC100128239 -2.499959
6718        MBTPS1 -2.513134
1528          CA14 -2.549553
7978         NXPE3 -3.153175

Thank you!

aichao · Accepted Answer

You can group by gene and use dplyr::filter to keep, within each gene, the rows for which abs(rank) == max(abs(rank)):

library(dplyr)
res <- df %>% group_by(gene) %>% filter(abs(rank) == max(abs(rank)))
##Source: local data frame [8 x 3]
##Groups: gene [8]
##
##     ID         gene      rank
##  <int>        <chr>     <dbl>
##1 10597      SLC37A1  2.827330
##2 12504         UBR4  2.698938
##3 10934        SP110  2.690130
##4  1563        CALB1  2.633179
##5  6031 LOC100128239 -2.499959
##6  6718       MBTPS1 -2.513134
##7  1528         CA14 -2.549553
##8  7978        NXPE3 -3.153175

Data:

df <- structure(list(ID = c(10597L, 6262L, 12504L, 10934L, 1563L, 6031L, 
6718L, 1528L, 12200L, 7978L), gene = c("SLC37A1", "SLC37A1", 
"UBR4", "SP110", "CALB1", "LOC100128239", "MBTPS1", "CA14", "NXPE3", 
"NXPE3"), rank = c(2.82733, 2.700882, 2.698938, 2.69013, 2.633179, 
-2.499959, -2.513134, -2.549553, -2.850011, -3.153175)), .Names = c("ID", 
"gene", "rank"), class = "data.frame", row.names = c(NA, -10L
))
##      ID         gene      rank
##1  10597      SLC37A1  2.827330
##2   6262      SLC37A1  2.700882
##3  12504         UBR4  2.698938
##4  10934        SP110  2.690130
##5   1563        CALB1  2.633179
##6   6031 LOC100128239 -2.499959
##7   6718       MBTPS1 -2.513134
##8   1528         CA14 -2.549553
##9  12200        NXPE3 -2.850011
##10  7978        NXPE3 -3.153175
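
For comparison, here is a minimal base R sketch of the same idea that needs no packages. It is not part of the original answer: the names max_abs, res_base, ord and res_unique are made up for illustration, and the sample data is rebuilt with data.frame() only so the block runs on its own. The second variant assumes you want exactly one row per gene even if two rows of a gene tie on abs(rank), which the filter approach would keep both of.

```r
# Same sample data as above, rebuilt so this block is self-contained
df <- data.frame(
  ID   = c(10597L, 6262L, 12504L, 10934L, 1563L, 6031L, 6718L, 1528L, 12200L, 7978L),
  gene = c("SLC37A1", "SLC37A1", "UBR4", "SP110", "CALB1",
           "LOC100128239", "MBTPS1", "CA14", "NXPE3", "NXPE3"),
  rank = c(2.82733, 2.700882, 2.698938, 2.69013, 2.633179,
           -2.499959, -2.513134, -2.549553, -2.850011, -3.153175),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the largest |rank| within that row's gene
max_abs <- ave(abs(df$rank), df$gene, FUN = max)

# Keep rows whose |rank| equals the per-gene maximum (tied rows are all kept)
res_base <- df[abs(df$rank) == max_abs, ]

# Variant (assumption: at most one row per gene is wanted, even on ties):
# order rows by descending |rank|, keep the first row seen for each gene,
# then sort the surviving row indices to restore the original row order
ord <- order(-abs(df$rank))
res_unique <- df[sort(ord[!duplicated(df$gene[ord])]), ]
```

On the sample data both variants keep the same eight rows as the dplyr result above; they would differ only if a gene had two rows with identical abs(rank).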
