Reputation: 63
I have a large ordered data frame in R that contains two columns, "gene" and "rank", and I would like to remove the second occurrence of the duplicated row in the positive direction AND the first occurrence of the duplicated row in the negative direction.
For example, in the following data set, I would like to remove lines 6262 and 12200, such that I keep the genes with the greatest rank value in either direction:
> head(a_ordered, n=10)
gene rank
10597 SLC37A1 2.827330
6262 SLC37A1 2.700882
12504 UBR4 2.698938
10934 SP110 2.690130
1563 CALB1 2.633179
6031 LOC100128239 -2.499959
6718 MBTPS1 -2.513134
1528 CA14 -2.549553
12200 NXPE3 -2.850011
7978 NXPE3 -3.153175
Becomes,
> head(a_ordered, n=10)
gene rank
10597 SLC37A1 2.827330
12504 UBR4 2.698938
10934 SP110 2.690130
1563 CALB1 2.633179
6031 LOC100128239 -2.499959
6718 MBTPS1 -2.513134
1528 CA14 -2.549553
7978 NXPE3 -3.153175
Thank you!
Upvotes: 2
Views: 782
Reputation: 7435
You can use dplyr::filter
to keep those rows for which abs(rank) == max(abs(rank))
grouped by gene
:
library(dplyr)
res <- df %>% group_by(gene) %>% filter(abs(rank) == max(abs(rank)))
##Source: local data frame [8 x 3]
##Groups: gene [8]
##
## ID gene rank
## <int> <chr> <dbl>
##1 10597 SLC37A1 2.827330
##2 12504 UBR4 2.698938
##3 10934 SP110 2.690130
##4 1563 CALB1 2.633179
##5 6031 LOC100128239 -2.499959
##6 6718 MBTPS1 -2.513134
##7 1528 CA14 -2.549553
##8 7978 NXPE3 -3.153175
Data:
df <- structure(list(ID = c(10597L, 6262L, 12504L, 10934L, 1563L, 6031L,
6718L, 1528L, 12200L, 7978L), gene = c("SLC37A1", "SLC37A1",
"UBR4", "SP110", "CALB1", "LOC100128239", "MBTPS1", "CA14", "NXPE3",
"NXPE3"), rank = c(2.82733, 2.700882, 2.698938, 2.69013, 2.633179,
-2.499959, -2.513134, -2.549553, -2.850011, -3.153175)), .Names = c("ID",
"gene", "rank"), class = "data.frame", row.names = c(NA, -10L
))
## ID gene rank
##1 10597 SLC37A1 2.827330
##2 6262 SLC37A1 2.700882
##3 12504 UBR4 2.698938
##4 10934 SP110 2.690130
##5 1563 CALB1 2.633179
##6 6031 LOC100128239 -2.499959
##7 6718 MBTPS1 -2.513134
##8 1528 CA14 -2.549553
##9 12200 NXPE3 -2.850011
##10 7978 NXPE3 -3.153175
Upvotes: 2
Reputation: 12410
If you want to keep the values that are most different from zero, you can use the squared value and then order the data frame accordingly
a_ordered$rank2 <- a_ordered$rank^2
a_ordered <- a_ordered[order(a_ordered$rank2, decreasing = TRUE),]
a_ordered <- a_ordered[-(duplicated(a_ordered$gene)==TRUE),]
Duplicated will go trough the column from top to bottom and flag the second (and third and so on) value as duplicate.
Upvotes: 0