I have a data frame called mydf and a vector myvec <- c("chr5:11", "chr3:112", "chr22:334"). For every element of myvec that matches a key in mydf, I want to select the matching row plus the rows whose positions are up to 3 below and up to 3 above it, and collect them in a subset of mydf (result).
For example, chr5:11 in myvec matches a key in mydf, so result should contain the rows with keys chr5:8 (three below) through chr5:14 (three above).
mydf<- structure(list(key = structure(c(5L, 2L, 7L, 8L, 4L, 1L, 6L,
3L, 11L, 10L, 9L), .Names = c("34", "35", "36", "37", "38", "39",
"40", "41", "42", "43", "44"), .Label = c("chr5:10", "chr5:11",
"chr5:1123", "chr5:118", "chr5:12", "chr5:123", "chr5:13", "chr5:14",
"chr5:19", "chr5:8", "chr5:9"), class = "factor"), variantId = structure(1:11, .Names = c("34",
"35", "36", "37", "38", "39", "40", "41", "42", "43", "44"), .Label = c("9920068",
"9920069", "9920070", "9920071", "9920072", "9920073", "9920074",
"9920075", "9920076", "9920077", "9920078"), class = "factor")), .Names = c("key",
"variantId"), row.names = c("34", "35", "36", "37", "38", "39",
"40", "41", "42", "43", "44"), class = "data.frame")
result
key variant
43 "chr5:8" "9920077"
42 "chr5:9" "9920076"
39 "chr5:10" "9920073"
35 "chr5:11" "9920069"
34 "chr5:12" "9920068"
36 "chr5:13" "9920070"
37 "chr5:14" "9920071"
Upvotes: 0
How about the following? (I use data.table, but the base version is almost the same.)
library(data.table)
mydf <- as.data.table(mydf) #(if mydf really is stored as a matrix currently)
myvec2 <- lapply(strsplit(gsub("chr", "", myvec), split = ":"), as.integer)
mydf[unique(Reduce(c, sapply(myvec2, function(x) {
  which(key %in% paste0("chr", x[1], ":", seq((x2 <- x[2]) - 3L, x2 + 3L)))
}))), ]
(In base R, replace as.data.table with as.data.frame and key with mydf$key. The comma in the closing , ] is required there, because a base data frame indexed without it selects columns rather than rows.)
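For readers who want to stay in base R, here is a minimal sketch of the same windowing idea. It is not a line-by-line translation of the data.table code above: it compares positions with abs() instead of building key strings, and it rebuilds the question's data with plain character columns rather than factors. The names chr, pos, qchr, qpos, window and keep are made up for illustration.

```r
# Rebuild the question's data (same rows and row names, character columns)
mydf <- data.frame(
  key = paste0("chr5:", c(12, 11, 13, 14, 118, 10, 123, 1123, 9, 8, 19)),
  variantId = as.character(9920068:9920078),
  row.names = as.character(34:44),
  stringsAsFactors = FALSE
)
myvec <- c("chr5:11", "chr3:112", "chr22:334")

# Split keys and queries into a chromosome label and an integer position
chr  <- sub(":.*", "", mydf$key)
pos  <- as.integer(sub(".*:", "", mydf$key))
qchr <- sub(":.*", "", myvec)
qpos <- as.integer(sub(".*:", "", myvec))

# Keep a row if it is on the same chromosome as any query
# and its position lies within +/- window of that query
window <- 3L
keep <- vapply(seq_along(pos), function(i) {
  any(chr[i] == qchr & abs(pos[i] - qpos) <= window)
}, logical(1))

# Subset, then sort by position so the rows run from chr5:8 to chr5:14
result <- mydf[keep, ]
result <- result[order(pos[keep]), ]
result
```

Because each row is tested against all queries at once, a row that falls inside several windows is still returned only once.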
Actually, I think the following option is better in general, since it stores your information in a more pliable way in the first place. This version leans a bit more heavily on data.table idiom.
mydf <- as.data.table(mydf)
#Split your `key` variable into its pre- and post-colon components
# (of course using better names if those numbers mean something
# more specific to you)
mydf[ , c("chr", "sub") :=
        .(as.integer(gsub("chr|:.*", "", key)),
          as.integer(gsub(".*:", "", key)))]
Now, proceeding much as before with a slight tweak:
myvec2 <- lapply(strsplit(gsub("chr", "", myvec), split = ":"), as.integer)
mydf[unique(Reduce(c, sapply(myvec2, function(x) {
  which(chr == x[1] & sub %in% seq((x2 <- x[2]) - 3L, x2 + 3L))
})))][order(chr, sub)]
Outputs:
key variantId chr sub
1: chr5:8 9920077 5 8
2: chr5:9 9920076 5 9
3: chr5:10 9920073 5 10
4: chr5:11 9920069 5 11
5: chr5:12 9920068 5 12
6: chr5:13 9920070 5 13
7: chr5:14 9920071 5 14
Upvotes: 3
You can use the GenomicRanges package.
library(GenomicRanges)
myvec <- c("chr5:11", "chr3:112", "chr22:334")
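# Windows of +/- 3 around each position in myvec.
# (Adding 3 to a whole IRanges object widens it on both sides,
# so the window is built directly from pos - 3 and pos + 3.)
myvec.pos <- as.numeric(gsub(".+:", "", myvec))
myvec.gr <- GRanges(gsub(":.+", "", myvec),
                    IRanges(myvec.pos - 3, myvec.pos + 3))
# Width-1 ranges, one per key in mydf
mydf.pos <- as.numeric(gsub(".+:", "", mydf[, "key"]))
mydf.gr <- GRanges(gsub(":.+", "", mydf[, "key"]),
                   IRanges(mydf.pos, mydf.pos))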
d.v.op <- findOverlaps(mydf.gr, myvec.gr)
mydf[queryHits(d.v.op), ]
# key variantId
# 34 "chr5:12" "9920068"
# 35 "chr5:11" "9920069"
# 36 "chr5:13" "9920070"
# 37 "chr5:14" "9920071"
# 39 "chr5:10" "9920073"
# 42 "chr5:9" "9920076"
# 43 "chr5:8" "9920077"
Upvotes: 2