Reputation: 37
I have a table of ranges (start, stop), which looks something like this:
ID | start | stop |
---|---|---|
x1 | 351525 | 352525 |
x2 | 136790 | 136990 |
x3 | 74539 | 74739 |
x4 | 478181 | 478381 |
... | ... | ... |
I also have a vector of positions.
The data can be simulated with:
s=round(runif(50,0,500000),0)
# ranges:
# (+200 is random, the difference may be more or less than that, but stop is always higher than start)
# data.frame rather than cbind, so that start and stop stay numeric
ranges=data.frame(ID=paste0("x",1:50), start=s, stop=s+200)
# positions
pos=round(runif(5000,0,500000),0)
I want to select all IDs which have at least one position within their range.
I could loop through ranges and pos:
library(dplyr)
selected.IDs <- c()
for(r in 1:nrow(ranges)){
for(p in 1:length(pos)){
if(between(pos[p],left = ranges[r,2], right = ranges[r,3])){
selected.IDs <- append(selected.IDs, ranges[r,1])
break
} else{next}
}
}
That works fine (I think). However, the 'ranges' object has 83,000 rows and there are 180,000 positions, so looping through all of them takes a long time.
Does anyone have an idea how to do this without a loop?
Thanks
Upvotes: 2
Views: 81
Reputation: 561
I usually do this using overlap joins with data.table::foverlaps.
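library(data.table)
s <- round(runif(50,0,500000),0)
# ranges:
# (+200 is random, the difference may be more or less than that, but stop is always higher than start)
ranges <- data.table(ID=paste0("x",1:50), start=s, stop=s+200)
# positions, stored as zero-width intervals (start == stop);
# foverlaps treats intervals as closed, so stop = pos + 1 would also match ranges starting at pos + 1
pos <- round(runif(5000,0,500000),0)
pos <- data.table(start = pos, stop = pos)
setkey(pos, start, stop)
setkey(ranges, start, stop)
# overlap join: one row per (range, position) pair that overlaps
res <- foverlaps(ranges, pos, nomatch = 0)
# an ID appears once per matching position, so deduplicate
selected.IDs <- unique(res$ID)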
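If you would rather not depend on data.table, the same selection can probably be done in base R with findInterval. This is only a sketch of a different technique (counting sorted positions), not what foverlaps does internally. The toy values below are made up so the result can be checked by hand, and the names sorted_pos, n_le_stop and n_lt_start are just illustrative. Both ends of a range are treated as inclusive, like dplyr::between.

```r
# Hypothetical toy data, small enough to verify by hand
ranges <- data.frame(ID = c("x1", "x2", "x3"),
                     start = c(100, 500, 900),
                     stop  = c(300, 700, 1100),
                     stringsAsFactors = FALSE)
pos <- c(50, 300, 800, 1101)

# findInterval needs the positions sorted in increasing order
sorted_pos <- sort(pos)

# For each range: how many positions are <= stop
n_le_stop <- findInterval(ranges$stop, sorted_pos)
# For each range: how many positions are strictly < start
n_lt_start <- findInterval(ranges$start, sorted_pos, left.open = TRUE)

# A range contains at least one position when the two counts differ
selected.IDs <- ranges$ID[n_le_stop > n_lt_start]
selected.IDs
# [1] "x1"
```

Only position 300 falls inside a range (it sits on the stop of x1), so x1 is the only ID returned. The cost is one sort of pos plus two binary searches per range, so it should scale to 83,000 ranges and 180,000 positions without an explicit loop.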
Upvotes: 2