Reputation: 557
I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 47286550
and a 1,400,000 x 3 numeric matrix
>head(subscores)
        V1       V2     V3
1 47286730 47286725  0.830
2 47286740 47286791  0.065
3 47286750 47286806 -0.165
4 47288371 47288427  0.760
5 47288841 47288890  0.285
6 47288896 47288945  0.225
What I am trying to accomplish: for each number in coordvec, find the average of V3 over the rows of subscores whose V1 and V2 encompass that number. To do that, I am taking the following approach:
results <- numeric(length(coordvec))
for (i in 1:length(coordvec)) {
  select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
  scores_subset <- subscores[select_rows, 3]
  results[i] <- mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan
Upvotes: 4
Views: 1514
Reputation: 5898
Something like this might be faster:
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
Note the use of list(cond)[[1]] rather than plain cond, because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." (source: ?data.table)
Upvotes: 2
Reputation: 46866
I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful).
library(IRanges)
Create width-1 ranges representing the coordinate vector, and a set of ranges representing the scores. I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same:
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject:
system.time({
  olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps), so maybe this step requires a bit of refinement.
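For instance, one possible refinement (my assumption about the intended semantics, not from the answer) is to require each width-1 query to fall within a subject range; note that findOverlaps is inclusive at the boundaries, whereas the question used strict < and >:

## type = "within" requires the query range to be contained in the subject
olaps_within <- findOverlaps(q, s, type = "within")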
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000

    queryHits subjectHits
    <integer>   <integer>
  1        19      685913
  2        35      929424
  3        46     1130191
  4        52       37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
  res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so:
require(data.table)
dt <- data.table(coord = start(q)[queryHits(olaps)],
                 score = scores[subjectHits(olaps)])
res1 <- dt[, mean(score), by = coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
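To see the run trick on a toy vector (numbers made up purely for illustration):

qh <- c(1L, 1L, 3L, 3L, 3L, 7L)            # sorted query hits: runs 1,1 | 3,3,3 | 7
sc <- c(0.2, 0.4, 0.1, 0.3, 0.5, 0.6)      # matching subject scores
idx <- c(qh[-1] != qh[-length(qh)], TRUE)  # TRUE at the last element of each run
cumsum(sc)[idx]                            # cumulative score at run ends: 0.6 1.5 2.1
n <- diff(c(0L, seq_along(idx)[idx]))      # run lengths: 2 3 1
xt <- diff(c(0, cumsum(sc)[idx]))          # per-run sums: 0.6 0.9 0.6
xt / n                                     # per-run means: 0.3 0.3 0.6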
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
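If, as in the original loop, a result aligned with the full coordinate vector is wanted, one sketch (my addition, not part of the original answer) is to scatter the means back by query index, leaving NA where no score range overlapped:

## queryHits() indexes into q, which was built from coord in order
results <- rep(NA_real_, length(coord))
results[queryHits(olaps)[idx]] <- res2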
Upvotes: 6
Reputation: 43265
Since your example isn't easily reproducible, and even if it were, none of your subscores meet your boolean condition, I'm not sure this does exactly what you're looking for, but you can use one of the apply family and a function:
myfun <- function(x) {
  y <- subscores[, 1] < x & subscores[, 2] > x
  mean(subscores[y, 3])
}

sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory this will probably speed things up significantly. However, you could also look at the foreach package with similar results. You've got your for loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons. It will be hard to speed this up much.
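A hedged sketch of the mclapply route (the core count here is arbitrary, and mclapply relies on forking, so it is effectively Unix-only):

library(parallel)
## run myfun over coordvec on several cores; assumes myfun and
## subscores are defined as above
results <- unlist(mclapply(coordvec, myfun, mc.cores = 4))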
Upvotes: 0