Reputation: 515
I'm working on a project where I iterate over bins of the mouse genome and compute some overlaps using GenomicRanges
/rtracklayer
. Bins are already computed in 150 base/position increments. For those who are not familiar, this means that for chromosomeY, which happens to be one of the shortest mouse chromosomes, there are approximately 106,000 bins and the larger chromosomes contain about 1,300,000 bins! The goal here is to iterate through the bins, expand the range of the bins 100,000 positions in both directions, then figure out which genes overlap these window. I want to compute the distance from the center of the bin to the start of the genes contained within the expanded bin window.
Okay so here's the code I've written to this point. It runs without error and computes exactly what I need. The problem here is that it is slow and will take forever to iterate through 1M+ bins.
progress <- function(w) {
# just a function that prints out the window being processed
cat(sprintf(paste0(as.character(Sys.time()), ": Window ", w, " completed!\n")))
}
extend <- function(x, upstream=0, downstream=0) {
# this will expand a `GenomicRanges` object range
if (any(strand(x) == "*"))
warning("'*' ranges were treated as '+'")
on_plus <- strand(x) == "+" | strand(x) == "*"
new_start <- start(x) - ifelse(on_plus, upstream, downstream)
new_end <- end(x) + ifelse(on_plus, downstream, upstream)
ranges(x) <- IRanges(new_start, new_end)
trim(x)
}
feature.overlap <- function(x, window, genes, extend.upstream=100000, extend.downstream=100000) {
# # test case
# x = chrY; window = 2668 ; genes = gene; extend.upstream = 100000 ; extend.downstream = 100000
# extend window of signal in both directions
x.window = extend(x[window], extend.upstream, extend.downstream)
names(x.window) <- window
# compute signal window overlap with genes
overlaps <- subsetByOverlaps(genes, x.window)
if(length(overlaps) == 0){
values <- data.frame(signal_window=names(x.window),
signal_start=max(0, start(x.window)),
signal_center=max(0, start(x.window)) + floor((width(x.window) - 1)/2),
signal_end=end(x.window),
signal_score=x.window$score,
symbol=NA,
gene_id=NA,
gene_chr=NA,
gene_start=NA,
gene_end=NA,
gene_strand=NA)
} else {
hits <- findOverlaps(x.window, genes)
s.idx <- unique(subjectHits(hits))
q.idx <- unique(queryHits(hits))
values <- data.frame(signal_window=names(x.window)[q.idx],
signal_start=max(0, start(x.window)[q.idx]),
signal_center=max(0, start(x.window)[q.idx]) + floor((width(x.window)[q.idx] - 1)/2),
signal_end=end(x.window)[q.idx],
signal_score=x.window$score[q.idx],
mcols(overlaps)[,c(2,1)],
gene_chr=chrom(genes)[s.idx],
gene_start=ifelse(strand(genes)[s.idx] == '+', start(genes)[s.idx], end(genes)[s.idx]) ,
gene_end=end(genes)[s.idx],
gene_strand=strand(genes)[s.idx])
}
return(values)
}
# Import data
library(rtracklayer)
merged_wig <- import.wig('~/file/linked/below.wig', format='wig', genome='mm9')
merged_wig <- keepSeqlevels(merged_wig, paste0('chr', c(seq(1,19), 'X', 'Y')))
chrY <- merged_wig[seqnames(merged_wig) == 'chrY']
# Generate gene info needed for computing overlap
library(TxDb.Mmusculus.UCSC.mm9.knownGene); library(Mus.musculus)
gene <- genes(TxDb.Mmusculus.UCSC.mm9.knownGene)
values(gene) <- merge(values(gene), as.data.frame(org.Mm.egSYMBOL), by='gene_id', all.x=T)
gene <- keepSeqlevels(gene, paste0('chr', c(seq(1,19), 'X', 'Y')))
# BEGIN LOOP GENOME WINDOWS *** TIME CONSUMING ***
window.overlaps <- list()
ptm <- proc.time()
for(i in 1:100) { # ideally 1:length(chrY) but this takes very long so I've only posted a few windows
result = feature.overlap(chrY, i, gene, extend.upstream=100000, extend.downstream=100000)
window.overlaps[[i]] <- result
progress(i)
}
proc.time() - ptm
all.overlaps = do.call(rbind, window.overlaps)
The code above will run right our of the box with this file (88mb).
Here is my attempt to speed up the look using a foreach
with the doParallel
library:
library(foreach)
library(doParallel)
cl<-makeCluster(8)
registerDoParallel(cl)
ptm <- proc.time()
ls<-foreach(i = 1:100, chrY=chrY, gene=gene, .packages=c('rtracklayer', 'GenomicRanges')) %dopar% {
result = feature.overlap(chrY, i, gene, extend.upstream=100000, extend.downstream=100000)
progress(i)
result
}
proc.time() - ptm
stopCluster(cl)
But, this codes does not work. The error returned is Error: this S4 class is not subsettable
and no output is produced from progress()
. ERROR FIXED - SEE EDIT
Again, the goal here is to write this in a more efficient manner. Once I have values
I can easily compute the metrics I need.
Any help would be much appreciated! Thank you!
EDIT: I've implement a working foreach loop with dopar, but it seems to be even slower than the above implementation.
library(foreach)
library(doParallel)
cl<-makeCluster(8)
registerDoParallel(cl)
ptm <- proc.time()
ls <- foreach(i = 1:100, .combine='rbind', .packages=c('rtracklayer', 'GenomicRanges')) %dopar% {
result = feature.overlap(chrY, i, gene, counts, extend.upstream=100000, extend.downstream=100000)
progress(i)
result
}
proc.time() - ptm
stopCluster(cl)
For 100 windows this takes about 10 seconds whereas the same windows being processing with the for loop above takes 6 seconds.
Upvotes: 1
Views: 448
Reputation: 11738
Okay, I managed to run your code.
So, basically, the 7 sec you see is due to overhead of parallelization. So the more you do, the more it will be useful to parallelize.
Then, what we can verify here, it is often inefficient to use more than half of the cores you have on your computer. So, if you have 8, you should use only 4.
Finally, if you have a very long loop, don't use the .combine
parameter because it will take blocks of 100 and combine them, which is slow. If you can, just do a Reduce
at the end.
Upvotes: 1