user2117258
user2117258

Reputation: 515

Speeding up a large for loop using a parallel foreach

I'm working on a project where I iterate over bins of the mouse genome and compute some overlaps using GenomicRanges/rtracklayer. Bins are already computed in 150 base/position increments. For those who are not familiar, this means that for chromosomeY, which happens to be one of the shortest mouse chromosomes, there are approximately 106,000 bins and the larger chromosomes contain about 1,300,000 bins! The goal here is to iterate through the bins, expand the range of the bins 100,000 positions in both directions, then figure out which genes overlap these window. I want to compute the distance from the center of the bin to the start of the genes contained within the expanded bin window.

Okay so here's the code I've written to this point. It runs without error and computes exactly what I need. The problem here is that it is slow and will take forever to iterate through 1M+ bins.

progress <- function(w) {
  # just a function that prints out the window being processed
  cat(sprintf(paste0(as.character(Sys.time()), ": Window ", w, " completed!\n")))
}

extend <- function(x, upstream=0, downstream=0) {
# this will expand a `GenomicRanges` object range 
  if (any(strand(x) == "*"))
    warning("'*' ranges were treated as '+'")
  on_plus <- strand(x) == "+" | strand(x) == "*"
  new_start <- start(x) - ifelse(on_plus, upstream, downstream)
  new_end <- end(x) + ifelse(on_plus, downstream, upstream)
  ranges(x) <- IRanges(new_start, new_end)
  trim(x)
}


feature.overlap <- function(x, window, genes, extend.upstream=100000, extend.downstream=100000) {

  # # test case
  # x = chrY; window = 2668 ; genes = gene; extend.upstream = 100000 ; extend.downstream = 100000

  # extend window of signal in both directions
  x.window = extend(x[window], extend.upstream, extend.downstream) 
  names(x.window) <- window

  # compute signal window overlap with genes 
  overlaps <- subsetByOverlaps(genes, x.window)  

  if(length(overlaps) == 0){

    values <- data.frame(signal_window=names(x.window), 
                         signal_start=max(0, start(x.window)), 
                         signal_center=max(0, start(x.window)) + floor((width(x.window) - 1)/2), 
                         signal_end=end(x.window), 
                         signal_score=x.window$score,
                         symbol=NA, 
                         gene_id=NA,
                         gene_chr=NA,
                         gene_start=NA,
                         gene_end=NA,
                         gene_strand=NA)  

  } else {

    hits <- findOverlaps(x.window, genes)
    s.idx <- unique(subjectHits(hits))
    q.idx <- unique(queryHits(hits))    

    values <- data.frame(signal_window=names(x.window)[q.idx], 
                         signal_start=max(0, start(x.window)[q.idx]), 
                         signal_center=max(0, start(x.window)[q.idx]) + floor((width(x.window)[q.idx] - 1)/2), 
                         signal_end=end(x.window)[q.idx], 
                         signal_score=x.window$score[q.idx],
                         mcols(overlaps)[,c(2,1)], 
                         gene_chr=chrom(genes)[s.idx],
                         gene_start=ifelse(strand(genes)[s.idx] == '+', start(genes)[s.idx], end(genes)[s.idx]) ,
                         gene_end=end(genes)[s.idx],
                         gene_strand=strand(genes)[s.idx])

  }

  return(values)

}

# Import data
library(rtracklayer)
merged_wig <- import.wig('~/file/linked/below.wig', format='wig', genome='mm9')
merged_wig <- keepSeqlevels(merged_wig, paste0('chr', c(seq(1,19), 'X', 'Y')))
chrY <- merged_wig[seqnames(merged_wig) == 'chrY'] 

# Generate gene info needed for computing overlap
library(TxDb.Mmusculus.UCSC.mm9.knownGene); library(Mus.musculus)
gene <- genes(TxDb.Mmusculus.UCSC.mm9.knownGene)
values(gene) <- merge(values(gene), as.data.frame(org.Mm.egSYMBOL), by='gene_id', all.x=T)
gene <- keepSeqlevels(gene, paste0('chr', c(seq(1,19), 'X', 'Y')))

# BEGIN LOOP GENOME WINDOWS *** TIME CONSUMING ***
window.overlaps <- list()
ptm <- proc.time()
for(i in 1:100) { # ideally 1:length(chrY) but this takes very long so I've only posted a few windows
  result = feature.overlap(chrY, i, gene, extend.upstream=100000, extend.downstream=100000)
  window.overlaps[[i]] <- result
  progress(i)
}
proc.time() - ptm
all.overlaps = do.call(rbind, window.overlaps)

The code above will run right our of the box with this file (88mb).

Here is my attempt to speed up the look using a foreach with the doParallel library:

library(foreach)
library(doParallel)
cl<-makeCluster(8)
registerDoParallel(cl)
ptm <- proc.time()
ls<-foreach(i = 1:100, chrY=chrY, gene=gene, .packages=c('rtracklayer', 'GenomicRanges')) %dopar% {
  result = feature.overlap(chrY, i, gene, extend.upstream=100000, extend.downstream=100000)
  progress(i)
  result
}
proc.time() - ptm
stopCluster(cl)

But, this codes does not work. The error returned is Error: this S4 class is not subsettable and no output is produced from progress(). ERROR FIXED - SEE EDIT

Again, the goal here is to write this in a more efficient manner. Once I have values I can easily compute the metrics I need.

Any help would be much appreciated! Thank you!

EDIT: I've implement a working foreach loop with dopar, but it seems to be even slower than the above implementation.

library(foreach)
library(doParallel)
cl<-makeCluster(8)
registerDoParallel(cl)
ptm <- proc.time()
ls <- foreach(i = 1:100, .combine='rbind', .packages=c('rtracklayer', 'GenomicRanges')) %dopar% { 
  result = feature.overlap(chrY, i, gene, counts, extend.upstream=100000, extend.downstream=100000)
  progress(i)
  result
}
proc.time() - ptm
stopCluster(cl)

For 100 windows this takes about 10 seconds whereas the same windows being processing with the for loop above takes 6 seconds.

Upvotes: 1

Views: 448

Answers (1)

F. Priv&#233;
F. Priv&#233;

Reputation: 11738

Okay, I managed to run your code.

  • For 100 runs, it takes
    • 6 sec with ncores = 1
    • 7 sec with ncores = 2
    • 8 sec with ncores = 3
    • 10 sec with ncores = 4
  • For 1000 runs, it takes
    • 56 secs to run sequentially
    • 33 sec with 2 cores
    • 33 sec with 3 cores

So, basically, the 7 sec you see is due to overhead of parallelization. So the more you do, the more it will be useful to parallelize.

Then, what we can verify here, it is often inefficient to use more than half of the cores you have on your computer. So, if you have 8, you should use only 4.

Finally, if you have a very long loop, don't use the .combine parameter because it will take blocks of 100 and combine them, which is slow. If you can, just do a Reduce at the end.

Upvotes: 1

Related Questions