Reputation: 2498
I'm using a function to get p-values from multiple HWE (Hardy-Weinberg equilibrium) chi-square tests. To do this, I loop through a large matrix called geno.data (313 rows x 355232 columns), taking two columns at a time and processing them row by row. It runs very slowly. How can I make it faster? Thanks
library(genetics)
geno.data<-matrix(c("a","c"), nrow=313,ncol=355232)
Num_of_SNPs<-ncol(geno.data) /2
alleles<- vector(length = nrow(geno.data))
HWE_pvalues<-vector(length = Num_of_SNPs)
j<- 1
for (count in 1:Num_of_SNPs){
for (i in 1:nrow(geno.data)){
alleles[i]<- levels(genotype(paste(geno.data[i,c(2*j -1, 2*j)], collapse = "/")))
}
g2 <- genotype(alleles)
HWE_pvalues[count]<-HWE.chisq(g2)[3]
j = j + 2
}
Upvotes: 1
Views: 283
Reputation: 44320
First, note that the posted code will result in an index-out-of-bounds error: j grows by 2 on every pass, so by the final iteration of the main loop it has reached ncol(geno.data)-1, while you're accessing columns 2*j-1 and 2*j, which run past the last column. I'm assuming you instead want columns 2*count-1 and 2*count, in which case j can be removed.
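To make the indexing problem concrete, here is a small sketch on toy dimensions (n_cols = 10 is an assumption chosen for illustration, not the real matrix width). It compares the last column touched under the j scheme from the question with the count scheme suggested above:

```r
# Toy dimensions (hypothetical), standing in for the 313 x 355232 matrix.
n_cols <- 10
n_snps <- n_cols / 2

# Scheme from the question: j starts at 1 and grows by 2 each iteration,
# and the loop reads columns 2*j-1 and 2*j.
j <- seq(1, by = 2, length.out = n_snps)
buggy_last_col <- 2 * j

# Suggested scheme: index directly by the loop counter.
count <- seq_len(n_snps)
fixed_last_col <- 2 * count

# Iterations whose column index exceeds the matrix width.
out_of_range <- which(buggy_last_col > n_cols)
print(out_of_range)
```

Note that the j scheme also skips column pairs before it ever runs off the end of the matrix, so the iterations that do complete would be testing the wrong SNPs.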
Vectorization is extremely important for writing fast R code. In your code you're calling the paste function 313 times per SNP, each time passing vectors of length 1. It's much faster in R to call paste once, passing vectors of length 313. Here are the original and vectorized interiors of the main for loop:
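As a minimal base-R illustration of the same idea, independent of the genetics package, the sketch below builds genotype strings for one SNP both ways on a small made-up matrix (the name geno and its contents are assumptions for the example) and checks that the results agree:

```r
# Toy allele matrix (hypothetical data): 6 individuals, one SNP in 2 columns.
set.seed(1)
geno <- matrix(sample(c("a", "c"), 12, replace = TRUE), nrow = 6, ncol = 2)

# Row by row: one paste() call per individual.
looped <- character(nrow(geno))
for (i in seq_len(nrow(geno))) {
  looped[i] <- paste(geno[i, 1:2], collapse = "/")
}

# Vectorized: a single paste0() call over whole columns.
vectorized <- paste0(geno[, 1], "/", geno[, 2])

# Both approaches produce the same character vector.
same <- identical(looped, vectorized)
print(same)
```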
# Original
get.pval1 <- function(count) {
for (i in 1:nrow(geno.data)){
alleles[i]<- levels(genotype(paste(geno.data[i,c(2*count -1, 2*count)], collapse = "/")))
}
g2 <- genotype(alleles)
HWE.chisq(g2)[3]
}
# Vectorized
get.pval2 <- function(count) {
g2 <- genotype(paste0(geno.data[,2*count-1], "/", geno.data[,2*count]))
HWE.chisq(g2)[3]
}
We get about a 20x speedup from the vectorization:
library(microbenchmark)
all.equal(get.pval1(1), get.pval2(1))
# [1] TRUE
microbenchmark(get.pval1(1), get.pval2(1))
# Unit: milliseconds
# expr min lq mean median uq max neval
# get.pval1(1) 299.24079 304.37386 323.28321 307.78947 313.97311 482.32384 100
# get.pval2(1) 14.23288 14.64717 15.80856 15.11013 16.38012 36.04724 100
With the vectorized code, your code should finish in about 177616*.01580856 = 2807.853 seconds, or about 47 minutes (compared to about 16 hours for the original code). If this is still not fast enough for you, then I would encourage you to look at the parallel package in R. The mcmapply function should give a good speedup for you, since each iteration of the outer for loop is independent.
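A rough sketch of what that could look like follows. To keep it runnable without the genetics package, it uses a small toy matrix and a stand-in per-SNP function (toy.data, het.fraction, and n_cores are all hypothetical names for this example); in practice you would pass get.pval2 instead. The core count of 2 is an arbitrary assumption, and because mcmapply relies on forking, the sketch falls back to a single core on Windows.

```r
library(parallel)

# Small toy matrix (hypothetical size) in the same layout as geno.data.
toy.data <- matrix(c("a", "c"), nrow = 313, ncol = 20)
n_snps <- ncol(toy.data) / 2

# Stand-in for get.pval2: fraction of heterozygous individuals at one SNP.
# Replace this with the real per-SNP computation.
het.fraction <- function(count) {
  a1 <- toy.data[, 2 * count - 1]
  a2 <- toy.data[, 2 * count]
  mean(a1 != a2)
}

# Forking is unavailable on Windows, so use one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each SNP is processed independently, so the calls can be spread over cores.
results <- mcmapply(het.fraction, seq_len(n_snps), mc.cores = n_cores)

# Serial reference for comparison.
serial <- sapply(seq_len(n_snps), het.fraction)
print(all.equal(results, serial))
```

How much this helps depends on the number of cores available and on the per-task overhead, so it is worth benchmarking on a subset of SNPs before running the full matrix.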
Upvotes: 4