charliealpha

Reputation: 307

Improving performance of sapply function that simulates qbeta values

I use this sapply function:

set.seed(1)
data <- as.data.frame(matrix(runif(2000000, 0, 1), 1000000, 2,
                             dimnames = list(NULL, c("a", "b"))))

sapply(seq(0.0025, 0.9975, by = 0.005), function(x) qbeta(x, data$a, data$b))

It can take a long time since data can have 1 million rows; a and b are distinct random values for each row.

How can I improve the performance? Is it the lookup of the parameters from each row that takes the time, or is it simply unavoidable? I have already tried a parallel version; it cuts the time down, but it's still slow.

Some timings (I ran these on 38k rows):

> system.time(matrix(qbeta(rep(seq(0.0025, 0.9975, by=0.005),each=nrow(data)),data$a, data$b),nrow=nrow(data)))
   user  system elapsed 
  34.53    0.00   34.53 
> system.time(sapply(seq(0.0025, 0.9975, by=0.005), function (x) qbeta(x, data$a, data$b)))
   user  system elapsed 
  34.22    0.00   34.21 

Here is my parallel code:

library(parallel)

steps <- seq(0.0025, 0.9975, by = 0.005)

qbeta.func <- function(x, data) {
  # multiplies by a third column, `value`, not created in the snippet above
  qbeta(x, data$a, data$b) * data$value
}

cl <- makeCluster(rep("localhost", 4), type = "PSOCK")
t1 <- Sys.time()
data <- parSapply(cl, steps, qbeta.func, data)
stopCluster(cl)

Upvotes: 0

Views: 133

Answers (1)

nicola

Reputation: 24490

You can obtain your result without sapply, since qbeta is vectorized: repeat the grid values nrow(data) times and reshape. You end up with a matrix whose rows hold the qbeta values for the corresponding rows of data. Notice: this can still be slow given the huge amount of data. I don't think you can speed things up considerably from here, unless you parallelize or use a more powerful PC. Just try:

res<-matrix(qbeta(rep(seq(0.0025, 0.9975, by=0.005),
            each=nrow(data)),data$a, data$b),
            nrow=nrow(data))
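As a quick sanity check (on a toy data frame here, not the full data), the vectorized call gives exactly the same matrix as the sapply version:

```r
# toy data frame with the assumed columns a and b
set.seed(1)
small <- data.frame(a = runif(5), b = runif(5))
grid  <- seq(0.0025, 0.9975, by = 0.005)

vec  <- matrix(qbeta(rep(grid, each = nrow(small)), small$a, small$b),
               nrow = nrow(small))
loop <- sapply(grid, function(x) qbeta(x, small$a, small$b))

identical(vec, loop)
#[1] TRUE
```

The rep(..., each=) ordering matters: matrix() fills column-wise, so each grid value fills a whole column, matching sapply's column-per-grid-value layout.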

Edit

Here I'm going to give a simple parallel example using the doParallel package. We split the data data.frame into a list of chunks and then call the line above on each chunk. From the start:

#create the data (just 10000 rows; draw 20000 values so a and b are independent)
set.seed(1)
data<-as.data.frame(matrix(runif(20000,0,1),10000,2,dimnames=list(NULL,letters[1:2])))
#split into 10 chunks of 1000 rows each
dataSplit<-split(data,(seq_len(nrow(data))-1)%/%1000)
#define the function to make the qbeta calculation
qbetaVec<-function(grid,values)
           matrix(qbeta(rep(grid,each=nrow(values)),values$a,values$b),nrow=nrow(values))
#define the grid
grid<-seq(0.0025, 0.9975, by=0.005)
#full calculation
system.time(res<-qbetaVec(grid,data))   
# user  system elapsed 
#5.103   0.007   5.115
#now we parallelize
library(doParallel)
#set the number of clusters
cl <- makeCluster(8)
registerDoParallel(cl)
#now the calculation with foreach and dopar
system.time(res2<-foreach(i=1:10) %dopar% qbetaVec(grid,dataSplit[[i]]))
#  user  system elapsed 
# 0.026   0.019   1.404
#now we put it all together
res3<-do.call(rbind,res2)
identical(res3,res)
#[1] TRUE
#release the workers
stopCluster(cl)
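If you'd rather avoid the doParallel dependency, the same chunked computation can be sketched with only base R's parallel package (parLapply instead of foreach; the cluster size and chunk size here are arbitrary, and the setup is rebuilt so the snippet runs on its own):

```r
library(parallel)

# rebuild a small data set with columns a and b
set.seed(1)
data <- as.data.frame(matrix(runif(4000, 0, 1), 2000, 2,
                             dimnames = list(NULL, c("a", "b"))))
# split into chunks of 500 rows
dataSplit <- split(data, (seq_len(nrow(data)) - 1) %/% 500)

# same vectorized helper as above
qbetaVec <- function(grid, values)
  matrix(qbeta(rep(grid, each = nrow(values)), values$a, values$b),
         nrow = nrow(values))
grid <- seq(0.0025, 0.9975, by = 0.005)

cl <- makeCluster(2)
# workers need the helper and the grid in their global environment
clusterExport(cl, c("qbetaVec", "grid"))
res2 <- parLapply(cl, dataSplit, function(chunk) qbetaVec(grid, chunk))
stopCluster(cl)

# stack the per-chunk matrices back into one 2000 x 200 matrix
res3 <- do.call(rbind, res2)
```

Since split preserves row order within and across chunks, rbind restores the original row order, so the result matches the serial qbetaVec(grid, data) exactly.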

Upvotes: 7
