Meso
Meso

Reputation: 1435

How to increase efficiency of a slow loop

I have a time series data of daily deaths and various environmental factors with about 8000 data points, 11 outcomes and 6 pollutants. There was no convergence problem when I run the models individually, but it took about 20 minutes to run each. In loop the entire model has never completed and terminated for unidentified reason. The code shown below was executed for two outcomes and two pollutants and the elapsed time was 4123.59.

I am working on a fairly fast windows PC with 4 processors and 16 GB of RAM and still the whole process is painfully slow. I would appreciate any suggestion on how to make the code efficient and fast. I have checked some related answers of forum, but none were applicable to my specific problem.

A code run on my data

outcome<-c("cardva" ,"respir") 
pollut1<-c("o3","no2")

m1 <- lapply(outcome, function(o){
  lapply(pollut1,function(v) {
    f<- sprintf("%s ~  s(trend,k=21*50,fx=F,bs='cr')+ s(temp,k=6,fx=F,bs='cr') + rh + 
   as.factor(dow) + s(fluepi,k=4,fx=F, bs='cr') + as.factor(holiday) + %s",o, v)
    gam(as.formula(f),family=quasipoisson, na.action=na.omit,data=mortdf)

  })
})

Sample code and data is shown below:

library(quantmod)
 library(mgcv)
 library(dlnm) 
 df <- chicagoNMMAPS
 outcome<- c("death", "cvd", "resp ")
 pollut1<-c("pm10" , "o3" ) 

 ptm <- proc.time()

 mod1<- lapply(outcome, function(o){
   lapply(pollut1,function(v) {
     f <- sprintf("%s~ s(time,bs='cr',k=14*50)+ s(temp,k=6, bs='cr') + as.factor(dow)   + %s",o, v) 
     gam(as.formula(f),family=quasipoisson,na.action=na.omit,data=df) 
  })}) 

 proc.time() - ptm
 user  system elapsed 
 991.02    8.89 1002.00

Upvotes: 0

Views: 206

Answers (3)

Steph Locke
Steph Locke

Reputation: 6146

I'm not familiar with the functionality you're using so what I've produced doesn't actually work (mainly - trying to use the outcomes etc doesn't seem to exist in the sample dataset), but it's hopefully illsutrative.

doParallel (doMC could be used for linux) is like the snowfall package in @f3lix's answer and foreach provides some excellent parallel iterators. I've prepared a list of combinations for checking used foreach here because I think it is a simpler system to use than nesting lapply. Hopefully, it will provide you with some useful material for doing this calculation.

library(quantmod)
 library(mgcv)
 library(dlnm) 
 df <- chicagoNMMAPS
 outcome<- c("death", "cvd", "resp ")
 pollut1<-c("pm10" , "o3" ) 

library("doParallel")
library("foreach")
registerDoParallel(cores=8)

combinations<-expand.grid(outcome,pollut1)
mod1<- foreach(o=combinations, .combine='list') %dopar% {
    f <- as.formula(paste0("~ s(time,bs='cr',k=14*50)+ s(temp,k=6, bs='cr') + as.factor(dow) + ",o["Var1"], o["Var2"]) )
    gam(f,family=quasipoisson,na.action=na.omit,data=df) 
}

http://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf

Upvotes: 0

jlhoward
jlhoward

Reputation: 59355

Do you really need 14*50 = 700 dimensions in your basis set? That's the reason it is taking so long. It looks like t ~ O(k^2).

# 700 dimensions: 8 minutes, gcv = 1.22
f = death~ s(time,bs='cr',k=14*50)+ s(temp, bs='cr', k=6) + as.factor(dow)+pm10
system.time(g <- gam(f,family=quasipoisson,na.action=na.omit,data=df)) 
   user  system elapsed 
 457.66    2.17  461.90 
g$gcv
[1] 1.222779

# 200 dimentsions: 48 seconds; gcv = 1.25
f.new = death~ s(time,bs='cr',k=200)+ s(temp, bs='cr', k=6) + as.factor(dow)+pm10
system.time(g <- gam(f.new,family=quasipoisson,na.action=na.omit,data=df)) 
   user  system elapsed 
  47.93    0.07   48.04 
g$gcv
[1] 1.252921

# 100 dimensions: 15 seconds, gcv - 1.30
f.new = death~ s(time,bs='cr',k=100)+ s(temp, bs='cr', k=6) + as.factor(dow)+pm10
system.time(g <- gam(f.new,family=quasipoisson,na.action=na.omit,data=df))
   user  system elapsed 
  15.31    0.05   15.39 
g$gcv
[1] 1.297332

Upvotes: 1

f3lix
f3lix

Reputation: 29877

You can use one of the various R packages to run lapply in parallel on multiple processors. See for example sfLapply() from the snowfall package. If you were using Linux you could also easily use mclapply() instead of lapply()

Upvotes: 0

Related Questions