Reputation: 1435
I have time-series data of daily deaths and various environmental factors, with about 8000 data points, 11 outcomes, and 6 pollutants. There was no convergence problem when I ran the models individually, but each took about 20 minutes to run. Run in a loop, the full set of models has never completed and terminates for no identified reason. The code shown below was executed for two outcomes and two pollutants, and the elapsed time was 4123.59 seconds.
I am working on a fairly fast Windows PC with 4 processors and 16 GB of RAM, and the whole process is still painfully slow. I would appreciate any suggestion on how to make the code efficient and fast. I have checked some related answers on the forum, but none were applicable to my specific problem.
The code run on my data:
outcome <- c("cardva", "respir")
pollut1 <- c("o3", "no2")

m1 <- lapply(outcome, function(o) {
  lapply(pollut1, function(v) {
    f <- sprintf("%s ~ s(trend,k=21*50,fx=F,bs='cr') + s(temp,k=6,fx=F,bs='cr') + rh +
                  as.factor(dow) + s(fluepi,k=4,fx=F,bs='cr') + as.factor(holiday) + %s", o, v)
    gam(as.formula(f), family = quasipoisson, na.action = na.omit, data = mortdf)
  })
})
Sample code and data are shown below:
library(quantmod)
library(mgcv)
library(dlnm)

df <- chicagoNMMAPS
outcome <- c("death", "cvd", "resp")
pollut1 <- c("pm10", "o3")

ptm <- proc.time()
mod1 <- lapply(outcome, function(o) {
  lapply(pollut1, function(v) {
    f <- sprintf("%s ~ s(time,bs='cr',k=14*50) + s(temp,k=6,bs='cr') + as.factor(dow) + %s", o, v)
    gam(as.formula(f), family = quasipoisson, na.action = na.omit, data = df)
  })
})
proc.time() - ptm
user system elapsed
991.02 8.89 1002.00
Upvotes: 0
Views: 206
Reputation: 6146
I'm not familiar with the functionality you're using, so I haven't been able to test what I've produced against your data, but it's hopefully illustrative.
doParallel (doMC could be used on Linux) is similar to the snowfall package in @f3lix's answer, and foreach provides some excellent parallel iterators. I've prepared a table of outcome/pollutant combinations and used foreach here because I think it is a simpler system than nesting lapply calls. Hopefully it will give you some useful material for doing this calculation.
library(quantmod)
library(mgcv)
library(dlnm)
library(doParallel)
library(foreach)

df <- chicagoNMMAPS
outcome <- c("death", "cvd", "resp")
pollut1 <- c("pm10", "o3")

# One worker per core (the question mentions a 4-core machine).
registerDoParallel(cores = 4)

# One row per outcome/pollutant pair.
combinations <- expand.grid(outcome = outcome, pollutant = pollut1,
                            stringsAsFactors = FALSE)

# Fit each combination in parallel; foreach returns a list of fitted models.
mod1 <- foreach(i = seq_len(nrow(combinations))) %dopar% {
  o <- combinations$outcome[i]
  v <- combinations$pollutant[i]
  f <- as.formula(paste0(o, " ~ s(time,bs='cr',k=14*50) + s(temp,k=6,bs='cr') + as.factor(dow) + ", v))
  gam(f, family = quasipoisson, na.action = na.omit, data = df)
}
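As a possible follow-up (just one way of doing it, assuming the combinations table above), you can label the resulting list for easier lookup and release the workers when you are done:

# Label each fitted model with its outcome/pollutant pair for easy lookup.
names(mod1) <- paste(combinations$outcome, combinations$pollutant, sep = "_")
summary(mod1[["death_pm10"]])

# Release the workers registered with registerDoParallel().
stopImplicitCluster()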
http://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf
Upvotes: 0
Reputation: 59355
Do you really need 14*50 = 700 dimensions in your basis set? That's the reason it is taking so long; the run time looks like t ~ O(k^2).
# 700 dimensions: 8 minutes, gcv = 1.22
f = death~ s(time,bs='cr',k=14*50)+ s(temp, bs='cr', k=6) + as.factor(dow)+pm10
system.time(g <- gam(f,family=quasipoisson,na.action=na.omit,data=df))
user system elapsed
457.66 2.17 461.90
g$gcv
[1] 1.222779
# 200 dimensions: 48 seconds, gcv = 1.25
f.new = death~ s(time,bs='cr',k=200)+ s(temp, bs='cr', k=6) + as.factor(dow)+pm10
system.time(g <- gam(f.new,family=quasipoisson,na.action=na.omit,data=df))
user system elapsed
47.93 0.07 48.04
g$gcv
[1] 1.252921
# 100 dimensions: 15 seconds, gcv = 1.30
f.new = death~ s(time,bs='cr',k=100)+ s(temp, bs='cr', k=6) + as.factor(dow)+pm10
system.time(g <- gam(f.new,family=quasipoisson,na.action=na.omit,data=df))
user system elapsed
15.31 0.05 15.39
g$gcv
[1] 1.297332
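A suggested extra check (not part of the original answer): mgcv's gam.check() reports basis-dimension diagnostics for each smooth, which can help judge whether the smaller k is still adequate rather than relying on GCV alone.

# Basis-dimension diagnostics: if edf is well below k' and the k-index is
# close to 1, the smaller basis is probably sufficient.
gam.check(g)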
Upvotes: 1
Reputation: 29877
You can use one of the various R packages to run lapply in parallel on multiple processors. See, for example, sfLapply() from the snowfall package. If you were using Linux, you could also easily use mclapply() instead of lapply().
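As a rough, untested sketch (using the same chicagoNMMAPS example as above), the nested lapply could be replaced with a single sfLapply over the outcome/pollutant combinations:

library(snowfall)
library(mgcv)
library(dlnm)

df <- chicagoNMMAPS
combos <- expand.grid(outcome = c("death", "cvd", "resp"),
                      pollutant = c("pm10", "o3"),
                      stringsAsFactors = FALSE)

sfInit(parallel = TRUE, cpus = 4)  # one worker per core
sfLibrary(mgcv)                    # load mgcv on each worker
sfExport("df", "combos")           # ship the data to the workers

mod <- sfLapply(seq_len(nrow(combos)), function(i) {
  f <- sprintf("%s ~ s(time,bs='cr',k=14*50) + s(temp,k=6,bs='cr') + as.factor(dow) + %s",
               combos$outcome[i], combos$pollutant[i])
  gam(as.formula(f), family = quasipoisson, na.action = na.omit, data = df)
})

sfStop()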
Upvotes: 0