nathanf

Reputation: 21

parallel regression in R (maybe with snowfall)

I'm trying to fit a regression in R in parallel. I've been experimenting with the snowfall library, but I'm open to any approach. The regression below currently takes an extremely long time to run. Can someone show me how to parallelise it?

 sales_day_region_ctgry_lm <- lm(log(sales_out+1)~factor(region_out) 
             + date_vector_out + factor(date_vector_out) +
             factor(category_out) + mean_temp_out)

I've started down the following path:

library(snowfall)
sfInit(parallel = TRUE, cpus=4, type="SOCK")

wrapper <- function() {
return(lm(log(sales_out+1)~factor(region_out) + date_vector_out +
               factor(date_vector_out) + factor(category_out) +   mean_temp_out))
}

output_lm <- sfLapply(*no idea what to do here*,wrapper)
sfStop()
summary(output_lm)

But this approach is riddled with errors.

Thanks!

Upvotes: 2

Views: 2637

Answers (2)

Grant

Reputation: 1636

The partools package offers an easy, off-the-shelf implementation of parallelised linear regression via its calm() function. (The "ca" prefix stands for "chunk averaging".)

In your case -- leaving aside @Roland's correct comment about mixing up factor and continuous predictors -- the solution should be as simple as:

library(partools)
#library(parallel) ## loads as dependency

cls <- makeCluster(4) ## Or, however many cores you want/have.

sales_day_region_ctgry_calm <- 
  calm(
    cls, 
    "log(sales_out+1) ~ factor(region_out) + date_vector_out + 
     factor(date_vector_out) + factor(category_out) + mean_temp_out, 
     data=YOUR_DATA_HERE"
    )

Note that the arguments you would normally pass to lm() (the formula and the data) are supplied to calm() as a single quoted string. Note further that you may need to randomise the row order of your data first if it is ordered in any way (e.g. by date). See the partools vignette for more details.
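To make the "chunk averaging" idea concrete, here is a minimal sketch of it using only the base parallel package. This is not how partools implements calm() internally; it only illustrates the principle (shuffle, split, fit each chunk on a worker, average the coefficients). The data frame dat, its columns x1, x2, y, and the worker and chunk counts are made up for illustration.

```r
library(parallel)

# Simulated stand-in data (hypothetical; replace with your own data frame).
set.seed(1)
n <- 20000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

# Shuffle the rows so each chunk is a random sample, then split into chunks.
n_chunks <- 4
dat <- dat[sample(nrow(dat)), ]
chunks <- split(dat, rep_len(seq_len(n_chunks), nrow(dat)))

# Fit the same model on every chunk, spreading the chunks over the workers.
cl <- makeCluster(2)  # assumed worker count; adjust to your machine
chunk_coefs <- parLapply(cl, chunks, function(d) {
  coef(lm(y ~ x1 + x2, data = d))
})
stopCluster(cl)

# Chunk averaging: the estimate is the mean of the per-chunk coefficients.
ca_coefs <- Reduce(`+`, chunk_coefs) / n_chunks

# Compare against the ordinary single-process fit on the full data.
full_coefs <- coef(lm(y ~ x1 + x2, data = dat))
print(rbind(chunk_average = ca_coefs, full_fit = full_coefs))
```

One caveat with this hand-rolled version: with factor predictors, a chunk that happens to miss a level will return a coefficient vector with different entries, so the simple average above would no longer line up. That is another reason to shuffle the data and keep chunks large.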

Upvotes: 4

Hong Ooi

Reputation: 57697

Since you're fitting one big model (as opposed to several small models), and you're using linear regression, a quick-and-easy way to get parallelism is to use a multithreaded BLAS. Something like Microsoft R Open (previously known as Revolution R Open) should do the trick.*

* disclosure: I work for Microsoft/Revolution.
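Before switching R distributions, it may help to check which BLAS your current session is linked against. The sketch below assumes R 3.4.0 or later (needed for the "BLAS" entry of extSoftVersion() and for La_library()). The matrix size is arbitrary, and the crossprod() timing is only a rough proxy for BLAS-bound work, not a measurement of lm() itself.

```r
# Report which BLAS/LAPACK libraries this R session is linked against.
blas_path <- extSoftVersion()["BLAS"]
lapack_path <- La_library()
print(blas_path)
print(lapack_path)

# Time a BLAS-bound operation (X'X) on an arbitrary test matrix.
# Re-run this under a different BLAS to compare.
set.seed(1)
X <- matrix(rnorm(2000 * 500), nrow = 2000)
timing <- system.time(XtX <- crossprod(X))
print(timing)
```

If the reported path points at R's bundled reference BLAS (typically a file named libRblas), the session is not using an optimised multithreaded library.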

Upvotes: 3
