Tomas Greif

Reputation: 22623

Parallel predict

I am trying to run predict() in parallel on my Windows machine. This works on smaller datasets, but it does not scale well, because a new copy of the data frame is created for each process. Is there a way to run in parallel without making temporary copies?

My code (only a few modifications of the original code):

library(foreach)
library(doSNOW)

fit <- lm(Employed ~ ., data = longley)
scale <- 100
longley2 <- longley[rep(seq(nrow(longley)), scale), ]

num_splits <- 4
cl <- makeCluster(num_splits)
registerDoSNOW(cl)

# Assign each row of longley2 to one of num_splits groups
split_testing <- sort(rank(1:nrow(longley2)) %% num_splits)

predictions <- foreach(i = unique(split_testing),
                       .combine = c, .packages = c("stats")) %dopar% {
  predict(fit, newdata = longley2[split_testing == i, ])
}
stopCluster(cl)

I am using simple data replication to test it. With scale 10 or 1000 it works, but I would like to make it run with scale <- 1000000 - a data frame with 16M rows (1.86 GB, as indicated by object_size() from the pryr package). Note that, if necessary, I can also use a Linux machine, if that is the only option.

Upvotes: 7

Views: 2338

Answers (1)

Steve Weston

Reputation: 19667

You can use the isplitRows function from the itertools package to send only the fraction of longley2 that is needed for the task:

library(itertools)

predictions <-
  foreach(d=isplitRows(longley2, chunks=num_splits),
          .combine=c, .packages=c("stats")) %dopar% {
    predict(fit, newdata=d)
  }

Because foreach only sends each iterator chunk to the worker that processes it, this prevents the entire longley2 data frame from being automatically exported to every worker, and it simplifies the code a bit.
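For reference, isplitRows returns an iterator over contiguous row chunks, which you can step through with nextElem() from the iterators package. A minimal standalone sketch (using the built-in mtcars data rather than longley2, purely for illustration):

```r
library(itertools)   # provides isplitRows()
library(iterators)   # provides nextElem()

# mtcars has 32 rows; splitting into 4 chunks yields 8 rows per chunk
it <- isplitRows(mtcars, chunks = 4)

chunk <- nextElem(it)  # first chunk: rows 1-8 of mtcars
nrow(chunk)            # 8
```

You can also pass chunkSize= instead of chunks= if you prefer to fix the number of rows per piece rather than the number of pieces.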

Upvotes: 8

Related Questions