ramen

Reputation: 790

Which R package is suitable for parallel computing? How to speed up the given code?

Which R package is the best for parallel computing?

I am interested in recommendations focused on processing time, vectorisation, and user-friendly syntax (looking for information valid as of January 2022, including recent packages).

And here is the second part of the question:

I'd like to iterate through many (thousands) of vectors in a large list, so I'd like to speed it up.

Here is an example of what I can do. Inside the for loop I put a simple rescaling anonymous function just to provide a reprex; in real life I have more complicated computations to do. Can something like this be sped up? How?

```{r}
library(foreach)
library(doParallel)

#this is dummy list of vecs:
set.seed(123)

v1 <- sample(600:800, 108, replace=TRUE)
v2 <- sample(600:800, 120, replace=TRUE)
v3 <- sample(550:800, 200, replace=TRUE)
v4 <- sample(640:800, 120, replace=TRUE)
v5 <- sample(700:810, 131, replace=TRUE)
v6 <- sample(600:800, 220, replace=TRUE)
v7 <- sample(600:850, 149, replace=TRUE)
v8 <- sample(530:800, 144, replace=TRUE)
v9 <- sample(600:810, 129, replace=TRUE)
v10 <- sample(600:860, 170, replace=TRUE)

list1 <- list()

list1[["first"]] <- v1
list1[["named"]] <- v2
list1[["vector"]] <- v3
list1[["out"]] <- v4
list1[["of"]] <- v5
list1[["many"]] <- v6
list1[["within"]] <- v7
list1[["this"]] <- v8
list1[["dummy"]] <- v9
list1[["list"]] <- v10


#this function rescales each vector in a given list to the 0-255 range

parallelism_test <- function(lst){
  # split the list indices into chunks of 100
  index_list <- split(seq_along(lst), ceiling(seq_along(lst)/100))
  
  # register the workers once, outside the loop
  doParallel::registerDoParallel(cores = 5)
  
  # create empty list for the extracted data
  newlist <- list()
  
  # loop over chunks, rescaling each vector in parallel
  for (indx in seq_along(index_list)){
    newlist <- c(newlist,
                 foreach::foreach(nam = names(lst)[index_list[[indx]]]) %dopar% {
                   x <- lst[[nam]]
                   (x - min(x)) / (max(x) - min(x)) * 255
                 })
  }
  
  return(newlist)
}


test <- parallelism_test(list1)
print(test)
```

I would appreciate any advice.

Upvotes: 0

Views: 949

Answers (1)

Jan van der Laan

Reputation: 8105

There are quite a number of packages for parallel computing, but I often prefer the built-in parallel package.

In this case the code is quite clean:

```{r}
vscale <- function(x) {
  (x - min(x)) / (max(x) - min(x)) * 255
}

library(parallel)
cl <- makeCluster(4)

list2 <- parLapply(cl, list1, vscale)

# shut the workers down when done
stopCluster(cl)
```
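On Unix-alikes, `parallel::mclapply` is a fork-based alternative that skips the cluster setup entirely. A minimal sketch (the dummy `list1` below stands in for your real list; on Windows `mc.cores` must be 1, so the sketch falls back to serial there):

```r
library(parallel)

# dummy data standing in for the real list of vectors
set.seed(123)
list1 <- list(a = sample(600:800, 100, replace = TRUE),
              b = sample(550:850, 150, replace = TRUE))

vscale <- function(x) {
  (x - min(x)) / (max(x) - min(x)) * 255
}

# fork-based parallelism; forking is unavailable on Windows
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L
list2 <- mclapply(list1, vscale, mc.cores = n_cores)
```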

Although in this specific case, I suspect that storing your data in, for example, a data.table and working with that will often be faster. data.table also uses multiple threads where possible. In the example below, I first create the data.table from the original list; it would, of course, be better to make sure the data ends up in a data.table from the beginning.

```{r}
library(data.table)

dta <- lapply(names(list1), 
  function(col) data.table(group = col, value = list1[[col]]))
dta <- rbindlist(dta)

# rescale within each group, matching the per-vector behaviour above
dta[, value := vscale(value), by = group]
```
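If a plain list is needed afterwards, `split` recovers the original structure. A self-contained sketch (the dummy `list1` stands in for the real data, and the rescaling is done per group so each original vector is scaled on its own):

```r
library(data.table)

# dummy data standing in for the real list of vectors
set.seed(123)
list1 <- list(a = sample(600:800, 100, replace = TRUE),
              b = sample(550:850, 150, replace = TRUE))

vscale <- function(x) (x - min(x)) / (max(x) - min(x)) * 255

# stack the vectors into one long-format data.table
dta <- rbindlist(lapply(names(list1),
  function(col) data.table(group = col, value = list1[[col]])))

# rescale within each group, then split back into a named list
dta[, value := vscale(value), by = group]
list2 <- split(dta$value, dta$group)
```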

Upvotes: 1
