Big Data

Reputation: 15

How to do parallel processing in R?

I'm reading CSV files from a directory containing more than 100 files and then doing some processing on each one. I have an 8-core CPU, so I want to run this in parallel to finish faster.

I wrote some code, but it doesn't work for me (I'm using Linux):

library(data.table)
library(parallel)

# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(no_cores)

processFile <- function(f) {

  # read the file with data.table
  df <- fread(f, colClasses = c(NA, NA, NA, "NULL", "NULL", "NULL"))

  A <- parLapply(cl, sapply(windows, function(w) { return(numOverlaps(w, df)) }))

  stopCluster(cl)
}

files <- dir("/home/shared/", recursive=TRUE, full.names=TRUE, pattern=".*\\.txt$")

# Apply the function to all files.

result <- sapply(files, processFile)

As you can see, I want to run the function inside processFile (the call that computes A), but it doesn't work!

How can I run that function in parallel?

Upvotes: 0

Views: 699

Answers (1)

Roman Luštrik

Reputation: 70653

You have the concept on its head. You need to pass parLapply the list of files and let it work on them. The anonymous function should handle the entire processing of an individual file and return the desired result.

My suggestion would be to first make this work using a regular lapply or sapply, and only then power up the parallel backend, exporting all the necessary libraries and objects the workers may need.
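For example, a serial version might look like this (a minimal sketch; windows and numOverlaps are assumed to be defined elsewhere in your own code):

library(data.table)

# Process one file end to end and return the result.
processFile <- function(f) {
  # Keep only the first three columns, as in your fread call.
  df <- fread(f, colClasses = c(NA, NA, NA, "NULL", "NULL", "NULL"))
  # 'windows' and 'numOverlaps' come from your own code.
  sapply(windows, function(w) numOverlaps(w, df))
}

files <- dir("/home/shared/", recursive = TRUE, full.names = TRUE,
             pattern = ".*\\.txt$")

result <- lapply(files, processFile)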

parLapply(cl, X = files, fun = function(x, ...) {
  # ... code for processing the file
})
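Once the serial version works, a fuller parallel sketch could look like this (assuming processFile, windows, and numOverlaps as above; the cluster is created once, the needed objects are exported to the workers, and the cluster is stopped only after all files are done):

library(parallel)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)

# Each worker is a fresh R session: load packages and export
# the objects processFile needs before calling parLapply.
clusterEvalQ(cl, library(data.table))
clusterExport(cl, c("processFile", "windows", "numOverlaps"))

result <- parLapply(cl, X = files, fun = processFile)

# Stop the cluster once, after all files have been processed.
stopCluster(cl)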

Upvotes: 2
