Matthya C

Reputation: 83

Convert R apply statement to lapply for parallel processing

I have the following R "apply" statement:

for (i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) {
    matrix_of_sums[, i] <-
        apply(simulation_results[, colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i, ]], 1, sum)
}

So, I have the following data structures:

simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).

dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.

matrix_of_sums: When the function is run, a 2000 row x 250,000 column (# of simulations x items being simulated) structure meant to hold the simulation results.

So, the apply function is looking up the dataframe column values for each row in a 250,000-row data set, computing the sum, and storing it in the matrix_of_sums data structure.
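For reference, here is a minimal toy version of the setup described above (the names `sim` and `lookup` and all sizes are made up for illustration; the real structures are far larger):

```r
# Toy stand-ins for the real structures: 5 simulations (rows) x 6 columns
sim <- matrix(1:30, nrow = 5, ncol = 6,
              dimnames = list(NULL, paste0("item", 1:6)))

# Each row of the lookup table names the simulation columns to sum
lookup <- data.frame(a = c("item1", "item2"),
                     b = c("item3", "item5"),
                     stringsAsFactors = FALSE)

# Same loop structure as the question, on the toy data
sums <- matrix(nrow = nrow(sim), ncol = nrow(lookup))
for (i in 1:nrow(lookup)) {
    sums[, i] <- apply(sim[, colnames(sim) %in% lookup[i, ]], 1, sum)
}
# sums[, 1] is 12 14 16 18 20 (columns item1 + item3 of sim)
```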

Unfortunately, this processing takes a very long time. I have explored the use of rowSums as an alternative, and it has cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above to "lapply" from "apply"?

Thanks!

Upvotes: 4

Views: 15644

Answers (2)

Carl Boneri

Reputation: 2722

Without any applicable sample data to go off of, the process would look like this:

  • Create a holding matrix (matrix_of_sums)
  • Loop by row through the variable table (dataframe_stuff_that_needs_lookup_from_simulation)
  • Find the matching column indices within the simulation model (simulation_results)
  • Bind the rowSums into the holding matrix (matrix_of_sums)

I recreated a sample set; the data are meaningless, but the same process should work for your data:

# mclapply() forks worker processes, so the global assignment operator `<<-`
# would not propagate changes back to the parent session; instead, return
# each row-sum vector and bind the list into the holding matrix afterwards.
sums_list <- parallel::mclapply(1:nrow(ts_df), function(i){
   # Store the row to its own variable for ease
   d <- ts_df[i,]
   # Row sums over the matching simulation columns
   rowSums(sim_df[, which(colnames(sim_df) %in% colnames(d))])
}, mc.cores = parallel::detectCores())
# Holding matrix which is our end-goal: one column per lookup row
msums <- do.call(cbind, sums_list)

Upvotes: 0

CPak

Reputation: 13581

With base R parallel, try:

library(parallel)
cl <- makeCluster(detectCores())
# PSOCK workers start with empty workspaces, so export the data to them first
clusterExport(cl, c("simulation_results",
                    "dataframe_stuff_that_needs_lookup_from_simulation"))
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
    rowSums(simulation_results[,colnames(simulation_results) %in% 
        dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)

You could also try foreach with %dopar%:

library(doParallel)  # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
    rowSums(simulation_results[,colnames(simulation_results) %in% 
    dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)

I wasn't quite sure what output format you wanted at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else.
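One note on assembling the output: with 250,000 list elements, Reduce("cbind", ...) re-copies the growing matrix on every step, so a single do.call(cbind, ...) is usually much faster. A sketch with a toy list (the three short vectors stand in for the parLapply/foreach results):

```r
# Toy list of row-sum vectors standing in for the parallel output
matrix_of_sums <- list(c(1, 2), c(3, 4), c(5, 6))

# Bind all columns in one call instead of growing the matrix pairwise
ans <- do.call(cbind, matrix_of_sums)
dim(ans)  # 2 rows x 3 columns
```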

Upvotes: 7
