Nikki

Reputation: 3

Parallelizing the R code using mclapply does not generate the correct results

I have a df and need to apply a function (calc.fitness) that gives a score to each column:

    df
  #         ch1     ch2   ch3    ch4   ch5    ch6   ch7   ch8
  # g1       5      2      7     10     7     10    10    6
  # g2       1      4      5      4     1     2     5     4
  # g3      16      14     7      4     2     2     8     7
  # g4       7      5      5      3     2     5     1     6
  # g5       7      2      1      3     7     2     4     1
  # g6       4      7      11     4     9     3     9     14
  # g7      12      8      6      7     5     9     7     4
  # g8       4      2      3      2     2     4     1     1
  # g9       1      2      1      1     2     1     2     1

Using sapply, I get the following results, which are correct, but this becomes very time-consuming as the size of df increases:

sapply(as.list(df), calc.fitness,filterTable=my.df)
#     ch1          ch2            ch3            ch4            ch5            ch6            ch7             ch8 
# 8.481359e-02  6.419552e-01   5.847587e-02   6.713477e-02   1.552056e-01   1.305787e+34   2.805074e-01    2.039931e+00 

I used mclapply to make it faster as follows:

library(parallel)  # provides detectCores() and mclapply()
numCores <- detectCores()
result <- unlist(mclapply(1:8, function(x) {
  return(calc.fitness(df[,x], filterTable=my.df))}, mc.preschedule = TRUE, mc.cores = numCores))

# result
# [1] 8.481359e-02 8.481359e-02 8.481359e-02 8.481359e-02 1.305787e+34 1.305787e+34 1.305787e+34 1.305787e+34

But as the results show, mclapply does not work correctly, and I do not know what the problem is or how to fix it. I would really appreciate any help!

PS: calc.fitness is a long method, I tried to make it shorter here:

calc.fitness <- function(df.val, filterTable = my.df) {
  input.path <- "/home/Nikki/Desktop/v2017.0/exec/Input_2017.txt"
  filterTable$xe <-  df.val[1]
  filterTable$xth <- df.val[2]
  filterTable$xfi <- df.val[3]
  filterTable$xfw <- df.val[4]
  filterTable$xfm <- df.val[5]
  filterTable$xls <- df.val[6]
  filterTable$xhls <- df.val[7]
  filterTable$xvt <- df.val[8]
  filterTable$xvd <- df.val[9]
  write.fwf(filterTable,append = TRUE,file = paste("Input_2017", ".txt", sep = ""),width = 25, rownames = F,colnames = F,quote = F)
  command <- "wine  /home/Nikki/Desktop/v2017.0/exec/2017File.exe"
  system(command)
  output.file <-read.table("/home/Nikki/Desktop/v2017.0/exec/Output_2017.txt",header = TRUE,fill = TRUE)
  output.pgt <- as.numeric(levels(output.file$pgt))[output.file$pgt]
  calc.sol <- output.pgt[!is.na(output.pgt)]
  opt.sol <- filterTable$PressureDropGL
  n <- length(calc.sol)
  subtract.val <- calc.sol - opt.sol
  denominator <- opt.sol
  sq.output <-  (subtract.val / denominator) ^ 2
  fitness.val <- sum(sq.output) / n
  return(fitness.val)
}# end of function

my.df:

[image of my.df not available]

Appreciate your help.

Upvotes: 0

Views: 898

Answers (2)

francois artin

Reputation: 36

If it works with sapply but not with mclapply, it is surely because sapply and lapply differ slightly: sapply simplifies its result, while lapply (and therefore mclapply) always returns a list. What you want is something like mcsapply rather than mclapply.

If that is the case, you will find an implementation of mcsapply in the following answer, which I use extensively in my own code:
multicore::sapply?

By the way, I think this question is a duplicate of that one.
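For illustration, here is a minimal sketch of what such a wrapper could look like. It is not the implementation from the linked answer; the name mcsapply, the demo.df data, and the n.cores setting are assumptions made for this example. It simply runs mclapply and then applies the same simplification step that sapply applies to the result of lapply.

```r
library(parallel)

# Hypothetical helper (not part of the parallel package): mclapply followed
# by the simplification step that sapply performs on lapply's result.
mcsapply <- function(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE,
                     mc.cores = getOption("mc.cores", 2L)) {
  FUN <- match.fun(FUN)
  answer <- mclapply(X, FUN, ..., mc.cores = mc.cores)
  # Same naming rule as sapply: character input names the result
  if (USE.NAMES && is.character(X) && is.null(names(answer))) {
    names(answer) <- X
  }
  if (!isFALSE(simplify) && length(answer) > 0L) {
    simplify2array(answer, higher = FALSE)
  } else {
    answer
  }
}

# Forking is not available on Windows, so fall back to one core there
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Small made-up data frame, only for demonstration
demo.df <- data.frame(a = 1:3, b = 4:6)

scores <- mcsapply(demo.df, sum, mc.cores = n.cores)      # named vector
running <- mcsapply(demo.df, cumsum, mc.cores = n.cores)  # 3 x 2 matrix
```

With a function that returns one number per column, the only visible difference from unlist(mclapply(...)) is that the column names are kept, so a wrapper like this changes the shape of the result but not the values computed in each worker.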

Upvotes: 0

user12728748

Reputation: 8506

sapply simplifies to a matrix, while unlisting your list of columns returns the vectors of each column, one after the other. Consider the cumsum function as an illustration:

df <-
    structure(
        list(
            ch1 = c(5L, 1L, 16L, 7L, 7L, 4L, 12L, 4L, 1L),
            ch2 = c(2L, 4L, 14L, 5L, 2L, 7L, 8L, 2L, 2L),
            ch3 = c(7L, 5L, 7L, 5L, 1L, 11L, 6L, 3L, 1L),
            ch4 = c(10L, 4L, 4L, 3L, 3L, 4L, 7L, 2L, 1L),
            ch5 = c(7L, 1L, 2L, 2L, 7L, 9L, 5L, 2L, 2L),
            ch6 = c(10L, 2L, 2L, 5L, 2L, 3L, 9L, 4L, 1L),
            ch7 = c(10L, 5L, 8L, 1L, 4L, 9L, 7L, 1L, 2L),
            ch8 = c(6L, 4L, 7L, 6L, 1L, 14L, 4L, 1L, 1L)
        ),
        class = "data.frame",
        row.names = c("g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9")
    )

sapply(as.list(df), cumsum)
#>       ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
#>  [1,]   5   2   7  10   7  10  10   6
#>  [2,]   6   6  12  14   8  12  15  10
#>  [3,]  22  20  19  18  10  14  23  17
#>  [4,]  29  25  24  21  12  19  24  23
#>  [5,]  36  27  25  24  19  21  28  24
#>  [6,]  40  34  36  28  28  24  37  38
#>  [7,]  52  42  42  35  33  33  44  42
#>  [8,]  56  44  45  37  35  37  45  43
#>  [9,]  57  46  46  38  37  38  47  44

unlist(parallel::mclapply(1:8, function(x) {
    return(cumsum(df[,x]))}, mc.preschedule = TRUE, mc.cores = 4L))
#>  [1]  5  6 22 29 36 40 52 56 57  2  6 20 25 27 34 42 44 46  7 12 19 24 25 36 42
#> [26] 45 46 10 14 18 21 24 28 35 37 38  7  8 10 12 19 28 33 35 37 10 12 14 19 21
#> [51] 24 33 37 38 10 15 23 24 28 37 44 45 47  6 10 17 23 24 38 42 43 44

do.call(cbind, parallel::mclapply(1:8, function(x) {
    return(cumsum(df[,x]))}, mc.preschedule = TRUE, mc.cores = 4L))
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#>  [1,]    5    2    7   10    7   10   10    6
#>  [2,]    6    6   12   14    8   12   15   10
#>  [3,]   22   20   19   18   10   14   23   17
#>  [4,]   29   25   24   21   12   19   24   23
#>  [5,]   36   27   25   24   19   21   28   24
#>  [6,]   40   34   36   28   28   24   37   38
#>  [7,]   52   42   42   35   33   33   44   42
#>  [8,]   56   44   45   37   35   37   45   43
#>  [9,]   57   46   46   38   37   38   47   44

Created on 2020-03-25 by the reprex package (v0.3.0)

Edit: After seeing your function: you are appending the data generated in it to a single file. That may work fine when done sequentially, but when you do it from parallel processes you are bound to run into trouble. Spawning multiple wine processes in parallel may also not be the most efficient approach to begin with, even if it yielded the correct results (profiling your sequential code with the profvis package would show you the bottleneck). Is there any alternative to 2017File.exe for calculating fitness.val?

If your plan really was to append the results from each column sequentially, then to parallelize the runs of your exe file properly you may have to save a unique copy of the sequentially growing file at each step (your write.fwf command), pass those copies to the exe command in parallel so that each run generates its own output file, and then load the results from those files in the correct order.
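A rough sketch of the "unique files per run" idea, under stated assumptions: run_isolated, the file names, and the small toy.df are all hypothetical, and the external program is replaced by a placeholder that just sums its input. Whether 2017File.exe can be pointed at a different input and output location is something I do not know, so treat this only as an outline of the file handling.

```r
library(parallel)

# Hypothetical stand-in for "write input, run the exe, read output".
# Each call uses its own directory, so workers never share a file.
run_isolated <- function(values, id) {
  work.dir <- file.path(tempdir(), sprintf("task_%03d_%d", id, Sys.getpid()))
  dir.create(work.dir, recursive = TRUE, showWarnings = FALSE)
  on.exit(unlink(work.dir, recursive = TRUE), add = TRUE)

  input.file  <- file.path(work.dir, "input.txt")
  output.file <- file.path(work.dir, "output.txt")

  # Write this task's input to its private file (no append = TRUE)
  writeLines(as.character(values), input.file)

  # Placeholder for the external program: sum the input and write it out.
  # The real code would call system() here, with the program reading from
  # and writing to work.dir (an assumption about the exe).
  total <- sum(as.numeric(readLines(input.file)))
  writeLines(as.character(total), output.file)

  # Read back this task's private output
  as.numeric(readLines(output.file))
}

# Made-up data, only for demonstration
toy.df <- data.frame(ch1 = c(5, 1, 16), ch2 = c(2, 4, 14), ch3 = c(7, 5, 7))

# Forking is not available on Windows, so fall back to one core there
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

result <- unlist(mclapply(seq_along(toy.df),
                          function(i) run_isolated(toy.df[[i]], i),
                          mc.cores = n.cores))
names(result) <- names(toy.df)
```

Because mclapply returns its results in the order of its input regardless of which worker finishes first, the values line up with the columns as long as each worker reads only its own output file.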

Upvotes: 1
