Gurkenkönig

Reputation: 828

How to speed up parallel foreach in R

I want to run a series of approximately 1,000,000 Wilcoxon tests (wilcox.test) in R:

result <- foreach(i = 1:ncol(data), .combine=bind_rows, .multicombine= TRUE, .maxcombine = 1000  ) %do% { 

w = wilcox.test(data[,i]~as.factor(groups),exact = FALSE)

df <- data.frame(Characters=character(),
                   Doubles=double(),
                   Doubles=double(),
                   stringsAsFactors=FALSE)

  df[1,] = c(colnames(data)[i], w$statistic, w$p.value)

  rownames(df) = colnames(beta_t1)[i]
  colnames(df) = c("cg", "statistic", "p.value")

  return(df)

}

If I run it with %dopar% on 15 cores, it is slower than with single-core %do%. I suspect a memory access problem; the processors are also barely utilized. Is it possible to split the data data.frame into chunks, have each processor compute about 100K tests, and then combine the results? How can I speed up this foreach loop?

Upvotes: 0

Views: 601

Answers (1)

Konrad Rudolph

Reputation: 546213

One thing that’s immediately striking is that you use eight lines to create and return a data.frame where a single expression is sufficient:

data.frame(
    cg = colnames(data)[i],
    statistic = w$statistic,
    p.value = w$p.value,
    row.names = colnames(beta_t1)[i],
    stringsAsFactors = FALSE
)

The bigger issue, however, is that once the loop has run, foreach has to row-bind all of these single-row data.frames, and that operation is slow. It’s more efficient to return a list of the p-values and statistics and forget about the row and column names (these can be provided afterwards, and then don’t require subsetting and re-concatenation).

That is, change your code to

result = foreach(col = data) %do% {
    w = wilcox.test(col ~ as.factor(groups), exact = FALSE)
    list(w$statistic, w$p.value)
}

# Combine result and transform it into a data.frame:
results = data.frame(
    cg = colnames(data),
    # `result` is the list returned by the foreach loop above
    statistic = vapply(result, `[[`, double(1L), 1L),
    p.value = vapply(result, `[[`, double(1L), 2L),
    row.names = colnames(beta_t1),
    stringsAsFactors = FALSE # only necessary for R < 4.0!
)

(I never use foreach so I’m not exactly sure how to use it here, but the above should roughly work; otherwise try mclapply from the ‘parallel’ package, which does the same using the familiar syntax of lapply.)
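
For illustration, here is a minimal sketch of the mclapply variant. The simulated `data` and `groups`, the helper name `run_test`, and the core count are assumptions made so that the example runs on its own; they stand in for the objects from the question.

```r
library(parallel)

# Simulated stand-ins for the question's `data` and `groups` (assumptions).
set.seed(1)
data = as.data.frame(matrix(rnorm(20 * 50), nrow = 20))
groups = rep(c("a", "b"), each = 10)
g = as.factor(groups)  # convert once, outside the loop

# Hypothetical helper: one test per column, returning a plain
# numeric vector of length 2 instead of a data.frame.
run_test = function(col) {
    w = wilcox.test(col ~ g, exact = FALSE)
    c(statistic = unname(w$statistic), p.value = w$p.value)
}

# mclapply relies on forking, which is unavailable on Windows;
# there, mc.cores must be 1. The value 2 is an arbitrary choice.
n_cores = if (.Platform$OS.type == "windows") 1L else 2L
res = mclapply(data, run_test, mc.cores = n_cores)

# Assemble the data.frame once, after all tests have finished.
results = data.frame(
    cg = colnames(data),
    statistic = vapply(res, `[[`, double(1L), "statistic"),
    p.value = vapply(res, `[[`, double(1L), "p.value"),
    row.names = colnames(data)
)
```

With the default mc.preschedule = TRUE, mclapply first divides the columns into at most as many jobs as there are cores, so each worker already processes one large chunk rather than one column at a time. Whether this beats the single-core loop on the real data would need to be measured.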

Upvotes: 2
