Reputation: 828
I want to run a series of approximately 1,000,000 wilcox.test calls in R:
result <- foreach(i = 1:ncol(data), .combine=bind_rows, .multicombine= TRUE, .maxcombine = 1000 ) %do% {
  w = wilcox.test(data[,i]~as.factor(groups),exact = FALSE)
  df <- data.frame(Characters=character(),
                   Doubles=double(),
                   Doubles=double(),
                   stringsAsFactors=FALSE)
  df[1,] = c(colnames(data)[i], w$statistic, w$p.value)
  rownames(df) = colnames(beta_t1)[i]
  colnames(df) = c("cg", "statistic", "p.value")
  return(df)
}
With %dopar% on 15 cores, this is slower than single-core %do%. I suspect a memory access problem, and the processors are barely utilized either. Is it possible to split the data data frame into chunks, have each processor compute about 100K tests, and then combine the results? How can I speed up this foreach loop?
Upvotes: 0
Views: 601
Reputation: 546213
One thing that’s immediately striking is that you use eight lines to create and return a data.frame where a single expression is sufficient:
data.frame(
  cg = colnames(data)[i],
  statistic = w$statistic,
  p.value = w$p.value,
  row.names = colnames(beta_t1)[i],
  stringsAsFactors = FALSE
)
The bigger issue, however, is that after the loop has run, foreach has to row-bind all these data.frames, and that operation is slow. It’s more efficient to return a list of the p-values and statistics and forget about the row and column names (these can be provided afterwards, and then don’t require subsetting and re-concatenation).
That is, change your code to
result = foreach(col = data) %do% {
  w = wilcox.test(col ~ as.factor(groups), exact = FALSE)
  list(w$statistic, w$p.value)
}
# Combine the per-column lists in `result` into a single data.frame:
results = data.frame(
  cg = colnames(data),
  statistic = vapply(result, `[[`, double(1L), 1L),
  p.value = vapply(result, `[[`, double(1L), 2L),
  row.names = colnames(beta_t1),
  stringsAsFactors = FALSE # only necessary for R < 4.0!
)
(I never use foreach, so I’m not exactly sure how to use it here, but the above should roughly work; otherwise try mclapply from the ‘parallel’ package, which does the same, just using the familiar syntax of lapply.)
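To make the mclapply suggestion and the chunking idea from the question concrete, here is a minimal sketch. It is untested against the real data: the toy `data` and `groups` objects, the helper name `run_chunk`, and the choice of two workers are all assumptions made purely for illustration. Each worker receives one contiguous block of columns, so only a handful of large results have to be combined at the end rather than a million tiny ones.

```r
library(parallel)

# Toy stand-ins for the question's objects (assumptions, not the real data).
set.seed(1)
data <- as.data.frame(matrix(rnorm(20 * 50), nrow = 20))
groups <- rep(c("a", "b"), each = 10)

# mclapply relies on forking, which is unavailable on Windows;
# there it only accepts mc.cores = 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One contiguous block of column indices per worker.
chunks <- splitIndices(ncol(data), n_cores)

# Hypothetical helper: runs the test for every column in one chunk and
# returns a 2 x length(idx) matrix with rows "statistic" and "p.value".
run_chunk <- function(idx) {
  vapply(data[idx], function(col) {
    w <- wilcox.test(col ~ as.factor(groups), exact = FALSE)
    c(statistic = unname(w$statistic), p.value = w$p.value)
  }, double(2L))
}

parts <- mclapply(chunks, run_chunk, mc.cores = n_cores)
stats <- do.call(cbind, parts)  # only n_cores pieces to combine

result <- data.frame(
  cg = colnames(data),
  statistic = stats["statistic", ],
  p.value = stats["p.value", ],
  stringsAsFactors = FALSE
)
```

With a million columns you would set `n_cores` to the number of workers you actually want. Whether this beats the single-core loop will still depend on how expensive it is to make the data available to each worker.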
Upvotes: 2