Reputation: 322
I am trying to merge a large data.frame with a small one, and to parallelise the computation. The code below works perfectly, maximising all cores of my machine:
library(doParallel)  # also loads foreach and parallel (detectCores, makeCluster)

len <- 2000000
set.seed(666)
# create a vector of strings that are 3 characters long (letter, digit, letter)
dat <- paste(sample(letters, len, replace = TRUE),
             sample(0:9, len, replace = TRUE),
             sample(letters, len, replace = TRUE), sep = '')
head(dat)
set.seed(777)
num <- sample(0:9, len, replace = TRUE)
bigDF <- data.frame(dat = dat, num = num)
smallDF <- data.frame(num = 0:9, caps = toupper(letters[1:10]))

startP <- 1
chunk <- 10000
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
# merge each 10,000-row chunk of bigDF with smallDF in parallel
mergedList <- foreach(i = 0:(len/chunk - 1)) %dopar% {
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}
stopCluster(cl)
Once I change the vector dat to contain strings that are 6 characters long, parallelism breaks down, and although there is no error or warning, only one core contributes to the computation:
len <- 2000000
set.seed(666)
# create a vector of strings that are 6 characters long (letter, digit, then four letters)
dat <- paste(sample(letters, len, replace = TRUE),
             sample(0:9, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE), sep = '')
head(dat)
set.seed(777)
num <- sample(0:9, len, replace = TRUE)
bigDF <- data.frame(dat = dat, num = num)
smallDF <- data.frame(num = 0:9, caps = toupper(letters[1:10]))

startP <- 1
chunk <- 10000
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
mergedList <- foreach(i = 0:(len/chunk - 1)) %dopar% {
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}
stopCluster(cl)
Why this inconsistency, and how can one work around it? In this particular example the code works if one indexes dat to integers, but indexing is not the answer in all cases. Why would the length of the strings matter at all to the number of cores utilised?
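By "indexing dat to integers" I mean something like the following sketch (illustrative only):
lev <- unique(dat)            # one entry per distinct string
bigDF$dat <- match(dat, lev)  # replace the strings by integer codes
# run the foreach/%dopar% merge as above, then map the codes back per chunk:
# mergedList <- lapply(mergedList, function(d) { d$dat <- lev[d$dat]; d })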
Upvotes: 4
Views: 889
Reputation: 21502
Not an answer yet, but: if I run your code using %do% so as not to parallelise, I get identical (successful) results for the two cases, except of course for the dat strings. Same if I run the short strings with %dopar% and the long strings with %do%.
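For concreteness, the sequential check is the same loop body with only the operator swapped (a sketch; compare its output against the %dopar% result):
mergedListSeq <- foreach(i = 0:(len/chunk - 1)) %do% {  # %do% runs sequentially
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}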
This is beginning to look like a subtle bug in one of the supporting packages, so you might want to ping the developers on this one.
Update 29 Sept: I ran what I believe is the same setup, but using clusterMap:
dffunc <- function(i, bigDF, smallDF, startP, chunk) {
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}
# pass dffunc directly so MoreArgs supplies its remaining arguments
# (cl is the cluster created with makeCluster above)
clusmerge <- clusterMap(cl, dffunc, 0:(len/chunk - 1),
                        MoreArgs = list(bigDF = bigDF, smallDF = smallDF,
                                        startP = startP, chunk = chunk))
And in this case I get all the nodes up and running regardless of the length of the dat strings. I'm back to suspecting there's some bug in %dopar%, or elsewhere in the foreach package.
As a side note, may I recommend against doing
nodes <- detectCores()
cl <- makeCluster(nodes)
as that can hang your entire machine. Better: cl <- makeCluster(nodes - 1) :-)
Upvotes: 2
Reputation: 19677
I believe the difference is that in the first case, the first column of "bigDF" is a factor with 6,760 levels, while in the second case it has 1,983,234 levels. Having a huge number of levels can cause a number of performance problems. When I created "bigDF" with stringsAsFactors=FALSE, the performance was much better:
bigDF <- data.frame(dat = dat, num = num, stringsAsFactors = FALSE)
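You can see the difference directly by counting the distinct strings (the counts follow from the seeds in the question):
length(unique(dat))   # 6,760 in the 3-character case, ~1,983,234 in the 6-character case
is.factor(bigDF$dat)  # FALSE once stringsAsFactors = FALSE is used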
I also used the isplitRows function from the itertools package to avoid sending all of "bigDF" to each of the workers:
library(itertools)
mergedList <- foreach(splitDF = isplitRows(bigDF, chunkSize = chunk)) %dopar% {
  merge(splitDF, smallDF, by = 'num', all.x = TRUE)
}
On my 6-core Linux machine running R 3.1.1, your second example ran in about 332 seconds. When I used stringsAsFactors=FALSE, it ran in about 50 seconds. When I also used isplitRows, the time went down to 5.5 seconds, or about 60 times faster than your second example.
Upvotes: 4