Reputation: 322
I am trying to merge a large data.frame with a small one, and to parallelise the computation. The code below works perfectly, maximising all cores of my machine:
library(doParallel)  # also loads foreach and parallel (detectCores, makeCluster)

len <- 2000000
set.seed(666)
# create a vector of strings that are 3 characters long (letter, digit, letter)
dat <- paste(sample(letters, len, replace = TRUE),
             sample(0:9, len, replace = TRUE),
             sample(letters, len, replace = TRUE), sep = '')
head(dat)
set.seed(777)
num <- sample(0:9, len, replace = TRUE)
bigDF <- data.frame(dat = dat, num = num)
smallDF <- data.frame(num = 0:9, caps = toupper(letters[1:10]))

startP <- 1
chunk <- 10000
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
# merge each 10,000-row chunk of bigDF with smallDF in parallel
mergedList <- foreach(i = 0:(len/chunk - 1)) %dopar% {
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}
stopCluster(cl)
Once I change the vector dat to contain strings that are 6 characters long, parallelism breaks down, and although there is no error or warning, only one core contributes to the computation:
len <- 2000000
set.seed(666)
# create a vector of strings that are 6 characters long (letter, digit, then four letters)
dat <- paste(sample(letters, len, replace = TRUE),
             sample(0:9, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE), sep = '')
head(dat)
set.seed(777)
num <- sample(0:9, len, replace = TRUE)
bigDF <- data.frame(dat = dat, num = num)
smallDF <- data.frame(num = 0:9, caps = toupper(letters[1:10]))

startP <- 1
chunk <- 10000
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
mergedList <- foreach(i = 0:(len/chunk - 1)) %dopar% {
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}
stopCluster(cl)
Why this inconsistency, and how can one work around it? In this particular example the code works if one indexes dat to integers, but indexing is not the answer in all cases. Why would the length of the strings matter at all to the number of cores utilised?
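By "indexing dat to integers" I mean something like the following sketch (illustrative only):
lev <- unique(dat)            # one entry per distinct string
bigDF$dat <- match(dat, lev)  # replace the strings by integer codes
# run the foreach/%dopar% merge as above, then map the codes back per chunk:
# mergedList <- lapply(mergedList, function(d) { d$dat <- lev[d$dat]; d })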
Upvotes: 4
Views: 889
Reputation: 21502
Not an answer yet, but: if I run your code using %do% so as not to parallelise, I get identical (successful) results for the two cases, except of course for the dat strings. Same if I run the short strings with %dopar% and the long strings with %do%.
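For concreteness, the sequential check is the same loop body with only the operator swapped (a sketch; compare its output against the %dopar% result):
mergedListSeq <- foreach(i = 0:(len/chunk - 1)) %do% {  # %do% runs sequentially
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}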
This is beginning to look like a subtle bug in one of the supporting packages, so you might want to ping the developers on this one.
Update 29 Sept: I ran what I believe is the same setup, but using clusterMap:
dffunc <- function(i, bigDF, smallDF, startP, chunk) {
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}
# pass dffunc directly so MoreArgs supplies its remaining arguments
# (cl is the cluster created with makeCluster above)
clusmerge <- clusterMap(cl, dffunc, 0:(len/chunk - 1),
                        MoreArgs = list(bigDF = bigDF, smallDF = smallDF,
                                        startP = startP, chunk = chunk))
And in this case I get all the nodes up and running regardless of the length of the dat strings. I'm back to suspecting there's some bug in %dopar%, or elsewhere in the foreach package.
As a side note, may I recommend against doing
nodes <- detectCores()
cl <- makeCluster(nodes)
as that can hang your entire machine. Better: cl <- makeCluster(nodes - 1) :-)
Upvotes: 2
Reputation: 19677
I believe the difference is that in the first case, the first column of "bigDF" is a factor with 6,760 levels, while in the second case it has 1,983,234 levels. Having a huge number of levels can cause a number of performance problems. When I created "bigDF" with stringsAsFactors=FALSE, the performance was much better:
bigDF <- data.frame(dat = dat, num = num, stringsAsFactors = FALSE)
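You can see the difference directly by counting the distinct strings (the counts follow from the seeds in the question):
length(unique(dat))   # 6,760 in the 3-character case, ~1,983,234 in the 6-character case
is.factor(bigDF$dat)  # FALSE once stringsAsFactors = FALSE is used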
I also used the isplitRows function from the itertools package to avoid sending all of "bigDF" to each of the workers:
library(itertools)
mergedList <- foreach(splitDF = isplitRows(bigDF, chunkSize = chunk)) %dopar% {
  merge(splitDF, smallDF, by = 'num', all.x = TRUE)
}
On my 6-core Linux machine running R 3.1.1, your second example ran in about 332 seconds. When I used stringsAsFactors=FALSE, it ran in about 50 seconds. When I also used isplitRows, the time went down to 5.5 seconds, or about 60 times faster than your second example.
Upvotes: 4