mrip

Reputation: 15163

Performance of clusterApply deteriorates when called inside a function

I've come across a strange issue with clusterApply, which I've isolated as best I can as follows. First, I run the following code from the global environment:

require(parallel)
cl<-makeCluster(rep("localhost",20),"SOCK")
xl<-list()
for(i in 1:20)
  xl[[i]]<-crossprod(matrix(rnorm(1e6),1000,1000))
x<-xl
clusterExport(cl,"x",environment())
f0<-function(z) eigen(x[[z]])
system.time(clusterApply(cl,1:20,f0))
##    user  system elapsed 
##   0.332   0.264   3.334 

Now, to make sure nothing weird is going on, restart R, and now run this similar code, which calls clusterApply from inside a function:

require(parallel)
cl<-makeCluster(rep("localhost",20),"SOCK")
xl<-list()
for(i in 1:20)
  xl[[i]]<-crossprod(matrix(rnorm(1e6),1000,1000))
f<-function(clust,x){
  force(x)
  clusterExport(clust,"x",environment())
  f0<-function(z) eigen(x[[z]])
  print(system.time(clusterApply(clust,1:20,f0)))
}
f(cl,xl)
##   user  system elapsed 
##  5.212   1.888  13.627 

I did some searching and found this answer to a related question, which points out that local variables used in functions which are not defined in the global environment are exported to the cluster. So I thought, maybe the problem is that x is getting exported twice, and that's what's taking a long time, not the actual function call. To test this I changed the function definition to:

f0<-function(z) eigen(get("x")[[z]])

and I still got the slow performance. Does anyone know what might be going on here?

Incidentally, if I just call

clusterApply(clust,x,eigen)

inside the function, then it works fine, just as fast as if it were in the global environment. Of course, if this were the actual problem I'm trying to solve, I'd simply do that. It's not; this is just a toy problem to isolate the issue I'm having with other, more complicated code.

Upvotes: 5

Views: 1372

Answers (1)

Steve Weston

Reputation: 19677

Your performance is indeed hurt because the variable x is being sent along with the f0 function in every task. Changing the way that f0 refers to x makes no difference: the problem has nothing to do with how f0 uses x or whether it refers to x at all. It has to do with where f0 itself is defined. If you defined it outside of f such that the associated environment of f0 was the global environment, then your problem would be fixed.
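
The effect can be seen without a cluster by comparing serialized sizes directly. This is only an illustrative sketch: ser_size and make_closure are hypothetical names, and it assumes that the output of serialize() is a reasonable proxy for what is sent to a worker with each task.

```r
# Hypothetical helper: bytes needed to serialize an object,
# used here as a proxy for what is sent to a worker.
ser_size <- function(obj) length(serialize(obj, NULL))

# Returns a function whose enclosing environment contains x,
# even though the function body never mentions x.
make_closure <- function(x) {
  force(x)
  function(z) z + 1
}

big <- rnorm(1e6)                 # about 8 MB of doubles
g_local <- make_closure(big)      # environment: the call frame of make_closure

g_global <- g_local
environment(g_global) <- globalenv()   # same code, global environment

size_local  <- ser_size(g_local)   # includes x along with the function
size_global <- ser_size(g_global)  # function code only
print(c(local = size_local, global = size_global))
```

The first size should be on the order of the data itself, while the second should be tiny by comparison, even though both functions have identical bodies.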

If you want to define f0 inside f, you can fix it by modifying the environment of f0 after you've defined it:

f<-function(clust,x){
  force(x)
  clusterExport(clust,"x",environment())
  f0<-function(z) eigen(x[[z]])
  environment(f0) <- .GlobalEnv
  print(system.time(clusterApply(clust,1:20,f0)))
}

This fixes the problem because the global environment is never serialized along with functions.
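
One consequence worth keeping in mind: once the environment of f0 is the global environment, x is looked up there, so on a worker the function relies on the copy that clusterExport placed in the worker's global environment. Below is a minimal single-process sketch of that lookup rule. The name make_f0 is hypothetical, and the assign() call is only a stand-in for what clusterExport does on each worker.

```r
# Hypothetical factory mirroring the fix: f0 is created inside a function,
# then re-parented to the global environment.
make_f0 <- function(x) {
  force(x)
  f0 <- function(z) x[[z]]
  environment(f0) <- globalenv()
  f0
}

f0 <- make_f0(list("a", "b"))

# Assumes no variable named x exists in the global environment yet:
# the local x from make_f0 is no longer visible to f0.
before <- tryCatch(f0(2), error = function(e) "x not found")

# Stand-in for what clusterExport does on each worker.
assign("x", list("a", "b"), envir = globalenv())
after <- f0(2)

rm("x", envir = globalenv())  # clean up the illustration
```

In the cluster setting, this is why the clusterExport call in f has to stay in place when the environment of f0 is reset.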

The reason that clusterApply(clust,x,eigen) works well is that eigen is not defined in f. Its environment is the base namespace, which is serialized by reference rather than by value, so x is not captured when eigen is serialized.

Upvotes: 6
