Reputation: 51
An automated system launches an R script that uses makeCluster to open a cluster of 35 nodes on a machine with 36 CPUs (an AWS c4.8xlarge running up-to-date Ubuntu and R).
library(parallel)
n.nodes <- 35
cl <- makeCluster(n.nodes,
                  outfile = "debug.txt")
The following error, as written to debug.txt, appears on a somewhat regular basis:
starting worker pid=2017 on localhost:11823 at 21:15:57.390
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", :
cannot open the connection
Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> socketConnection
In addition: Warning message:
In socketConnection(master, port = port, blocking = TRUE, open = "a+b", :
localhost:11823 cannot be opened
Execution halted
The pid and port number are session specific. The program fails to proceed when this error is encountered.
Question 1: Are there error handling methods that will recognize this has happened and try to make the cluster again?
Note: The following does not work:
library(R.utils)                              # provides evalWithTimeout()
attempt <- 0
while (dim(showConnections())[1] < n.nodes && attempt <= 25) { # 25 chances to create n.nodes connections
  print(attempt)
  closeAllConnections()                       # Close any open connections
  portnum <- round(runif(1, 11000, 11998))    # Randomly choose a port
  tryCatch({                                  # Try to create the cluster
    evalWithTimeout({
      cl <- makeCluster(n.nodes,
                        outfile = "debug.txt",
                        port = portnum)
    }, timeout = 120)                         # Give it two minutes and then stop trying
  }, TimeoutException = function(x) {         # If it fails, print the portnum it tried
    print(paste("Failed to Create Cluster", portnum))
  })
  attempt <- attempt + 1                      # Update attempt
  Sys.sleep(2)                                # Take a breather
}
Question 2: If there is not a way to automatically retry making the cluster, is there a way to check if the port can be opened before attempting to run makeCluster?
Note: This system must be fully automated/self contained. It must recognize an error, handle/fix the problem, and then proceed without manual intervention.
Upvotes: 0
Views: 1026
Reputation: 6805
parallel::makeCluster(), or parallel::makePSOCKcluster(), which it uses internally here, does not provide any automatic retry. If you look at the code of parallel::makePSOCKcluster(), you can implement your own version based on parallel:::newPSOCKnode(), which sets up each individual worker. That's an internal function, so it should be considered a "hack".
In the future package (I'm the author), there is future::makeClusterPSOCK() with its companion future::makeNodePSOCK(); both are part of the public API. They provide you with building blocks to write your own improved version. Also, you can write your own function myCreateNode() that sets up a cluster node with retries, and pass it as cl <- future::makeClusterPSOCK(..., makeNode = myCreateNode). Sorry, that's all I have time for right now.
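For illustration, here is a minimal sketch of what such a myCreateNode() could look like. The retry() helper, its max_tries and wait arguments, and their defaults are hypothetical names made up for this example; only future::makeClusterPSOCK(), future::makeNodePSOCK() and the makeNode argument come from the answer above. Depending on the installed version, these functions may live in the parallelly package instead. The sketch also assumes that a failed worker launch surfaces as an R error on the master; if the master simply blocks, tryCatch() has nothing to catch.

```r
# Hypothetical helper: call create() until it succeeds or max_tries is used up.
retry <- function(create, max_tries = 3L, wait = 2) {
  for (attempt in seq_len(max_tries)) {
    # Turn an error into a value so the loop can inspect it
    result <- tryCatch(create(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_tries, conditionMessage(result)))
    if (attempt < max_tries) Sys.sleep(wait)  # pause before the next try
  }
  stop("Giving up after ", max_tries, " attempts: ", conditionMessage(result))
}

# Node constructor with retries; every argument is passed on unchanged
# to future::makeNodePSOCK() (assumed to be available, see above).
myCreateNode <- function(...) {
  retry(function() future::makeNodePSOCK(...))
}

# Usage (not run here, since it launches 35 local R workers):
# cl <- future::makeClusterPSOCK(35L, makeNode = myCreateNode)
```

Keeping the retry logic in a separate helper means it can be exercised with a dummy function that fails a couple of times, without starting any workers.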
Upvotes: 2