user1325068

Reputation: 51

makeCluster cannot open the connection -- error handling strategies?

An automated system launches an R script that uses makeCluster to open a cluster of 35 nodes on a machine with 36 CPUs (an AWS c4.8xlarge running up-to-date Ubuntu and R).

library(parallel) # provides makeCluster()

n.nodes <- 35
cl <- makeCluster(n.nodes,
                  outfile = "debug.txt")

The following error, as written to debug.txt, appears fairly regularly:

    starting worker pid=2017 on localhost:11823 at 21:15:57.390
    Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
    cannot open the connection
    Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> socketConnection
    In addition: Warning message:
    In socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
    localhost:11823 cannot be opened
    Execution halted

The pid and port number are session-specific. The program does not proceed once this error is encountered.

Question 1: Are there error handling methods that will recognize this has happened and try to make the cluster again?

Note: The following does not work

attempt <- 0
while (dim(showConnections())[1] < n.nodes && attempt <= 25) { # 25 chances to create n.nodes connections
  print(attempt)
  closeAllConnections() # Close any open connections
  portnum <- round(runif(1, 11000, 11998)) # Randomly choose a port
  tryCatch({ # Try to create the cluster
    evalWithTimeout({ # evalWithTimeout() comes from the R.utils package
      cl <- makeCluster(n.nodes,
                        outfile = "debug.txt",
                        port = portnum)
    }, timeout = 120) # Give it two minutes and then stop trying
  }, TimeoutException = function(x) {
    print(paste("Failed to Create Cluster", portnum)) # If it fails, print the portnum it tried
  })
  attempt <- attempt + 1 # Update attempt
  Sys.sleep(2) # Take a breather
}

Question 2: If there is not a way to automatically retry making the cluster, is there a way to check if the port can be opened before attempting to run makeCluster?
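One possible way to approach Question 2 is to try binding a listening socket to the port and closing it again right away. The sketch below assumes R >= 4.0.0, where base R provides serverSocket(); the helper names port_is_free() and find_free_port() are made up for illustration. The check is inherently racy: another process could grab the port between the check and the makeCluster() call, so it can reduce failures but not rule them out.

```r
# Hypothetical helper: TRUE if a listening socket can currently be bound to
# `port`. Assumes R >= 4.0.0, which introduced serverSocket().
port_is_free <- function(port) {
  tryCatch({
    srv <- suppressWarnings(serverSocket(port)) # errors if the port is taken
    close(srv)                                  # release it again immediately
    TRUE
  }, error = function(e) FALSE)
}

# Hypothetical helper: probe randomly chosen candidate ports until one is free.
# 11000:11999 is the range from which parallel picks its default port.
find_free_port <- function(candidates = 11000:11999, max_checks = 50L) {
  k <- min(max_checks, length(candidates))
  for (port in candidates[sample.int(length(candidates), k)]) {
    if (port_is_free(port)) return(port)
  }
  stop("no free port found among the candidates that were checked")
}
```

A call such as makeCluster(n.nodes, outfile = "debug.txt", port = find_free_port()) would then start from a port that was free a moment earlier.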

Note: This system must be fully automated/self contained. It must recognize an error, handle/fix the problem, and then proceed without manual intervention.

Upvotes: 0

Views: 1026

Answers (1)

HenrikB

Reputation: 6805

parallel::makeCluster(), or rather parallel::makePSOCKcluster(), which it uses internally here, does not provide any automatic retry. If you look at the code of parallel::makePSOCKcluster(), you could implement your own version based on parallel:::newPSOCKnode(), which sets up each individual worker. That's an internal function, so this should be considered a "hack".
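Short of reimplementing makePSOCKcluster(), a coarser option is to retry the whole makeCluster() call on a fresh port each time. The following is only a sketch: make_cluster_retry() and its arguments are hypothetical names, and it assumes the failure surfaces as an R error on the master. If makeCluster() blocks instead of failing, which may depend on the R version, this loop will not help on its own.

```r
library(parallel)

# Hypothetical helper: retry cluster creation, picking a new random port for
# every attempt. `make` is only a parameter so the loop can be exercised
# without launching real workers.
make_cluster_retry <- function(n, tries = 5L, delay = 2,
                               make = parallel::makeCluster, ...) {
  for (attempt in seq_len(tries)) {
    port <- sample(11000:11999, 1L) # same range parallel uses by default
    cl <- tryCatch(
      make(n, port = port, ...),
      error = function(e) {
        message(sprintf("Attempt %d on port %d failed: %s",
                        attempt, port, conditionMessage(e)))
        NULL
      }
    )
    if (!is.null(cl)) return(cl)
    if (attempt < tries) Sys.sleep(delay) # brief pause before the next try
  }
  stop(sprintf("could not create a cluster of %d nodes after %d attempts",
               n, tries))
}
```

In the setup from the question this would be called as cl <- make_cluster_retry(35, outfile = "debug.txt"), with the extra arguments passed through to makeCluster().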

In the future package (I'm the author), there is future::makeClusterPSOCK() with its companion future::makeNodePSOCK(); both are part of the public API. They give you the building blocks to write your own improved version. You can also write your own function myCreateNode() that sets up a cluster node with retries and pass it as cl <- future::makeClusterPSOCK(..., makeNode = myCreateNode). Sorry, that's all I have time for right now.
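A minimal sketch of what such a myCreateNode() could look like follows. The factory retrying_node() and its max_tries and wait arguments are hypothetical names, and the sketch assumes that a failed node launch raises an R error in the calling session. As far as I know, newer releases provide makeClusterPSOCK() and makeNodePSOCK() through the parallelly package, so the namespace prefix may need adjusting.

```r
# Hypothetical factory: wrap any node constructor so that errors trigger a
# bounded number of retries with a pause in between.
retrying_node <- function(make_node, max_tries = 3L, wait = 2) {
  force(make_node)
  function(...) {
    last_error <- NULL
    for (i in seq_len(max_tries)) {
      node <- tryCatch(
        make_node(...),
        error = function(e) {
          last_error <<- e # remember why this attempt failed
          NULL
        }
      )
      if (!is.null(node)) return(node)
      if (i < max_tries) Sys.sleep(wait)
    }
    stop("node setup failed after ", max_tries, " attempts: ",
         conditionMessage(last_error))
  }
}

if (FALSE) { # not run: this part launches real workers
  myCreateNode <- retrying_node(future::makeNodePSOCK)
  cl <- future::makeClusterPSOCK(35L, makeNode = myCreateNode)
}
```

Retrying per node rather than per cluster means one failed worker does not throw away the workers that already connected.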

Upvotes: 2
