Mark Miller

'all connections are in use' with parallel processing on AWS

I have been able to run 20 models simultaneously using an r6a.48xlarge Amazon Web Services instance (192 vCPUs, 1536 GiB of memory) and this R code:

setwd('/home/ubuntu/')

library(doParallel)

# how many vCPUs does this instance report?
detectCores()

my.AWS.n.cores <- detectCores()
my.AWS.n.cores <- my.AWS.n.cores - 92   # drop from 192 cores to 100 (see below)
my.AWS.n.cores

# create a PSOCK cluster and register it as the doParallel backend
registerDoParallel(my.cluster <- makeCluster(my.AWS.n.cores))

folderName <- 'model000222'

files <- list.files(folderName, full.names = TRUE)

start.time <- Sys.time()

# run each model script on its own worker; drop results from scripts that fail
foreach(file = files, .errorhandling = "remove") %dopar% {
  source(file)
}

stopCluster(my.cluster)

end.time <- Sys.time()
total.time.c <- end.time - start.time
total.time.c

However, the above R code did not run until I reduced the number of cores from 192 to 100 with this line:

my.AWS.n.cores <- my.AWS.n.cores - 92

If I tried running the code with all 192 vCPUs, or with 187 vCPUs, I got this error message:

> my.AWS.n.cores <- detectCores()
> my.AWS.n.cores <- my.AWS.n.cores - 5
> my.AWS.n.cores
[1] 187
> 
> registerDoParallel(my.cluster <- makeCluster(my.AWS.n.cores))
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  : 
  all connections are in use
Calls: registerDoParallel ... makePSOCKcluster -> newPSOCKnode -> socketConnection

I had never seen that error message and could not locate it with an internet search. Could someone explain it? I do not know why my workaround worked or whether a better solution exists. Can I easily determine the maximum number of connections I can use without triggering this error? I suppose I could rerun the code, incrementing the number of cores from 100 up toward 187.
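For what it's worth, one way I could check this empirically without launching any workers would be to open throwaway connections until R refuses and count them. This is only a rough sketch; probe.free.connections is just a name I made up, not an existing function:

# count how many more connections this session can open by opening
# textConnection()s until the "all connections are in use" error
# appears, then closing them all again
probe.free.connections <- function() {
  cons <- list()
  on.exit(lapply(cons, close), add = TRUE)  # always release the slots
  repeat {
    con <- tryCatch(textConnection("x"), error = function(e) NULL)
    if (is.null(con)) break                 # hit the connection limit
    cons[[length(cons) + 1]] <- con
  }
  length(cons)
}
probe.free.connections()   # presumably ~125 in a fresh session on a stock build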

I installed R on this instance with the lines below in PuTTY. R could not be located on the instance until I ran the last line below: apt install r-base-core.

sudo su
# append the CRAN repository to APT's source list
echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list
sudo apt-get update
sudo apt-get install r-base
sudo apt install dos2unix
apt install r-base-core

I used this AMI:

Ubuntu Server 18.04 LTS (HVM), SSD Volume Type 

EDIT

Apparently, R has a hard-wired limit of 128 connections, and you can increase the number of PSOCK workers if you are willing to rebuild R from source, but I have not found an answer showing how to do that; ideally the answer would cover Ubuntu and AWS. See also these related questions:

Errors in makeCluster(multicore): cannot open the connection

Is there a limit on the number of slaves that R snow can create?


Answers (1)

HenrikB

Explanation

Each parallel PSOCK worker consumes one R connection. As of R 4.2.1, R is hard-coded to support only 128 open connections at any time. Three of those connections are always in use (stdin, stdout, and stderr), leaving you with 125 to play with.
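You can see those three in a fresh R session:

> showConnections(all = TRUE)[, "description"]
        0         1         2 
  "stdin"  "stdout"  "stderr" 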

To increase this limit, you have to update the constant:

#define NCONNECTIONS 128

in src/main/connections.c, and then re-build R from source. FWIW, I've verified that it works with at least 16,384 connections on Ubuntu 16.04 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28#issuecomment-231603035).
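On Ubuntu, the rebuild could look roughly like this. This is only a sketch; the R version, the new limit of 1024, and the build-dep step are assumptions, not something verified on your exact AMI:

sudo apt-get build-dep r-base   # needs a deb-src line in sources.list
wget https://cran.r-project.org/src/base/R-4/R-4.2.1.tar.gz
tar xzf R-4.2.1.tar.gz && cd R-4.2.1
# raise the hard-coded limit, e.g. from 128 to 1024
sed -i 's/#define NCONNECTIONS 128/#define NCONNECTIONS 1024/' src/main/connections.c
./configure
make -j"$(nproc)"
sudo make install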

People have reported on this before, and the problem has been raised on R-devel several times over the years. The last time the limit was increased was in R 2.4.0 (October 2006), when it was raised from 50 to 128. See https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28 for more details and discussions. I think it's worth bumping this topic again on R-devel. As people get access to more cores, more people will run into this problem.

The parallelly package provides two functions, availableConnections() and freeConnections(), for querying the current R installation for the number of connections available and free. See https://parallelly.futureverse.org/reference/availableConnections.html for details and examples.
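For example, in a fresh session on a stock build you should see something like:

> parallelly::availableConnections()
[1] 128
> parallelly::freeConnections()
[1] 125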

FYI, if you use parallelly::makeClusterPSOCK(n) instead of parallel::makeCluster(n), you'll get a more informative error message, and much sooner, e.g.

> cl <- parallelly::makeClusterPSOCK(192)
Error: Cannot create 192 parallel PSOCK nodes. Each node
needs one connection but there are only 124 connections left
out of the maximum 128 available on this R installation

Workaround

You can avoid relying on R connections for local parallel processing by using the callr package under the hood. The easiest way to achieve this is to use doFuture in combination with future.callr. In your example, that would be:

library(doFuture)
library(future.callr)

registerDoFuture()
plan(callr, workers = parallelly::availableCores(omit = 5))

...

With this setup, the parallel workers are launched via callr, which operates without R connections. Each parallel task is launched in a separate callr process, and when the task completes, the parallel worker is terminated. Because the parallel workers are not reused, there is extra overhead to using the callr backend, but if your parallel tasks are long enough, that should still be a minor part of the processing time.
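Concretely, with that setup in place, the foreach loop from your question should run unchanged, e.g. (using the folder name from your question):

files <- list.files('model000222', full.names = TRUE)
foreach(file = files, .errorhandling = "remove") %dopar% {
  source(file)
}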

