Reputation: 13113
I have been able to run 20 models simultaneously using an r6a.48xlarge Amazon Web Services instance (192 vCPUs, 1536.00 GiB memory) and this R code:
setwd('/home/ubuntu/')
library(doParallel)

detectCores()
my.AWS.n.cores <- detectCores()
my.AWS.n.cores <- my.AWS.n.cores - 92   # drop from 192 to 100 cores
my.AWS.n.cores
registerDoParallel(my.cluster <- makeCluster(my.AWS.n.cores))

folderName <- 'model000222'
files <- list.files(folderName, full.names = TRUE)

start.time <- Sys.time()
foreach(file = files, .errorhandling = "remove") %dopar% {
  source(file)
}
stopCluster(my.cluster)
end.time <- Sys.time()
total.time.c <- end.time - start.time
total.time.c
However, the above R code did not run until I reduced the number of cores from 192 to 100 with this line:
my.AWS.n.cores <- my.AWS.n.cores - 92
If I tried running the code with all 192 vCPUs, or with 187 vCPUs, I got this error message:
> my.AWS.n.cores <- detectCores()
> my.AWS.n.cores <- my.AWS.n.cores - 5
> my.AWS.n.cores
[1] 187
>
> registerDoParallel(my.cluster <- makeCluster(my.AWS.n.cores))
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
all connections are in use
Calls: registerDoParallel ... makePSOCKcluster -> newPSOCKnode -> socketConnection
I had never seen that error message and could not locate it with an internet search. Could someone explain this error message? I do not know why my solution worked or whether a better solution exists. Can I easily determine the maximum number of connections I can use without getting this error? I suppose I could run the code incrementing the number of cores from 100 to 187.
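For example, a brute-force probe along those lines might look like this (a rough sketch; the 100-to-187 range comes from my numbers above, and it assumes nothing else is opening connections while it runs):
library(parallel)
# Keep opening and closing ever-larger clusters until makeCluster() fails.
max.workers <- NA
for (n in 100:187) {
  cl <- tryCatch(makeCluster(n), error = function(e) NULL)
  if (is.null(cl)) break    # makeCluster() failed: ran out of connections at size n
  stopCluster(cl)           # release the sockets before the next attempt
  max.workers <- n
}
max.workers                 # largest cluster size that still started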
I installed R on this instance with the lines below in PuTTY. R could not be located on the instance until I used the last line below, apt install r-base-core.
sudo su
echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"
sudo apt-get update
sudo apt-get install r-base
sudo apt install dos2unix
apt install r-base-core
I used this AMI:
Ubuntu Server 18.04 LTS (HVM), SSD Volume Type
EDIT
Apparently, R has a hardwired limit of 128 connections, and you can increase the number of PSOCK workers manually if you are willing to rebuild R from source, but I have not found an answer showing how to do that. Ideally I can find an answer showing how to do that with Ubuntu and AWS. See also these previous related questions.
Errors in makeCluster(multicore): cannot open the connection
Is there a limit on the number of slaves that R snow can create?
Upvotes: 2
Views: 1266
Reputation: 6805
Each parallel PSOCK worker consumes one R connection. As of R 4.2.1, R is hard-coded to support only 128 open connections at any time. Three of those connections are always in use (stdin, stdout, and stderr), leaving you with 125 to play with.
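You can see those three standard connections, plus anything else currently open, with base R's showConnections(); for example (output varies by session):
showConnections(all = TRUE)          # lists stdin, stdout, stderr, and any other open connections
nrow(showConnections(all = TRUE))    # how many connections are open right now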
To increase this limit, you have to update the constant:
#define NCONNECTIONS 128
in src/main/connections.c, and then re-build R from source. FWIW, I've verified that it works with at least 16,384 on Ubuntu 16.04 (https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28#issuecomment-231603035).
People have reported on this before, and the problem has been raised on R-devel several times over the years. The last time the limit was increased was in R 2.4.0 (October 2006), when it went from 50 to 128. See https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28 for more details and discussion. I think it's worth bumping this topic again on R-devel; as people get access to more cores, more people will run into this problem.
The parallelly package provides two functions, availableConnections() and freeConnections(), for querying the current R installation for the number of connections available and free. See https://parallelly.futureverse.org/reference/availableConnections.html for details and examples.
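For example (the exact numbers depend on your R build and on what is already open):
parallelly::availableConnections()  # total connections this R build supports, e.g. 128
parallelly::freeConnections()       # connections currently free, e.g. 125 in a fresh session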
FYI, if you use parallelly::makeClusterPSOCK(n) instead of parallel::makeCluster(n), you'll get a more informative error message, and much sooner, e.g.
> cl <- parallelly::makeClusterPSOCK(192)
Error: Cannot create 192 parallel PSOCK nodes. Each node
needs one connection but there are only 124 connections left
out of the maximum 128 available on this R installation
You can avoid relying on R connections for local parallel processing by using the callr package under the hood. The easiest way to achieve this is to use doFuture in combination with future.callr. In your example, that would be:
library(doFuture)
library(future.callr)
registerDoFuture()
plan(callr, workers = parallelly::availableCores(omit = 5))
...
With this setup, the parallel workers are launched via callr (which operates without R connections). Each parallel task is launched in a separate callr process and when the task completes, the parallel worker is terminated. Because the parallel workers are not reused, there is an extra overhead paid for using the callr backend, but if your parallel tasks are long enough, that should still be a minor part of the processing time.
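For example, the loop from your question could run on the callr backend like this (a sketch reusing folderName and files from your code; workers = availableCores(omit = 5) just mirrors the idea of leaving a few cores free):
library(doFuture)
library(future.callr)
registerDoFuture()
plan(callr, workers = parallelly::availableCores(omit = 5))

folderName <- 'model000222'
files <- list.files(folderName, full.names = TRUE)
foreach(file = files, .errorhandling = "remove") %dopar% {
  source(file)   # each task runs in its own transient callr process
}
# no stopCluster() needed: there is no socket cluster to shut down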
Upvotes: 7