Mihai
Mihai

Reputation: 2927

Why are the workers failing to connect when calling `makePSOCKcluster` from an external `R` process?

I am experiencing a problem already raised before here and here, but in a different context.

Suppose I have an R script file called psock.R that contains the following code:

cat("Setup cluster...\n")
cluster <- parallel::makePSOCKcluster(
    rep("localhost", 2),
    master = "localhost",
    port = 11234,
    manual = FALSE,
    outfile = ""
)

cat("Sleep...\n")
Sys.sleep(1)

cat("Teardown cluster...\n")
parallel::stopCluster(cluster)

When I run the script via Rscript --vanilla psock.R, everything works as expected, and I see:

Setup cluster...
starting worker pid=11557 on localhost:11234 at 13:55:52.818
starting worker pid=11556 on localhost:11234 at 13:55:52.818
Sleep...
Teardown cluster...

However, when I try to do the same from within an external R process, the parallel::makePSOCKcluster hangs. For example, suppose psock.R now contains the following code instead:

# Create a session.
session <- callr::r_session$new()

cat("Setup cluster...\n")
session$run(function() {
    # Create a cluster in the `.GlobalEnv`.
    cluster <<- parallel::makePSOCKcluster(
        rep("localhost", 2),
        master = "localhost",
        port = 11234,
        manual = FALSE,
        outfile = ""
    )
})

cat("Sleep...\n")
Sys.sleep(5)

cat("Teardown cluster...\n")
session$run(function() {
    # Stop it.
    parallel::stopCluster(cluster)
})

# Close session.
session$close()

Running psock.R now hangs at Setup cluster... for several minutes and then it exits with an error, i.e.:

Cluster setup failed. 1 worker of 2 failed to connect.

Furthermore, running htop --filter /exec/R confirms that the first worker manages to connect, but the second one doesn't. More specifically, I see the following processes (i.e., numbers added by me):

(1) └─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
(2)    └─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore
(3) ├─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=parallel

where (1) is the psock.R script call, (2) is the external process created by callr::r_session$new(), and (3) is the first worker spawned by parallel::makePSOCKcluster that manages to connect.

Following Steve Weston's answers here and here, I tried to do some debugging.

I started by adjusting psock.R to write the output on the external process to a file and enabling manual mode by setting manual = TRUE, i.e.:

session$run(function() {
    # Connection.
    connection <- file("/some/path/log.txt", open = "wt")

    # Write anything to a log file.
    sink(connection, append = TRUE)
    sink(connection, append = TRUE, type = "message")

    # Create a cluster in the `.GlobalEnv`.
    cluster <<- parallel::makePSOCKcluster(
        rep("localhost", 2),
        master = "localhost",
        port = 11234,
        manual = TRUE,
        outfile = ""
    )
})

Running Rscript --vanilla psock.R after the code change above logs the following command to log.txt:

Manually start worker on localhost with
    '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)()' MASTER=localhost PORT=11234 OUT='' SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential

So far, no workers have been created, as can be seen from the output of htop --filter /exec/R:

└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
   └─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore

Now, I can manually run the command above, twice, to create the workers and connect to the main (i.e., external) process. This results in the following output in the terminal, confirming that the workers have been created and connected:

starting worker pid=13065 on localhost:11234 at 14:59:38.185
starting worker pid=13201 on localhost:11234 at 14:59:38.185

I can also verify this via htop --filter /exec/R, which now shows the following:

└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
   └─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential

At this point I am confused because I would expect something to fail. The only explanation I can think of is that the process I created to start the PSOCK cluster behaves differently than manually running that command in the terminal. Perhaps different permissions or missing environment variables? Is this to be expected?

I also noticed in the command printed by the manual mode that SETUPSTRATEGY=sequential is used. However, when manual = FALSE, the htop process shows that SETUPSTRATEGY=parallel is employed. This is, indeed, in line with the documentation of parallel::makePSOCKcluster which reads:

If "parallel" (default) workers will be started in parallel during cluster setup when this is possible, which is now for homogeneous "PSOCK" clusters with all workers started automatically (manual = FALSE) on the local machine. Workers will be started sequentially on other clusters, on all clusters with setup_strategy = "sequential" and on R 3.6.0 and older.

Out of curiosity, I tried running the command printed by the manual model, with the parallel strategy instead (i.e., SETUPSTRATEGY=parallel). After running it only once, the script resumes execution and stops at the Sys.sleep(5) part. This is also reflected in the htop process list, where I can only see one worker.

Everything works as expected if I disable the manual mode and add setup_strategy = "sequential" in the parallel::makePSOCKcluster function call. However, I really want to know why it fails with a parallel setup strategy. Furthermore, the code runs fine on a Windows machine, but hangs just the same on a Debian-based system.

In terms of what I'm running, this is what my R session looks like:

sessionInfo()

# R version 4.2.1 (2022-06-23)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS Monterey 12.6
#
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base
#
# loaded via a namespace (and not attached):
# [1] compiler_4.2.1 cli_3.3.0      jsonlite_1.8.0 rlang_1.0.5

And, since other answers have indicated host-related issues, my /etc/hosts looks like:

───────┼─────────────────────────────────────────────────────────
   1   │ ##
   2   │ # Host Database
   3   │ #
   4   │ # localhost is used to configure the loopback interface
   5   │ # when the system is booting.  Do not change this entry.
   6   │ ##
   7   │ 127.0.0.1       localhost
   8   │ 255.255.255.255 broadcasthost
   9   │ ::1             localhost
───────┴─────────────────────────────────────────────────────────

Also, my .Rprofile is empty.

Finally, not sure if this is relevant, but when running netstat -i, the values in the Address column for my network interfaces are in the format of IPv6, or rather IPv4-mapped.

Before I succumb to setup_strategy = "sequential", do you have any idea what is going on?


Edit 1.

Following HenrikB's suggestion, I replaced the parallel::makePSOCKcluster call with parallelly::makeClusterPSOCK and before the call I enabled the debug logs options(parallelly.debug = TRUE). The code now looks like this:

session$run(function() {
    # Enable logging for `parallelly`.
    options(parallelly.debug = TRUE)

    # Connection.
    connection <- file("/some/path/log.txt", open = "wt")

    # Write anything to a log file.
    sink(connection, append = TRUE)
    sink(connection, append = TRUE, type = "message")

    # Create a cluster in the `.GlobalEnv`.
    cluster <<- parallelly::makeClusterPSOCK(
        rep("localhost", 2),
        master = "localhost",
        port = 11234,
        manual = FALSE,
        outfile = ""
    )
})

The output in log.txt contains the following:

[09:27:19.936] Set package option ‘parallelly.availableCores.methods’
[09:27:19.941] Environment variable ‘R_PARALLELLY_AVAILABLECORES_METHODS’ not set
[09:27:19.941] Set package option ‘parallelly.availableCores.fallback’
[09:27:19.941] Environment variable ‘R_PARALLELLY_AVAILABLECORES_FALLBACK’ not set
[09:27:19.941] Set package option ‘parallelly.availableCores.min’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_MIN’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.system’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_SYSTEM’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.logical’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_LOGICAL’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.omit’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_OMIT’ not set
[09:27:19.942] Set package option ‘parallelly.availableWorkers.methods’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLEWORKERS_METHODS’ not set
[09:27:19.942] Set package option ‘parallelly.fork.enable’
[09:27:19.942] Environment variable ‘R_PARALLELLY_FORK_ENABLE’ not set
[09:27:19.942] Set package option ‘parallelly.supportsMulticore.unstable’
[09:27:19.942] Environment variable ‘R_PARALLELLY_SUPPORTSMULTICORE_UNSTABLE’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.setup_strategy’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SETUP_STRATEGY’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.validate’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_VALIDATE’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.connectTimeout’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_CONNECTTIMEOUT’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.timeout’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TIMEOUT’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.useXDR’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_USEXDR’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.socketOptions’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SOCKETOPTIONS’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.rshcmd’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSHCMD’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.rshopts’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSHOPTS’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.tries’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TRIES’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.tries.delay’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TRIES_DELAY’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.rscript_label’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSCRIPT_LABEL’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.sessionInfo.pkgs’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SESSIONINFO_PKGS’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.autoKill’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_AUTOKILL’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.master.localhost.hostname’
[09:27:19.945] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_MASTER_LOCALHOST_HOSTNAME’ not set
[09:27:19.945] Set package option ‘parallelly.makeNodePSOCK.port.increment’
[09:27:19.945] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_PORT_INCREMENT’ not set
[09:27:19.945] parallelly-specific environment variables:
[09:27:19.961] [local output] Workers: [n = 2] ‘localhost’, ‘localhost’
[09:27:19.962] [local output] Base port: 11234
[09:27:19.962] [local output] Getting setup options for 2 cluster nodes ...
[09:27:19.963] [local output]  - Node 1 of 2 ...
[09:27:19.963] [local output] localMachine=TRUE => revtunnel=FALSE

[09:27:19.963] Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")'’
[09:27:19.978] - Possible to infer worker's PID: FALSE
[09:27:19.978] [local output] Rscript port: 11234

[09:27:19.978] [local output]  - Node 2 of 2 ...
[09:27:19.979] [local output] localMachine=TRUE => revtunnel=FALSE

[09:27:19.979] Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")'’
[09:27:19.995] - Possible to infer worker's PID: FALSE
[09:27:19.995] [local output] Rscript port: 11234

[09:27:19.995] [local output] Getting setup options for 2 cluster nodes ... done
[09:27:19.995] [local output]  - Parallel setup requested for some PSOCK nodes
[09:27:19.996] [local output] Setting up PSOCK nodes in parallel
[09:27:19.996] List of 20
[09:27:19.996]  $ local_cmd     : chr "'/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,st"| __truncated__
[09:27:19.996]  $ worker        : chr "localhost"
[09:27:19.996]   ..- attr(*, "localhost")= logi TRUE
[09:27:19.996]  $ rank          : int 1
[09:27:19.996]  $ rshlogfile    : NULL
[09:27:19.996]  $ port          : int 11234
[09:27:19.996]  $ connectTimeout: num 120
[09:27:19.996]  $ timeout       : num 2592000
[09:27:19.996]  $ useXDR        : logi FALSE
[09:27:19.996]  $ pidfile       : chr "/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid"
[09:27:19.996]  $ setup_strategy: chr "parallel"
[09:27:19.996]  $ outfile       : chr ""
[09:27:19.996]  $ rshcmd_label  : NULL
[09:27:19.996]  $ rsh_call      : NULL
[09:27:19.996]  $ cmd           : chr "'/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,st"| __truncated__
[09:27:19.996]  $ localMachine  : logi TRUE
[09:27:19.996]  $ manual        : logi FALSE
[09:27:19.996]  $ dryrun        : logi FALSE
[09:27:19.996]  $ quiet         : logi FALSE
[09:27:19.996]  $ rshcmd        : NULL
[09:27:19.996]  $ revtunnel     : logi FALSE
[09:27:19.996]  - attr(*, "class")= chr [1:2] "makeNodePSOCKOptions" "makeNodeOptions"
[09:27:20.001] [local output] System call to launch all workers:
[09:27:20.001] [local output] '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11234 OUT= TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=parallel
[09:27:20.001] [local output] Starting PSOCK main server
[09:27:20.003] [local output] Workers launched
[09:27:20.003] [local output] Waiting for workers to connect back
[09:27:20.003] [local output] 0 workers out of 2 ready
[09:27:20.120] [local output] 0 workers out of 2 ready
[09:27:20.121] [local output] 1 workers out of 2 ready
[09:29:20.126] [local output] 1 workers out of 2 ready
[09:31:20.133] [local output] 1 workers out of 2 ready
Error in parallelly::makeClusterPSOCK(rep("localhost", 2), master = "localhost",  : 
  Cluster setup failed. 1 worker of 2 failed to connect.
In addition: Warning messages:
1: In system(test_cmd, intern = TRUE, input = input) :
  running command ''/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")'' had status 2
2: In system(test_cmd, intern = TRUE, input = input) :
  running command ''/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")'' had status 2

It looks like the Rscript command has an exit status of 2, which from the documentation sounds like something cryptic happened:

Error status 2 is used for R 'suicide', that is a catastrophic failure, and other small numbers are used by specific ports for initialization failures.

Upvotes: 1

Views: 951

Answers (1)

HenrikB
HenrikB

Reputation: 6805

UPDATE 2022-10-10: The underlying problem was due to a bug in R itself, which has been fixed in R-devel r83051 (2022-10-10) https://github.com/wch/r-source/commit/97b3dfb71aeff4a6acb72d400bb1fba8e6b2ed37. Not sure, but I suspect this fix will also make it into R v4.2.2 (end of October 2022).

With a fixed version of R, OP's example works also with processx 3.7.0 currently on CRAN.

UPDATE 2022-10-09:

The problem is related to the processx package and R itself requiring a working standard input (stdin) stream. Specifically, callr (when using processx <= 3.7.0) marks all file descriptors as close-on-exec and system2() assumes that stdin (etc.) will be inherited in the child process, instead of explicitly duplicating it. This caused system() to fail, cf. https://github.com/r-lib/callr/issues/236. Because of this, the processx package (>= 3.7.0-9000) was updated to not close the standard streams (stdin, stdout, and stderr) by default. This solved the problem reported by OP (confirmed by OP on 2022-10-08T21:05:46 UTC).

Until the fixed version of processx package is on CRAN, the developers version can be installed as:

remotes::install_github("r-lib/processx")

Thanks to your updated debug output(*) from parallelly::makeClusterPSOCK(), I've narrowed it in to what right now looks like a bug in callr. I've reported this upstream at https://github.com/r-lib/callr/issues/236, which also illustrates the core of the problem.

(*) The exact thing that caught my eye was:

[09:27:19.963] Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")'’
[09:27:19.978] - Possible to infer worker's PID: FALSE

and more precisely, that it failed to infer the worker's PID when called with a background callr process. If you do the same in the main R session, you'll get:

[09:27:19.978] - Possible to infer worker's PID: TRUE

Upvotes: 2

Related Questions