Colin T Bowers

Reputation: 18580

Julia doesn't like it when I add and remove processes without doing any parallel processing

UPDATE: Confirmed as a bug. For more detail, see the link and details provided by @ViralBShah below.

Julia throws a strange error when I add and remove processes (addprocs and rmprocs), but only if I don't do any parallel processing in between. Consider the following example code:

#Set parameters
numCore = 4;

#Add workers
print("Adding workers... ");
addprocs(numCore - 1);
println(string(string(numCore-1), " workers added."));

#Report number of processes (master plus workers)
println(string("Number of processes detected = ", string(nprocs())));

# Do some stuff (COMMENTED OUT)
# XLst = {rand(10, 1) for i in 1:8};
# XMean = pmap(mean, XLst);

#Remove the additional workers
print("Removing workers... ");
rmprocs(workers());
println("Done.");
println("Subroutine complete.");

Note that I've commented out the only code that actually does any parallel processing (the call to pmap). If I run this code on my machine (Julia 0.2.1, Ubuntu 14.04), I get the following output in the console:

Adding workers... 3 workers added.
Number of processes detected = 4
Removing workers... Done.
Subroutine complete.
fatal error on 
In  [86]: fatal error on 88: ERROR: 87: ERROR: connect: connection refused (ECONNREFUSED)
 in yield at multi.jl:1540
connect: connection refused (ECONNREFUSED) in wait at task.jl:117
 in wait_connected at stream.jl:263
 in connect at stream.jl:878
 in Worker at multi.jl:108
 in anonymous at task.jl:876

 in yield at multi.jl:1540
 in wait at task.jl:117
 in wait_connected at stream.jl:263
 in connect at stream.jl:878
 in Worker at multi.jl:108
 in anonymous at task.jl:876

The first four lines are printed by my program, and seem to indicate that it runs to completion. But then I get a fatal error. Any ideas?

The most interesting thing about this error is that if I uncomment the call to pmap (i.e. if I actually do some parallel processing), the fatal error goes away.

Upvotes: 0

Answers (1)

ViralBShah

Reputation: 307

This issue is being tracked at https://github.com/JuliaLang/julia/issues/7646. I reproduce Amit Murthy's explanation from that thread here:

  1. pid 1 does an addprocs(3)
  2. addprocs returns after it has established connections with all 3 new workers.
  3. However, at this time the connections between the workers may not have been set up yet, i.e. from pids 3 -> 2, 4 -> 2 and 4 -> 3.
  4. Now pid 1 calls rmprocs(workers()), i.e. it removes pids 2, 3 and 4.
  5. As pid 2 exits, the connection attempt from pid 4 to pid 2 results in an error.
  6. Since the output of pid 4 is redirected to the stdout of pid 1, we see that error printed there. The system is still in a consistent state, even though the error messages suggest something is amiss.
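
Given the sequence above, and the observation in the question that the message disappears once the pmap call is uncommented, one possible stopgap is to do a small amount of parallel work before removing the workers. The following is only a sketch for the Julia 0.2.x API used in the question (addprocs, pmap, rmprocs, workers). It assumes that the extra work gives the workers enough time to finish connecting to each other, which is a timing effect and not a documented guarantee.

```julia
# Sketch for the Julia 0.2.x API used in the question.
numCore = 4

# Add workers
addprocs(numCore - 1)

# Do a small amount of parallel work before removing the workers.
# Assumption: this delays rmprocs long enough for the worker-to-worker
# connections to be established; nothing here enforces that.
XLst = [rand(10, 1) for i in 1:8]
XMean = pmap(mean, XLst)

# Remove the additional workers
rmprocs(workers())
println("Workers removed after ", length(XMean), " tasks.")
```

Since the system stays in a consistent state either way, this mainly keeps the console output clean; the error can also simply be ignored until the linked issue is resolved.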

Upvotes: 4
