user1642513

R parallel cluster worker process never returns

I am using the doParallel package to parallelize jobs across multiple Linux machines, creating the cluster with the following syntax:

library(doParallel)  # attaches foreach and parallel as dependencies

cl <- makePSOCKcluster(machines, outfile = '', master = system('hostname -i', intern = TRUE))
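
For context, the cluster is then registered and the simulations dispatched roughly like this (n_sims and run_one_simulation below are simplified placeholders, not my actual code):

registerDoParallel(cl)

results <- foreach(i = 1:n_sims, .combine = rbind) %dopar% {
  run_one_simulation(i)   # placeholder for one small simulation job
}

stopCluster(cl)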

Typically each job takes less than 10 minutes to run on a single machine. However, sometimes one worker process 'runs away' and keeps running for hours, never returning to the main driver process. I can see the process in top, but it appears to be stuck rather than doing any real work. The outfile = '' option doesn't produce anything useful, since the worker process never actually fails.

This happens fairly often but completely at random; sometimes I can simply restart the jobs and they finish fine, so I cannot provide a reproducible example. Does anyone have general suggestions on how to investigate this issue, or what to look for the next time it happens?

EDIT:

Adding more details in response to the comments. I am running thousands of small simulations on 10 machines; I/O and memory usage are both minimal. The runaway worker appears on different machines at random, with no obvious pattern and not necessarily on the busiest ones. I don't have permission to view the system logs, but based on the CPU/RAM history there doesn't seem to be anything unusual.

It happens frequently enough that it's fairly easy to catch a runaway process in action. top shows the process at close to 100% CPU in state R, so it is definitely running rather than waiting on something. But each simulation should only take minutes, yet the runaway worker just keeps running non-stop.

So far doParallel is the only package I have tried. I am exploring other options, but it's hard to make an informed decision without knowing the cause.

Upvotes: 4

Views: 919

Answers (1)

Steve Weston

Reputation: 19677

This kind of problem is not uncommon on large compute clusters. Although the hung worker process may not produce any error message, you should check the system logs on the node where the worker was executed to see if any system problem has been reported. There could be disk or memory errors, or the system might have run low on memory. If a node is having problems, your problem could be solved by simply not using that node.
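
For example, assuming machines is the character vector of host names passed to makePSOCKcluster, you could simply drop a suspect node before creating the cluster (the node name below is hypothetical):

bad_nodes <- c("node07")                   # hypothetical name of the misbehaving machine
machines  <- setdiff(machines, bad_nodes)  # exclude it from the cluster
cl <- makePSOCKcluster(machines, outfile = '', master = system('hostname -i', intern = TRUE))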

This is one of the reasons that batch queueing systems are useful. Jobs that take too long can be killed and automatically resubmitted. Unfortunately, they often rerun the job on the same bad node, so it's important to detect bad nodes and prevent the scheduler from using them for subsequent jobs.

You might want to consider adding checkpointing capabilities to your program. Unfortunately, that is generally difficult, and especially difficult using the doParallel backend since there is no checkpointing capability in the parallel package. You might want to investigate the doRedis backend, since I believe the author was interested in supporting certain fault tolerance capabilities.
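
As a rough sketch of the idea (this isn't a built-in feature of parallel or doParallel; n_sims and run_one_simulation are placeholders for your own code, and it assumes the working directory is on a filesystem shared by all nodes), each task could write its result to disk and be skipped on a rerun if its output already exists:

dir.create("checkpoints", showWarnings = FALSE)

results <- foreach(i = 1:n_sims) %dopar% {
  out_file <- file.path("checkpoints", sprintf("sim_%05d.rds", i))
  if (file.exists(out_file)) {
    readRDS(out_file)              # reuse the result from a previous run
  } else {
    res <- run_one_simulation(i)   # placeholder for the real per-task work
    saveRDS(res, out_file)
    res
  }
}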

Finally, if you actually catch a hung worker in the act, you should gather as much information about it as possible using "ps" or "top". The process state is important, since it could help you determine whether the process is stuck trying to perform I/O, for example. Even better, you could try attaching gdb to it and getting a backtrace to see what it is actually doing.
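
For example, you could record each worker's host name and PID right after creating the cluster (using the cl object from your question), so that when a worker hangs you know exactly which process to inspect or attach gdb to:

worker_info <- clusterCall(cl, function() {
  list(host = Sys.info()[["nodename"]], pid = Sys.getpid())
})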

Upvotes: 1
