What happens if an MPI process crashes?

Question

I am evaluating multiprocessing libraries for a fault-tolerant application. I need any process to be able to crash without stopping the whole application.

I can do this with the fork() system call. The limitation is that the new process can only be created on the same machine.
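For reference, the fork() approach can be sketched roughly as follows. This is a minimal illustration, not the actual application: `worker` and `supervise` are hypothetical names, and the worker simulates a crash with abort() on its first two attempts. The parent detects the abnormal exit through waitpid() and simply forks a replacement.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker: crashes on its first two attempts, then succeeds. */
static void worker(int attempt) {
    if (attempt < 2) {
        abort();                /* simulate a crash (raises SIGABRT) */
    }
    _exit(0);                   /* normal termination */
}

/* Run the worker under supervision, forking a replacement after each crash.
 * Returns the number of crashes survived before a clean exit, or -1 if
 * fork()/waitpid() failed or max_attempts was exhausted. */
int supervise(int max_attempts) {
    int crashes = 0;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        pid_t pid = fork();
        if (pid < 0) {
            return -1;          /* could not create a child */
        }
        if (pid == 0) {
            worker(attempt);    /* child: never returns */
            _exit(1);
        }
        int status;
        if (waitpid(pid, &status, 0) < 0) {
            return -1;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            return crashes;     /* child finished cleanly */
        }
        /* Child died abnormally; the parent is unaffected and retries. */
        fprintf(stderr, "worker %ld crashed, respawning\n", (long)pid);
        crashes++;
    }
    return -1;
}
```

Calling `supervise(5)` from the parent would keep the application alive across both simulated crashes, but every child still lives on the local machine.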

Can I do the same with MPI? If a process created with MPI crashes, can the parent process keep running and, if needed, create a new process?

Is there any alternative (possibly multiplatform and open source) library to get the same result?

As reported here, MPI 4.0 will have support for fault tolerance.

Rob Latham · Accepted Answer

If you want collectives, you're going to have to wait for MPI-3.something (as High Performance Mark and Hristo Iliev suggest).

If you can live with point-to-point, and you are a patient person willing to raise a bunch of bug reports against your MPI implementation, you can try the following:

Disable the default MPI error handler.
Carefully check every single return code from the MPI calls in your program.
Keep track in your application of which ranks are up and which are down. Oh, and when they go down they can never come back. But you're unable to use collectives anyway (see my opening statement), so that's not a huge deal, right?

Here's an old paper (from back when Bill still worked at Argonne; I think it's from 2003): http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf . It lays out the kinds of fault-tolerant things one can do in MPI. Perhaps such a "constrained MPI" might still work for your needs.
