Pietro

Reputation: 13182

What happens if an MPI process crashes?

I am evaluating different multiprocessing libraries for a fault-tolerant application. Basically, I need any process to be allowed to crash without stopping the whole application.

I can do it using the fork() system call. The limitation is that the new process can only be created on the same machine.
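
For reference, the fork() approach can be sketched roughly as follows. This is only a minimal illustration, assuming a POSIX system; `worker` and `supervise` are hypothetical names, and the worker simulates a crash by killing itself on its first attempt.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker: "crashes" on its first attempt, succeeds afterwards. */
static int worker(int attempt)
{
    if (attempt == 0)
        raise(SIGKILL);   /* simulate a crash: the child is killed here */
    return 0;             /* normal completion */
}

/* Run work() in a child process and restart it whenever it dies abnormally.
 * Returns the number of restarts that were needed, or -1 if fork/waitpid
 * fails or the child still fails after max_restarts restarts. */
int supervise(int (*work)(int), int max_restarts)
{
    for (int attempt = 0; attempt <= max_restarts; attempt++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;                 /* fork itself failed */
        if (pid == 0)
            _exit(work(attempt));      /* child: run the work, then exit */

        int status = 0;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return attempt;            /* child finished cleanly */

        /* The parent survives the child's crash and simply tries again. */
        fprintf(stderr, "child %ld failed, restarting\n", (long)pid);
    }
    return -1;
}
```

Calling `supervise(worker, 3)` from `main` restarts the worker once and returns 1; the parent is never affected by the child's crash.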

Can I do the same with MPI? If a process created with MPI crashes, can the parent process keep running and eventually create a new process?

Is there any alternative (possibly multiplatform and open source) library to get the same result?


As reported here, MPI 4.0 will have support for fault tolerance.

Upvotes: 1

Views: 1604

Answers (2)

Wesley Bland

Reputation: 9072

If you're willing to go for something research quality, there are two implementations of a potential fault tolerance chapter for a future version of MPI (MPI-4?). The proposal is called User Level Failure Mitigation. There's an experimental version in MPICH 3.2a2 and a branch of Open MPI that also provides the interfaces. Both are far from production quality, but you're welcome to try them out. Just know that since this isn't in the MPI Standard, the function prefixes are not MPI_*. For MPICH, they're MPIX_*; for the Open MPI branch, they're OMPI_* (though I believe they'll be changing theirs to MPIX_* soon as well).

As Rob Latham mentioned, there will be lots of work you'll need to do within your app to handle failures, though you don't necessarily have to check all of your return codes. You can/should use MPI error handlers as callback functions to simplify things. Information and examples are in the spec, which is available along with the Open MPI branch.

Upvotes: 2

Rob Latham

Reputation: 5223

If you want collectives, you're going to have to wait for MPI-3.something (as High Performance Mark and Hristo Iliev suggest).

If you can live with point-to-point, and you are a patient person willing to raise a bunch of bug reports against your MPI implementation, you can try the following:

  • disable the default MPI error handler
  • carefully check every single return code from your MPI calls
  • keep track in your application of which ranks are up and which are down. Oh, and when they go down they can never come back. But you're unable to use collectives anyway (see my opening statement), so that's not a huge deal, right?

Here's an old paper (from back when Bill still worked at Argonne; I think it's from 2003): http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf . It lays out the kinds of fault-tolerant things one can do in MPI. Perhaps such a "constrained MPI" might still work for your needs.

Upvotes: 3
