How to handle MPI error when program crashes

Question

Almost all MPI routines return an error code. However, a communication error usually crashes the program at the point where the MPI routine is called, which makes the error code useless. Is there a way to catch the error in such a case? Alternatively, how can we prevent the program from crashing when a catastrophic error happens, so that we can catch the error?

Gilles · Accepted Answer

How MPI functions behave on error has changed slightly across versions of the standard. It used to be managed with the MPI_Errhandler_{get|set|create}() functions (deprecated in MPI 2.0 and removed in MPI 3.0).
It is now managed through the MPI_{Comm|Win|File}_{get|set|create}_errhandler() functions, which allow much finer control: error handlers are attached separately to communicators, windows, and files.

There are two predefined error handlers that all MPI libraries provide (implementations may offer additional ones):

MPI_ERRORS_ARE_FATAL which aborts the entire program whenever an error occurs within an associated MPI call; and
MPI_ERRORS_RETURN which simply returns from the associated MPI call upon error, with the corresponding error code.

By default, all MPI calls except those associated with input/output abort the program on error. Conversely, MPI-IO calls normally return with the corresponding error code. Strictly speaking, the standard is a bit less prescriptive and says:

By default, communication errors are fatal -- MPI_ERRORS_ARE_FATAL is the default error handler associated with MPI_COMM_WORLD. I/O errors are usually less catastrophic (e.g., "file not found") than communication errors, and common practice is to catch these errors and continue executing.

So, to answer your questions plainly: if you want to prevent the code from crashing on error, catch the errors, and implement some contingency procedure, you have essentially two options:

Ad hoc solution: set the error handler to MPI_ERRORS_RETURN for the communicator, file, or window you want, and check the error code returned by each associated MPI call. You will then have to take action based on the exact error returned each time, bearing in mind that once an error has occurred inside an MPI call, there is no guarantee that any further MPI call will succeed. Indeed, there is every chance that a subsequent call to MPI will crash.
More elaborate: create a custom error handler that prints any extra details you might want to see, or takes other useful actions, before either returning or aborting. You can create several such handlers and associate them selectively with the communicators, windows, or files you want. If you are coding in C++, you could even create your own exception classes and throw them from the handler.

But again, the fact that no MPI call is guaranteed to succeed after a first error has been encountered within the library greatly limits what can be done. Most of the time, the default behavior is perfectly adequate and can be left untouched.
