dodolong

Reputation: 895

MPICH2, the failure of one process will crash all other processes

I use MPICH2. When I launch processes with mpiexec, the failure of one process crashes all the other processes. How can I avoid this?

Upvotes: 4

Views: 361

Answers (1)

Wesley Bland

Reputation: 9072

In MPICH, mpiexec accepts a flag called -disable-auto-cleanup, which prevents the process manager from automatically cleaning up all processes when a single process fails.

However, MPI itself does not have much support for fault tolerance. The Fault Tolerance Working Group is working on adding it in a future version of the MPI Standard.

For now, the best you can do is change the default MPI error handler from MPI_ERRORS_ARE_FATAL, which causes all processes to abort, to something else such as MPI_ERRORS_RETURN, which returns the error code to the application and lets it do something else. However, you're not likely to be able to communicate anymore after a failure has occurred, especially if you are trying to use collective communication.

Upvotes: 4
