Reputation: 895
I use MPICH2. When I launch processes with mpiexec, the failure of one process crashes all the other processes. How can I avoid this?
Upvotes: 4
Views: 361
Reputation: 9072
In MPICH, there is a flag called -disable-auto-cleanup, which prevents the process manager from automatically cleaning up all processes when a single process fails.
However, MPI itself does not have much support for fault tolerance. The Fault Tolerance Working Group is working on adding it in a future version of the MPI Standard.
For now, the best you can do is change the default MPI error handler from MPI_ERRORS_ARE_FATAL, which causes all processes to abort, to something else such as MPI_ERRORS_RETURN, which returns the error code to the application and allows it to do something else. However, you're not likely to be able to communicate anymore after a failure has occurred, especially if you are trying to use collective communication.
Upvotes: 4