Gabriel

Reputation: 9432

Handling Signals in an MPI Application / Gracefully exit

How can signals be handled safely in an MPI application (for example SIGUSR1, which should tell the application that its runtime has expired and that it should terminate within the next 10 minutes)? I have several constraints:

How can this be achieved safely, with no deadlocks while trying to exit, and with the current context left properly by jumping back to main() and calling MPI_FINALIZE()? Somehow the processes have to agree on exiting (I think this is the same in multithreaded applications), but how is that done efficiently without having to communicate too much? Is anybody aware of a standard way of doing this properly?

Below are some thoughts which might or might not work:

Idea 1:
Let's say that each process catches the signal in a signal handler, pushes it onto an "unhandled signals stack" (USS), and simply returns from the handler. We then have certain termination points in our application, especially before and after I/O operations, which handle all signals in the USS. If there is a SIGUSR1 in the USS, for example, each process would then exit at a termination point.
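A minimal sketch of the handler side of Idea 1 in C++. It assumes that one flag per signal stands in for the "unhandled signals stack", because a write to a `volatile std::sig_atomic_t` is the only update that is portably safe inside a handler, while pushing onto a general container is not. The names `on_sigusr1`, `install_handler`, and `termination_requested` are hypothetical.

```cpp
#include <csignal>

// Written by the handler, read at termination points.
volatile std::sig_atomic_t g_sigusr1_pending = 0;

// The handler only records the signal and returns.
extern "C" void on_sigusr1(int) {
    g_sigusr1_pending = 1;
}

// Call once at startup (hypothetical helper).
void install_handler() {
    std::signal(SIGUSR1, on_sigusr1);
}

// Call at a termination point, e.g. before and after I/O.
bool termination_requested() {
    return g_sigusr1_pending != 0;
}
```

Note that this only records the signal locally. Whether every process actually sees the signal, and at which termination point, is not guaranteed by this sketch.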

Idea 2:
Only the master process 0 catches the signal in the signal handler and then sends a broadcast message, "all processes exit!", at a specific point in the application. All processes receive the broadcast and throw an exception, which is caught in main, where MPI_FINALIZE is called.
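The unwinding part of Idea 2 could look roughly like this in C++. This is only a sketch: the exception type `TerminationRequested` and the helpers `termination_point` and `run_guarded` are hypothetical, the result of the broadcast is passed in as a plain `bool`, and the place where MPI_FINALIZE would be called is marked in a comment rather than shown.

```cpp
#include <exception>
#include <functional>

// Hypothetical exception type used only to unwind back to the top level.
struct TerminationRequested : std::exception {
    const char* what() const noexcept override {
        return "termination requested";
    }
};

// Termination point: `exit_agreed` stands for the value every rank received
// from the broadcast, so all ranks throw (or continue) together.
void termination_point(bool exit_agreed) {
    if (exit_agreed) {
        throw TerminationRequested{};
    }
}

// Runs the application body and reports whether it ended early.
// In main(), MPI_Finalize would be called after this returns in either case.
bool run_guarded(const std::function<void()>& body) {
    try {
        body();
        return false;           // ran to completion
    } catch (const TerminationRequested&) {
        return true;            // unwound from a termination point
    }
}
```

Because the exception propagates through normal stack unwinding, destructors of local objects still run on the way back to main.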

Thanks a lot!

Upvotes: 2

Views: 1593

Answers (2)

Wesley Bland

Reputation: 9062

Using signals in your MPI application in general is not safe. Some implementations may support it and others may not.

For instance, in MPICH, SIGUSR1 is used by the process manager for internal notification of abnormal failures.

http://lists.mpich.org/pipermail/discuss/2014-October/003242.html

Open MPI, on the other hand, will forward SIGUSR1 and SIGUSR2 from mpiexec to the other processes.

http://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php#sect14

Other implementations will differ. So before you go too far down this route, make sure that the implementation you're using can deal with it.

Upvotes: 1

If your goal is to stop all processes at the same point, then there is no way around always synchronizing at the possible termination points. That is, a collective call at the termination points is required.

Of course, you can try to avoid an extra broadcast by using the synchronization of another collective call to ensure proper termination, or piggyback the termination information on an existing broadcast, but I don't think that's worth it. After all, you only need to synchronize before I/O and at least once per ten minutes. At such a frequency, even a broadcast is not a performance problem.
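A termination point of this kind could be sketched in C++ as follows, under a few assumptions. The collective is injected as a callable `global_or` so the logic stays visible without an MPI setup; in an MPI program it would typically wrap `MPI_Allreduce` with `MPI_LOR`, as noted in the comment. A logical-OR reduction is used here instead of a broadcast so that a signal delivered to any rank counts. `TerminationCheck` and its members are hypothetical names. The check is keyed on the step counter rather than on wall-clock time, because every rank has to enter the collective at the same points.

```cpp
#include <functional>
#include <utility>

// Hypothetical helper: agrees on termination every `interval` steps.
class TerminationCheck {
public:
    // `global_or` must return the logical OR of `local` over all ranks.
    // With MPI this could be implemented as, for example:
    //   int in = local, out = 0;
    //   MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_LOR, MPI_COMM_WORLD);
    //   return out != 0;
    TerminationCheck(long interval, std::function<bool(bool)> global_or)
        : interval_(interval), global_or_(std::move(global_or)) {}

    // Call once per step on every rank, with the rank's local signal flag.
    // Returns true on all ranks at the same step, or on none.
    bool should_exit(long step, bool local_flag) {
        if (step % interval_ != 0) {
            return false;       // not a termination point: no communication
        }
        return global_or_(local_flag);
    }

private:
    long interval_;
    std::function<bool(bool)> global_or_;
};
```

The interval would be chosen so that a termination point is reached comfortably within the ten-minute window.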

Upvotes: 1
