Reputation: 41
My program is essentially a procedure that diagonalizes the Hamiltonian of a certain system. I'm working with a very large system, which makes the calculation computationally heavy; in fact, if I increase the system size by one more unit, the matrix exceeds the maximum size that LAPACK can diagonalize. (Note: the matrix size does not scale linearly with the number of units.)
I'm currently trying to get a very high-resolution result, which means I need to average the calculation roughly 10,000 times. Keeping the code serial would take about 300 hours to complete, so I have parallelized my program.
I have set it up so that each of 10 cores runs 1,000 calculations, and the results are combined at the end. I did this quite a while ago, so I don't believe the parallelization itself is the issue.
My issue is that there appears to be a bug in my code that causes the program to get "stuck". Unfortunately, it happens on only one of my ten processes, each of which does 1,000 calculations, and it could be as rare as 1 in 10,000 runs (a very specific scenario).
I know that it is getting stuck because I have an MPI_Reduce call in my program, and I also print each process's progress to the screen (at every 10% completed). From this I can see that my master process cannot continue past MPI_Reduce because one of the other processes has failed (but not stopped), and I can easily identify which process that is.
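To illustrate the structure I'm describing, here is a simplified sketch (not my actual code; the loop count, variable names, and the reduced quantity are just placeholders):

```fortran
program diag_average
   use mpi
   implicit none

   integer, parameter :: n_local = 1000          ! calculations per process (placeholder)
   integer :: ierr, rank, nprocs, i
   double precision :: local_sum, global_sum     ! stand-ins for the accumulated results

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   local_sum = 0.0d0
   do i = 1, n_local
      ! ... build and diagonalize one realisation of the Hamiltonian (LAPACK) ...
      ! local_sum = local_sum + result_of_this_run
      if (mod(i, n_local/10) == 0) then
         print '(a,i0,a,i0,a)', 'rank ', rank, ': ', (100*i)/n_local, '% done'
      end if
   end do

   ! The root cannot complete this call until every rank has contributed,
   ! which is why one hung worker stalls the master as well.
   call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)

   if (rank == 0) print *, 'average = ', global_sum / (nprocs * n_local)

   call MPI_Finalize(ierr)
end program diag_average
```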
I don't have the time to find and fix the bug, so what I'm looking for is the following:
1. Is it bad practice to do actual calculations on my master process, or should the master process only handle communication and do the final calculations at the end?
2. How do I cancel a process from within the program (so that my master process can continue past MPI_Reduce)?
Problem with 2: my master process can't execute any additional lines while it's waiting for the other processes to reach the MPI_Reduce call.
I am programming in Fortran using Open MPI and the mpifort compiler wrapper.
Upvotes: 1
Views: 437
Reputation: 22670
There is nothing wrong with using rank 0 for computation unless you know that it introduces a particular bottleneck.
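If you are unsure whether rank 0's share of the work is actually a bottleneck, you can time each rank's compute phase and compare. A minimal sketch (the program name and timer variables are purely illustrative):

```fortran
program timing_check
   use mpi
   implicit none

   integer :: ierr, rank
   double precision :: t0, t_compute, t_max, t_min

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   t0 = MPI_Wtime()
   ! ... this rank's share of the calculations goes here ...
   t_compute = MPI_Wtime() - t0

   ! Compare rank 0 against the slowest and fastest ranks.
   call MPI_Reduce(t_compute, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
   call MPI_Reduce(t_compute, t_min, 1, MPI_DOUBLE_PRECISION, MPI_MIN, 0, MPI_COMM_WORLD, ierr)
   if (rank == 0) then
      print *, 'rank 0 compute time:', t_compute
      print *, 'slowest rank:', t_max, '  fastest rank:', t_min
   end if

   call MPI_Finalize(ierr)
end program timing_check
```

If rank 0's time is comparable to the other ranks', there is no reason to treat it differently.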
There is currently no way to recover in MPI if one rank gets stuck.
There are some efforts towards fault tolerance, but those are primarily meant to survive hardware errors.
Whether you want to or not, you really must fix your code. If you have a bug that you do not understand, all your results are worthless (unless you have a separate method to fully validate them). It does not matter how rarely the bug manifests as a hang: it would be irresponsible to use the results for scientific work unless you can make a strong case that the bug does not influence them.
Upvotes: 1