Reputation: 41
My program is essentially a procedure that diagonalizes the Hamiltonian of a certain system. I'm working with a very large system, which makes the calculation computationally heavy; in fact, if I increase the system size by one more unit, the matrix exceeds the maximum size that LAPACK can diagonalize. (Note: the matrix size does not scale linearly with the number of units.)
I'm currently trying to get a very high-resolution result, which means I need to average the calculation roughly 10,000 times. Keeping the code serial would take about 300 hours to complete, so I have parallelized my program.
I have set it up so that each of 10 cores runs 1,000 calculations, and the results are combined at the end. I did this quite a while ago, so I don't believe the parallelization itself is the issue.
My issue is that there appears to be a bug in my code that causes the program to get "stuck". Unfortunately, it happens on only one of my ten processes, each of which does 1,000 calculations, and it could be as rare as 1 in 10,000 runs (a very specific scenario).
I know that it is getting stuck because I have an MPI_Reduce call in my program, and I also print each process's progress to the screen (at every 10% completed). From this I can see that my master process cannot continue past MPI_Reduce because one of the other processes has failed (but not stopped), and I can easily identify which process that is.
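To illustrate the structure I'm describing, here is a simplified sketch (not my actual code; the loop count, variable names, and the reduced quantity are just placeholders):

```fortran
program diag_average
   use mpi
   implicit none

   integer, parameter :: n_local = 1000          ! calculations per process (placeholder)
   integer :: ierr, rank, nprocs, i
   double precision :: local_sum, global_sum     ! stand-ins for the accumulated results

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   local_sum = 0.0d0
   do i = 1, n_local
      ! ... build and diagonalize one realisation of the Hamiltonian (LAPACK) ...
      ! local_sum = local_sum + result_of_this_run
      if (mod(i, n_local/10) == 0) then
         print '(a,i0,a,i0,a)', 'rank ', rank, ': ', (100*i)/n_local, '% done'
      end if
   end do

   ! The root cannot complete this call until every rank has contributed,
   ! which is why one hung worker stalls the master as well.
   call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)

   if (rank == 0) print *, 'average = ', global_sum / (nprocs * n_local)

   call MPI_Finalize(ierr)
end program diag_average
```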
I don't have the time to find and fix the bug, so what I'm looking for is the following:
1. Is it bad practice to do actual calculations on my master process, or should the master process only handle communication and do the final calculations at the end?
2. How do I cancel a process from within the program (so that my master process can continue past MPI_Reduce)?
Problem with 2: my master process can't execute any additional lines while it's waiting for the other processes to reach the MPI_Reduce call.
I am programming in Fortran using Open MPI and the mpifort compiler wrapper.
Upvotes: 1
Views: 437
Reputation: 22670
There is nothing wrong with using rank 0 for computation unless you know that it introduces a particular bottleneck.
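If you are unsure whether rank 0's share of the work is actually a bottleneck, you can time each rank's compute phase and compare. A minimal sketch (the program name and timer variables are purely illustrative):

```fortran
program timing_check
   use mpi
   implicit none

   integer :: ierr, rank
   double precision :: t0, t_compute, t_max, t_min

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   t0 = MPI_Wtime()
   ! ... this rank's share of the calculations goes here ...
   t_compute = MPI_Wtime() - t0

   ! Compare rank 0 against the slowest and fastest ranks.
   call MPI_Reduce(t_compute, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
   call MPI_Reduce(t_compute, t_min, 1, MPI_DOUBLE_PRECISION, MPI_MIN, 0, MPI_COMM_WORLD, ierr)
   if (rank == 0) then
      print *, 'rank 0 compute time:', t_compute
      print *, 'slowest rank:', t_max, '  fastest rank:', t_min
   end if

   call MPI_Finalize(ierr)
end program timing_check
```

If rank 0's time is comparable to the other ranks', there is no reason to treat it differently.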
There is currently no way to recover in MPI if one rank gets stuck.
There are some efforts towards fault tolerance, but those are primarily meant to survive hardware errors.
Whether you want to or not, you really must fix your code. If you have a bug that you do not understand, all your results are worthless (unless you have a separate method to fully validate them). It does not matter how rarely the bug manifests as a hang: it would be irresponsible to use the results for scientific work unless you can make a strong case that the bug does not influence them.
Upvotes: 1