Adrien Roussel

Reputation: 49

How to debug MPI program before bad termination?

I am currently developing a program written in C++ with the MPI+pthread paradigm.

I added some functionality to my program, and now one MPI process terminates badly with a message like this:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 37805 RUNNING AT node165
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@node162] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:0@node162] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node166] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:2@node166] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node166] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node162: task 0: Exited with exit code 7
[proxy:0:0@node162] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node166: task 2: Exited with exit code 7
[mpiexec@node162] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node162] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node162] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@node162] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

My problem is that I have no idea why I get this kind of message, and therefore no idea how to fix it.

I use only some basic MPI functions, and I make sure that no thread makes MPI calls (only my "master process" is allowed to call such functions).

I also checked that no process sends a message to itself, and that the destination process exists before a message is sent.

My question is quite simple: how can I find out where the problem comes from, so that I can debug my application?

Thank you very much.

Upvotes: 0

Views: 3588

Answers (2)

Paka101

Reputation: 91

In my experience with C++ and MPI, this frequently occurred when I did not call MPI_Finalize() before every return statement.
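
One way to avoid forgetting the call on some return path is a small scope guard that runs a cleanup function whenever the enclosing scope is left. The sketch below is only an illustration: `ScopeExit` and `run_with_early_return` are hypothetical names, and a counter stands in for `MPI_Finalize()` so the example compiles without an MPI installation.

```cpp
#include <functional>
#include <utility>

// Hypothetical helper (not part of MPI): runs a cleanup callable when the
// guard goes out of scope, so every return path triggers it exactly once.
class ScopeExit {
public:
    explicit ScopeExit(std::function<void()> cleanup)
        : cleanup_(std::move(cleanup)) {}
    ~ScopeExit() {
        if (cleanup_) cleanup_();
    }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    std::function<void()> cleanup_;
};

// Stand-in for a main() with several return statements. In an MPI program
// the guard would be created right after MPI_Init, for example:
//   ScopeExit finalize([] { MPI_Finalize(); });
int run_with_early_return(bool fail_early, int& finalize_calls) {
    ScopeExit finalize([&finalize_calls] { ++finalize_calls; });
    if (fail_early) {
        return 1;  // the cleanup still runs on this path
    }
    return 0;      // and on this one
}
```

Note that a guard like this only covers normal returns from the scope; it does not run if the process calls std::exit() or crashes.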

Upvotes: 0

David

Reputation: 754

One of your processes has had a segmentation fault (exit code 11 corresponds to signal 11, SIGSEGV). This means it read from or wrote to an area of memory that it is not permitted to access.

That is the cause. MPI functions are often difficult to get right the first time; for example, the culprit could be an MPI send or receive call with an incorrect size or buffer location.
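
To get a first hint about where the segmentation fault happens, each process can install a SIGSEGV handler that prints a backtrace before dying. The following is a minimal sketch, assuming Linux with glibc (`backtrace` and `backtrace_symbols_fd` come from `<execinfo.h>`). The function names `rank_label` and `install_segv_backtrace` are made up for this example, and reading the rank from `PMI_RANK` or `SLURM_PROCID` is an assumption about what the launcher exports.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc-specific)
#include <signal.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <initializer_list>
#include <string>

// Assumption: the launcher exports the rank as PMI_RANK or SLURM_PROCID.
std::string rank_label() {
    for (const char* name : {"PMI_RANK", "SLURM_PROCID"}) {
        if (const char* value = std::getenv(name)) return value;
    }
    return "?";
}

namespace {
char g_banner[128];  // formatted ahead of time so the handler only calls write()

void segv_handler(int sig) {
    void* frames[64];
    ssize_t ignored = write(STDERR_FILENO, g_banner, std::strlen(g_banner));
    (void)ignored;
    int depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    signal(sig, SIG_DFL);  // restore the default action ...
    raise(sig);            // ... so the process still terminates with SIGSEGV
}
}  // namespace

// Hypothetical helper: call once, early in main().
bool install_segv_backtrace() {
    std::snprintf(g_banner, sizeof g_banner,
                  "[rank %s, pid %d] SIGSEGV, backtrace follows:\n",
                  rank_label().c_str(), static_cast<int>(getpid()));
    void* warmup[1];
    backtrace(warmup, 1);  // the first call may load libgcc; do it outside the handler
    struct sigaction action {};
    action.sa_handler = segv_handler;
    sigemptyset(&action.sa_mask);
    return sigaction(SIGSEGV, &action, nullptr) == 0;
}
```

Compiling with -g and linking with -rdynamic makes function names more likely to appear in the output; otherwise the printed addresses can be resolved afterwards with a tool such as addr2line.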

The best solution is to fire up a parallel debugger so that you can watch all the processes. It looks like you are using a proper HPC system, so there is a chance that one is already installed: DDT and TotalView are the most popular.

Take a look at "How to debug an MPI program".
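
If no parallel debugger is available, a common fallback is to make each process print its PID and host and then wait until a serial debugger such as gdb is attached to it. The sketch below assumes a POSIX system; the environment variable `WAIT_FOR_DEBUGGER` and the function name are hypothetical, chosen only for this example.

```cpp
#include <unistd.h>

#include <cstdio>
#include <cstdlib>

// Hypothetical switch: the variable name WAIT_FOR_DEBUGGER is made up for
// this sketch. When it is unset the function returns immediately.
bool wait_for_debugger_if_requested() {
    if (std::getenv("WAIT_FOR_DEBUGGER") == nullptr) {
        return false;
    }
    char host[256] = "unknown";
    gethostname(host, sizeof host - 1);
    std::fprintf(stderr, "PID %d on %s is waiting for a debugger\n",
                 static_cast<int>(getpid()), host);
    std::fflush(stderr);

    // Spin until a debugger changes the flag, e.g. in gdb:
    //   set var keep_waiting = 0
    //   continue
    volatile int keep_waiting = 1;
    while (keep_waiting) {
        sleep(5);
    }
    return true;
}
```

Calling this near the start of main() and launching the job with the variable set lets you log in to the node, run `gdb -p <PID>`, clear the flag, and continue until the crash.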

Upvotes: 1
