Reputation: 446
I am currently having problems with a MPI Application.
I am sporadically receiving MPI errors of the form:
Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(1339)...............: MPI_Allreduce(sbuf=0x7ffa87ffcb98, rbuf=0x7ffa87ffcba8, count=2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(1180).........:
MPIR_Allreduce_intra(755).........:
MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 14 truncated; 384 bytes received but buffer size is 16
rank 1 in job 1 l1442_42561 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
However I do not know at where to look. I know that the error is happening in an Allreduce function call however there are multiple ones.
How do I know which function call produces the error? Simple printf debugging does not help as the function could be called a million times before the error occurs the first time.
It might also not occur at all or immediately after the start of the program.
Upvotes: 1
Views: 2913
Reputation: 446
I have been able to track down the origin of the error by calling
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN)
and then checking the return value of each of the Allreduce functions for not being equal to MPI_SUCCESS. This is a location where an error occurs
Upvotes: 1