Blackclaws

Reputation: 446

How to find the origin of MPI message truncated errors?

I am currently having problems with an MPI application.

I am sporadically receiving MPI errors of the form:

Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(1339)...............: MPI_Allreduce(sbuf=0x7ffa87ffcb98, rbuf=0x7ffa87ffcba8, count=2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(1180).........: 
MPIR_Allreduce_intra(755).........: 
MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 14 truncated; 384 bytes  received but buffer size is 16
rank 1 in job 1  l1442_42561   caused collective abort of all ranks
exit status of rank 1: killed by signal 9 

However, I do not know where to look. I know that the error happens in an MPI_Allreduce call, but there are several of them in the code.

How do I find out which call produces the error? Simple printf debugging does not help, as the function may be called a million times before the error occurs for the first time.

It might also not occur at all, or it might occur immediately after the program starts.

Upvotes: 1

Views: 2913

Answers (1)

Blackclaws

Reputation: 446

I have been able to track down the origin of the error by calling

MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN)

and then checking whether the return value of each Allreduce call differs from MPI_SUCCESS. Any call that returns something other than MPI_SUCCESS is a location where the error occurs.

Upvotes: 1
