puk

Reputation: 16782

Do MPI blocking calls (MPI_Send/MPI_Recv) have a time limit?

I am submitting MPI jobs on my university cluster. With larger programs, I have noticed that my program crashes during one of its final communication routines, with almost no helpful error message:

mpirun noticed that process rank 0 with PID 5466 on node red0005 exited on signal 9 (Killed).

The only helpful part of that message is that rank 0 caused the problem. The final communication routine works as follows (where <--> means MPI_Send/Recv):

   rank 0    rank 1    rank 2    rank 3 ...    rank n
     |        <-->      <-->      <-->          <-->
     |
     |
     |
     |
     |
     |
     |
     V
  ----------------------MPI_Barrier()------------------

My guess is that rank 0 hits MPI_Barrier(), waits for a very long period (570-1200 s), and then causes an exception. Alternatively, the machines might be running out of memory. When my local machine runs out of memory, I get a very detailed out-of-memory warning, but I have no idea what is happening on the remote machine. Any ideas what this might mean?
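One way to test the out-of-memory guess is to have each rank log its peak memory use just before it reaches the barrier. Signal 9 is SIGKILL, which a process cannot catch, so anything useful has to be written out before the kill happens. The following is only a minimal sketch, assuming a Linux node where getrusage() reports ru_maxrss in kilobytes; the helper names peak_rss_kb and report_peak_rss are hypothetical and not part of MPI.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Returns this process's peak resident set size as reported by
 * getrusage(), or -1 on failure. Assumption: on Linux the unit is
 * kilobytes (other systems may use a different unit). */
long peak_rss_kb(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}

/* Hypothetical helper: write one line per rank to stderr and flush it,
 * so the line is out of the process before any later SIGKILL. */
void report_peak_rss(int rank)
{
    fprintf(stderr, "rank %d: peak RSS %ld kB\n", rank, peak_rss_kb());
    fflush(stderr);
}
```

Calling report_peak_rss(rank) right before MPI_Barrier(), with rank obtained from MPI_Comm_rank(), would show whether rank 0's memory use is far larger than that of the other ranks.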

Upvotes: 0

Views: 1313

Answers (1)

jman

Reputation: 11616

It's most definitely not a timeout. MPI routines do not have such exceptions. If your cluster has a different MPI library (or the same MPI library compiled with a different compiler) or a different startup mechanism, give that a try. It's probably an issue with the library (or a bug in your program).

Upvotes: 2
