Reputation: 16782
I am submitting MPI jobs on my university cluster. With larger programs I have noticed that the program crashes during one of my final communication routines, with almost no helpful error message:
mpirun noticed that process rank 0 with PID 5466 on node red0005 exited on signal 9 (Killed).
The only helpful thing in all of that is that rank 0 caused the problem. The final communication routine works as follows (where <--> means MPI_Send/MPI_Recv):
rank 0   rank 1   rank 2   rank 3   ...   rank n
  |        <-->     <-->     <-->         <-->
  |
  |
  V
--------------------- MPI_Barrier() ---------------------
My guess is that rank 0 hits MPI_Barrier(), waits for a very long period (570-1200 s), and then causes an exception. Alternatively, the nodes might be running out of memory. When my local machine runs out of memory I get a very detailed out-of-memory warning, but I have no idea what is going on on the remote machine. Any ideas what this might mean?
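One way to check the out-of-memory guess: signal 9 is SIGKILL, which is what the Linux kernel's OOM killer sends, and it leaves a trace in the node's kernel log. A hedged sketch of where to look (SSH access to red0005 and a Slurm scheduler are my assumptions about this cluster; `<jobid>` is a placeholder):

```shell
# Kernel log on the node that hosted rank 0: the OOM killer records
# which PID it killed (here we would hope to see PID 5466).
ssh red0005 'dmesg -T | grep -iE "out of memory|oom-kill|killed process"'

# If the cluster runs Slurm, compare the job's peak memory (MaxRSS)
# against what was requested (replace <jobid> with your job id):
sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,ReqMem
```

If SSH to compute nodes is not allowed, the cluster admins can usually pull the same kernel-log lines for you.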
Upvotes: 0
Views: 1313
Reputation: 11616
It's most definitely not a timeout. MPI routines do not have such exceptions. If your cluster has a different MPI library (or the same MPI library compiled with a different compiler) or a different startup mechanism, give that a try. It's probably an issue with the library (or a bug in your program).
Upvotes: 2