Folks, I have stumbled upon quite a weird issue. I am running a job with the mpirun command:
mpirun -np 4 ~/opt/stuff/OSMC
Sometimes (the execution depends on a number of random values) one of the four processes dies:
Image      PC                Routine            Line     Source
OSMC       000000000050B54D  Unknown            Unknown  Unknown
OSMC       000000000050A055  Unknown            Unknown  Unknown
OSMC       00000000004BA320  Unknown            Unknown  Unknown
OSMC       000000000047976F  Unknown            Unknown  Unknown
OSMC       0000000000479B72  Unknown            Unknown  Unknown
OSMC       000000000043B7DC  mpi_m_mp_exchange  306      mpi_m.f90
OSMC       0000000000430880  mpi_m_mp_coagulat  85       mpi_m.f90
OSMC       000000000041304B  op_m_mp_op_run_    81       op_m.f90
OSMC       000000000040FF22  osmc_m_mp_run_     543      OSMC_m.f90
OSMC       000000000040FD09  MAIN__             28       OSMC_m.f90
OSMC       000000000040FC4C  Unknown            Unknown  Unknown
libc.so.6  000000362081ED5D  Unknown            Unknown  Unknown
OSMC       000000000040FB49  Unknown            Unknown  Unknown
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 28468 on
node rcfen04 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
The system writes no core dump, so I have no more information apart from this short summary. I had a look at mpi_m.f90 line 306, where an existing array is set to 0. The system should be able to write a core dump file, since:
[user@host path]$ ulimit -a
core file size (blocks, -c) unlimited
...
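One thing I have not verified is whether the processes started by mpirun see the same limit as my interactive shell, since the limit is a per-process setting. A minimal sketch of such a check, assuming a POSIX-like shell whose `ulimit` supports `-c` (bash and dash do); the function name `report_core_limit` and the script name `report_core_limit.sh` are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the node name and the core-file size limit
# that the *current* process actually sees.
report_core_limit() {
    printf 'rank_host=%s core_limit=%s\n' "$(uname -n)" "$(ulimit -c)"
}

report_core_limit

# Assumed usage under the launcher (script saved as ./report_core_limit.sh):
#   mpirun -np 4 sh ./report_core_limit.sh
# Each rank then prints the limit it inherited, which may differ from the
# value shown by `ulimit -a` in the login shell.
```

If any rank reports 0 here, the missing core file would be explained by the inherited limit rather than by the crash itself.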
This is the piece of code that is reported in the short summary:
module mpi_m
implicit none
...
real(wp),allocatable :: part(:,:) ! ARRAY DECLARATION
...
allocate( part_(pdim,is_:ie_) ) ! ARRAY ALLOCATION
...
subroutine exchanger_compute_bij(ierr,msg)
implicit none
...
part = 0.0_wp ! HERE CODE CRASHES
...
end subroutine
...
end module
Nothing seems wrong to me. The offending statement is a Fortran whole-array assignment, which should be fine. It crashes even when I compile with bounds checking.
How can I determine the reason for this sudden crash? I had hoped that a core dump file, loaded into TotalView or some other debugger, would help.
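In the absence of a core file, the only raw data I have are the PC addresses in the traceback. A rough sketch of how they could be fed to `addr2line` from binutils to try to resolve the frames marked Unknown; the function name `traceback_addresses` and the file `traceback.txt` are hypothetical, and frames that belong to code built without debug information may still come back as `??`:

```shell
#!/bin/sh
# Hypothetical helper: read a traceback on stdin and print the PC column
# (second field) of every line belonging to the given image, prefixed
# with 0x so the value can be passed to addr2line.
traceback_addresses() {
    image="$1"
    awk -v img="$image" '$1 == img && $2 ~ /^[0-9A-Fa-f]+$/ { print "0x" $2 }'
}

# Assumed usage (traceback saved to traceback.txt, binary built with -g):
#   traceback_addresses OSMC < traceback.txt | xargs addr2line -f -e ~/opt/stuff/OSMC
```

Lines for other images, such as libc.so.6, are skipped on purpose, because their addresses are not meaningful for the OSMC executable.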