marco

Reputation: 589

mpirun: one process terminates but prints no core dump

Folks, I have run into quite a weird issue. I am running a job with the mpirun command:

   mpirun -np 4 ~/opt/stuff/OSMC

Sometimes (the execution depends on a number of random values) one of the four processes dies:

  Image              PC                Routine            Line        Source             
  OSMC               000000000050B54D  Unknown               Unknown  Unknown
  OSMC               000000000050A055  Unknown               Unknown  Unknown
  OSMC               00000000004BA320  Unknown               Unknown  Unknown
  OSMC               000000000047976F  Unknown               Unknown  Unknown
  OSMC               0000000000479B72  Unknown               Unknown  Unknown
  OSMC               000000000043B7DC  mpi_m_mp_exchange         306  mpi_m.f90
  OSMC               0000000000430880  mpi_m_mp_coagulat          85  mpi_m.f90
  OSMC               000000000041304B  op_m_mp_op_run_            81  op_m.f90
  OSMC               000000000040FF22  osmc_m_mp_run_            543  OSMC_m.f90
  OSMC               000000000040FD09  MAIN__                     28  OSMC_m.f90
  OSMC               000000000040FC4C  Unknown               Unknown  Unknown
  libc.so.6          000000362081ED5D  Unknown               Unknown  Unknown
  OSMC               000000000040FB49  Unknown               Unknown  Unknown
  --------------------------------------------------------------------------
  mpirun has exited due to process rank 1 with PID 28468 on
  node rcfen04 exiting without calling "finalize". This may
  have caused other processes in the application to be
  terminated by signals sent by mpirun (as reported here).
  --------------------------------------------------------------------------

The system produces no core dump, so I have no information beyond this short summary. I had a look at line 306 of mpi_m.f90, where an already allocated array is set to 0. The system should be able to write a core dump file, since:

  [user@host path]$ ulimit -a
  core file size          (blocks, -c) unlimited
  ...
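
Note that the limit shown above is the one reported by the interactive shell. Processes started by mpirun do not necessarily inherit it, for instance when they are launched on another node or through a batch system. The following is a minimal sketch of a check, assuming a shell whose `ulimit` builtin accepts `-c` (bash and dash do). The function name `report_core_limit` is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Sketch: report the core file size limit seen by the current process.
# report_core_limit is a hypothetical helper name, not part of any tool.
report_core_limit() {
    # uname -n prints the node name; ulimit -c prints the soft core limit
    printf '%s: core file size limit = %s\n' "$(uname -n)" "$(ulimit -c)"
}

report_core_limit
```

To see the value from inside the MPI job rather than from the login shell, the same query could be launched through mpirun, for example `mpirun -np 4 sh -c 'ulimit -c'`. If any rank reports 0 there, no core file would be written for that rank regardless of the interactive setting.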

This is the piece of code that is reported in the short summary:

  module mpi_m
    implicit none
    ...
    real(wp),allocatable :: part(:,:) ! ARRAY DECLARATION
    ...
    allocate( part_(pdim,is_:ie_) )   ! ARRAY ALLOCATION
    ...
    subroutine exchanger_compute_bij(ierr,msg)
    implicit none
    ...
    part = 0.0_wp                     ! HERE CODE CRASHES
    ...
    end subroutine
    ...
  end module

Nothing seems wrong to me. The offending statement is a Fortran whole-array assignment, which should be fine. The crash occurs even when I compile with bounds checking.

How can I determine the reason for this sudden crash? I had hoped that a core dump file, loaded into TotalView or some other debugger, would help.

Upvotes: 0

Views: 506
