Sparonuz

Reputation: 145

Seg fault in Fortran MPI_COMM_CREATE_GROUP when using a group not directly created from MPI_COMM_WORLD

I'm having a segmentation fault that I cannot really understand in a simple code that just:

- duplicates MPI_COMM_WORLD with MPI_COMM_DUP;
- creates the group of all processes of the duplicated communicator with MPI_COMM_GROUP;
- creates a new group containing the first half of the processes with MPI_GROUP_INCL;
- creates a new communicator for that group with MPI_COMM_CREATE_GROUP.

Specifically, I use this last call instead of plain MPI_COMM_CREATE because it is collective only over the group of processes contained in group, while MPI_COMM_CREATE is collective over every process in comm. The code is the following:

program mpi_comm_create_grp
  use mpi
  IMPLICIT NONE

  INTEGER :: mpi_size, mpi_err_code
  INTEGER :: my_comm_dup, mpi_new_comm, mpi_group_world, mpi_new_group
  INTEGER :: rank_index
  INTEGER, DIMENSION(:), ALLOCATABLE :: rank_vec

  CALL mpi_init(mpi_err_code)
  CALL mpi_comm_size(mpi_comm_world, mpi_size, mpi_err_code)

  !! allocate and fill the rank vector for the new group
  !! (mpi_size/2 entries, so the implied do must stop at mpi_size/2 - 1)
  ALLOCATE(rank_vec(mpi_size/2))
  rank_vec(:) = (/ (rank_index, rank_index = 0, mpi_size/2 - 1) /)

  !! create the group directly from comm_world: this way works
  ! CALL mpi_comm_group(mpi_comm_world, mpi_group_world, mpi_err_code)

  !! duplicate comm_world and create the group from the dup: this way fails
  CALL mpi_comm_dup(mpi_comm_world, my_comm_dup, mpi_err_code)
  !! create the group of all processes from the duplicated comm_world
  CALL mpi_comm_group(my_comm_dup, mpi_group_world, mpi_err_code)

  !! create a new group with just half of the processes in comm_world
  CALL mpi_group_incl(mpi_group_world, mpi_size/2, rank_vec, mpi_new_group, mpi_err_code)

  !! create a new comm from comm_world using the new group
  CALL mpi_comm_create_group(mpi_comm_world, mpi_new_group, 0, mpi_new_comm, mpi_err_code)

  !! deallocate and finalize MPI
  IF (ALLOCATED(rank_vec)) DEALLOCATE(rank_vec)
  CALL mpi_finalize(mpi_err_code)
end program mpi_comm_create_grp

If, instead of duplicating COMM_WORLD, I create the group directly from the global communicator (the commented line), everything works just fine.
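For clarity, this is the sequence that works, with the group taken straight from mpi_comm_world (same variable names as above):

  CALL mpi_comm_group(mpi_comm_world, mpi_group_world, mpi_err_code)
  CALL mpi_group_incl(mpi_group_world, mpi_size/2, rank_vec, mpi_new_group, mpi_err_code)
  CALL mpi_comm_create_group(mpi_comm_world, mpi_new_group, 0, mpi_new_comm, mpi_err_code)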

The parallel debugger I'm using traces the seg fault back to a call to MPI_GROUP_TRANSLATE_RANKS, but, as far as I know, MPI_COMM_DUP duplicates all the attributes of the copied communicator, rank numbering included.
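To check that, one can translate the ranks of the duplicated group back to a group taken from mpi_comm_world; a minimal sketch reusing the variables above (world_group and translated are new names introduced just for this test, with their declarations placed at the top of the program):

  INTEGER :: world_group
  INTEGER, DIMENSION(:), ALLOCATABLE :: translated

  CALL mpi_comm_group(mpi_comm_world, world_group, mpi_err_code)
  ALLOCATE(translated(mpi_size/2))
  !! each rank of rank_vec in mpi_group_world should map to the same
  !! rank number in world_group if the numbering is really duplicated
  CALL mpi_group_translate_ranks(mpi_group_world, mpi_size/2, rank_vec, &
                                 world_group, translated, mpi_err_code)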

I am using ifort version 18.0.5, but I also tried 17.0.4 and 19.0.2, with no better results.

Upvotes: 3

Views: 337

Answers (2)

Sparonuz

Reputation: 145

As suggested in the comments, I wrote to the Open MPI users list, and they replied:

That is perfectly valid. The MPI processes that make up the group are all part of comm world. I would file a bug with Intel MPI.

So I posted a question on the Intel forum. It is a bug they solved in the latest version of the library, 19.3.

Upvotes: 1

Sparonuz

Reputation: 145

Well, the thing is a little tricky, at least for me, but after some tests and help the root of the problem was found.

In the code, the call

CALL mpi_comm_create_group(mpi_comm_world, mpi_new_group, 0, mpi_new_comm, mpi_err_code)

creates a new communicator for the group mpi_new_group, previously created. However mpi_comm_world, which is used as the first argument, is not in the same context as mpi_new_group, as explained in the MPICH reference:

MPI_COMM_DUP will create a new communicator over the same group as comm but with a new context

So the correct call would be:

CALL mpi_comm_create_group(my_comm_dup, mpi_new_group, 0, mpi_new_comm, mpi_err_code)

i.e., replacing mpi_comm_world with my_comm_dup, the communicator from which mpi_group_world was created.
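Putting it together, the consistent sequence (same variable names as in the question) is:

  CALL mpi_comm_dup(mpi_comm_world, my_comm_dup, mpi_err_code)
  CALL mpi_comm_group(my_comm_dup, mpi_group_world, mpi_err_code)
  CALL mpi_group_incl(mpi_group_world, mpi_size/2, rank_vec, mpi_new_group, mpi_err_code)
  !! comm and group now share the same context
  CALL mpi_comm_create_group(my_comm_dup, mpi_new_group, 0, mpi_new_comm, mpi_err_code)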

I am still not sure why it works with Open MPI, but it is generally more tolerant of this sort of thing.

Upvotes: 3
