Enlico

Reputation: 28416

MPI communication with different sized datatypes

Suppose a program is run on xp*yp*zp processes. A Cartesian communicator is used, so that the processes can be thought of as arranged in a grid of dimensions (xp,yp,zp). In this program the root process (rank 0) declares and allocates a 3D array Atot which is going to be filled with the 3D arrays A declared by each process (root included).

INTEGER, DIMENSION(3) :: Ntot
INTEGER, DIMENSION(3) :: N
INTEGER, DIMENSION(:,:,:), ALLOCATABLE :: Atot
INTEGER, DIMENSION(:,:,:), ALLOCATABLE :: A
:
! the 3 elements of the array N are determined by dividing the corresponding
! element of the array Ntot by the number of processes in that direction,
! taking into account the remainder of the division.
:
IF (myid == 0) THEN ! myid is the process' rank
  ALLOCATE(Atot(Ntot(1),Ntot(2),Ntot(3)))
END IF
ALLOCATE(A(N(1),N(2),N(3)))
A = myid

What is the most correct, easy and efficient way to perform this communication? I was thinking about MPI_Gather: each process would send its whole array A, which is made up of N(1)*N(2)*N(3) MPI_INTEGERs, and the root process would then receive them into a single MPI derived datatype corresponding to a cube (MPI_Type_vector should be used twice recursively, am I right?). Is it possible to do so?
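To make the idea concrete, here is a rough, untested sketch of what I imagine the "twice recursive" construction could look like on the receiving side (nl1, nl2 and nl3 are just placeholders for the dimensions of one process' block inside Atot):

INTEGER :: plane_type, block_type, ierr
INTEGER(KIND=MPI_ADDRESS_KIND) :: lb, int_extent

! nl2 "columns" of nl1 contiguous integers, separated by Atot's full first dimension
CALL MPI_TYPE_VECTOR(nl2, nl1, Ntot(1), MPI_INTEGER, plane_type, ierr)
! stack nl3 such planes; the stride between planes has to be given in bytes,
! hence MPI_TYPE_CREATE_HVECTOR rather than a second MPI_TYPE_VECTOR
CALL MPI_TYPE_GET_EXTENT(MPI_INTEGER, lb, int_extent, ierr)
CALL MPI_TYPE_CREATE_HVECTOR(nl3, 1, Ntot(1)*Ntot(2)*int_extent, plane_type, block_type, ierr)
CALL MPI_TYPE_COMMIT(block_type, ierr)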

Even if this works, it sounds easy to me only when the number of processes along each direction of the Cartesian communicator evenly divides the corresponding element of Ntot, that is, when the array A has the same dimensions in every process. This is the case when Ntot = (/9,9,9/).

What about the case Ntot = (/10,10,10/)? The MPI derived datatype would have different dimensions in different processes, so would it still be possible to use MPI_Gather?

EDIT

I do not exclude that MPI_GATHERV could be part of the solution. However, it only allows each process to send (and the root process to receive) a different amount of data, that is, a different number of MPI_INTEGERs (in the simple example). In the case I'm dealing with, however, the root process has to receive the data into the 3-dimensional array Atot. To do so, I think it could be useful to define an MPI derived datatype, let's name it smallcube. In this case, each process sends its whole array A, whereas the master process receives one datum of type smallcube from each process. The point is that smallcube has different lengths along the three dimensions, depending on its position in the Cartesian grid (supposing the lengths are not evenly divided by the number of processes along the three dimensions).
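To be explicit, what I have in mind for smallcube is something like the following (untested; subsizes and starts would have to be computed from the sending process' position in the Cartesian grid):

INTEGER :: smallcube, ierr
INTEGER, DIMENSION(3) :: subsizes, starts
! subsizes = extents of the block sent by one process
! starts   = where that block begins inside Atot (0-based, as MPI expects)
CALL MPI_TYPE_CREATE_SUBARRAY(3, Ntot, subsizes, starts, &
     MPI_ORDER_FORTRAN, MPI_INTEGER, smallcube, ierr)
CALL MPI_TYPE_COMMIT(smallcube, ierr)

The difficulty, as far as I understand, is that MPI_GATHERV accepts a single receive datatype for all ranks, so different smallcube types cannot be passed to it directly; they would have to be used with point-to-point receives (or something like MPI_ALLTOALLW) instead.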

Upvotes: 1

Views: 415

Answers (1)

d_1999

Reputation: 854

As has been mentioned in the comments, if you really do want to fetch all the data onto one processor then MPI_Type_create_subarray may be a good way of doing this. Given that I've just used MPI_Type_create_subarray in my own project, I thought I would try to provide a working example answer (note that I'm being loose with error checking and with the types I'm declaring).

program subarrayTest
  use mpi
  implicit none
  integer, parameter :: n1 = 10, n2=20, n3=32
  INTEGER, DIMENSION(3) :: Ntot, N, sizes, subsizes, starts
  INTEGER, DIMENSION(:,:,:), ALLOCATABLE :: Atot, A
  integer :: iproc, nproc, sendSubType, ierr
  integer :: nl1, nl2, nl3 !Local block sizes
  integer :: l1, l2, l3, u1, u2, u3 !Local upper/lower bounds
  integer :: ip, sendRequest
  integer, dimension(:), allocatable :: recvSubTypes, recvRequests
  integer, dimension(:,:,:), allocatable :: boundsArr

  !MPI Setup
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, nproc, ierr)
  call mpi_comm_rank(mpi_comm_world, iproc, ierr)

  !Set grid sizes
  Ntot = [n1,n2,n3]
  !For simplicity I'm assuming we only split the last dimension (and it has nproc as a factor)
  !although as long as you can specify l* and u* this should work (and hence nl* = 1+u*-l*)
  if(mod(n3,nproc).ne.0) then
     print*,"Error: n3 must have nproc as a factor."
     call mpi_abort(mpi_comm_world,MPI_ERR_UNKNOWN,ierr)
  endif
  nl1 = n1 ; l1 = 1 ; u1=l1+nl1-1
  nl2 = n2 ; l2 = 1 ; u2=l2+nl2-1
  nl3 = n3/nproc ; l3 = 1+iproc*nl3 ; u3=l3+nl3-1
  N = [nl1,nl2,nl3]

  !Very lazy way to ensure proc 0 knows the upper and lower bounds for all procs
  allocate(boundsArr(2,3,0:nproc-1)) 
  boundsArr=0
  boundsArr(:,1,iproc) = [l1, u1]
  boundsArr(:,2,iproc) = [l2, u2]
  boundsArr(:,3,iproc) = [l3, u3]
  call mpi_allreduce(MPI_IN_PLACE,boundsArr,size(boundsArr),MPI_INTEGER, &
       MPI_SUM, mpi_comm_world, ierr)

  !Allocate and populate local data portion
  IF (iproc == 0) THEN ! iproc is the process' rank
     ALLOCATE(Atot(Ntot(1),Ntot(2),Ntot(3)))
     Atot=-1 !So you can check all elements are set
  END IF
  ALLOCATE(A(N(1),N(2),N(3)))
  A = iproc

  !Now lets create the sub array types
  !First do the send type
  sizes=N !The size of the local array
  subsizes=1+[u1,u2,u3]-[l1,l2,l3] !The amount of data in each dimension to send -- here it's the full local data array but in general it could be a small subset

  starts = [0,0,0] !These are the lower bounds in each dimension where the sub array starts -- Note MPI assumes 0 indexing here.
  call mpi_type_create_subarray(size(sizes),sizes, subsizes, starts, &
       MPI_ORDER_FORTRAN, MPI_INTEGER, sendSubType, ierr)
  call mpi_type_commit(sendSubType, ierr)

  !Now on proc0 setup each receive type
  if (iproc == 0) then
     allocate(recvSubTypes(0:nproc-1)) !Use 0 indexing for ease
     sizes = Ntot !Size of dest array
     do ip=0,nproc-1
        subsizes=1+boundsArr(2,:,ip)-boundsArr(1,:,ip) !Size of A being sent from proc ip
        starts = boundsArr(1,:,ip) -1
        call mpi_type_create_subarray(size(sizes),sizes, subsizes, starts, &
             MPI_ORDER_FORTRAN, MPI_INTEGER, recvSubTypes(ip), ierr)
        call mpi_type_commit(recvSubTypes(ip), ierr)
     end do
  end if

  !Now lets use non-blocking communications to transfer data 
  !First post receives -- tag with source proc id
  if (iproc == 0) then
     allocate(recvRequests(0:nproc-1))
     do ip=0,nproc-1
        call mpi_irecv(Atot,1,recvSubTypes(ip),ip,ip,&
             mpi_comm_world,recvRequests(ip),ierr)
     end do
  end if

  !Now post sends
  call mpi_isend(A,1,sendSubType,0,iproc,mpi_comm_world,&
       sendRequest, ierr)

  !Now wait on receives/sends
  if(iproc == 0) call mpi_waitall(size(recvRequests),recvRequests,&
       MPI_STATUSES_IGNORE,ierr)
  call mpi_wait(sendRequest, MPI_STATUS_IGNORE, ierr)

  if(iproc == 0) print*,Atot
  call mpi_barrier(mpi_comm_world, ierr)

  !Now free resources -- not shown
  call mpi_finalize(ierr)
end program subarrayTest
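The freeing of resources mentioned (but not shown) above could, for example, look like this, using the same variable names as in the program:

!Free the committed types and the allocated arrays before finalizing
call mpi_type_free(sendSubType, ierr)
if (iproc == 0) then
   do ip = 0, nproc-1
      call mpi_type_free(recvSubTypes(ip), ierr)
   end do
   deallocate(recvSubTypes, recvRequests, Atot)
end if
deallocate(A, boundsArr)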

You should be able to compile this with mpif90. You'll need to play around with it to set the local bounds appropriately for your case, but hopefully it provides a helpful starting point. It doesn't assume anything about the local array sizes being the same across processors: as long as the lower and upper bounds (l* and u*) are set correctly, it should work fine. Note that my code above probably doesn't follow best practice in a number of ways.
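For the more general case you asked about, here is an illustrative (untested) sketch of how the local bounds could be set for a full 3D decomposition in which Ntot is not necessarily divisible by the number of processes per direction (dims, coords, nl, l and u are new helper variables, and a Cartesian communicator is created with mpi_dims_create/mpi_cart_create):

integer :: cartComm, d, ierr
integer, dimension(3) :: dims, coords, nl, l, u
logical, dimension(3) :: periods

dims = 0 ; periods = .false.
call mpi_dims_create(nproc, 3, dims, ierr)
call mpi_cart_create(mpi_comm_world, 3, dims, periods, .false., cartComm, ierr)
call mpi_cart_coords(cartComm, iproc, 3, coords, ierr) !Valid because reorder=.false.

do d = 1, 3
   !Base block size, plus one extra element on the first mod(Ntot(d),dims(d)) ranks
   nl(d) = Ntot(d)/dims(d)
   if (coords(d) < mod(Ntot(d), dims(d))) nl(d) = nl(d) + 1
   !Lower bound = number of elements owned by the preceding ranks along this direction, plus 1
   l(d) = coords(d)*(Ntot(d)/dims(d)) + min(coords(d), mod(Ntot(d), dims(d))) + 1
   u(d) = l(d) + nl(d) - 1
end do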

Upvotes: 3
