Reputation: 28416
Suppose a program is run on xp times yp times zp processes.
A Cartesian communicator is used, such that the processes can be thought of as arranged in a grid of dimensions (xp, yp, zp).
In this program the root process (0) declares and allocates a 3D array Atot, which is going to be filled with the 3D arrays A declared by each process (root included).
INTEGER, DIMENSION(3) :: Ntot
INTEGER, DIMENSION(3) :: N
INTEGER, DIMENSION(:,:,:), ALLOCATABLE :: Atot
INTEGER, DIMENSION(:,:,:), ALLOCATABLE :: A
:
! the 3 elements of the array N are determined by dividing the corresponding
! element of the array Ntot by the number of processes in that direction,
! taking into account the remainder of the division.
:
IF (myid == 0) THEN ! myid is the process' rank
ALLOCATE(Atot(Ntot(1),Ntot(2),Ntot(3)))
END IF
ALLOCATE(A(N(1),N(2),N(3)))
A = myid
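For completeness, the Cartesian communicator is created more or less like this (a sketch with indicative names; cart_comm, dims, periods and coords are not necessarily the actual names in my code):
INTEGER :: cart_comm, ierr
INTEGER, DIMENSION(3) :: dims, coords
LOGICAL, DIMENSION(3) :: periods

dims    = (/ xp, yp, zp /)
periods = .FALSE.
CALL MPI_CART_CREATE(MPI_COMM_WORLD, 3, dims, periods, .TRUE., cart_comm, ierr)
CALL MPI_COMM_RANK(cart_comm, myid, ierr)              ! rank in the Cartesian communicator
CALL MPI_CART_COORDS(cart_comm, myid, 3, coords, ierr) ! this process' (x,y,z) position in the grid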
What is the most correct, easy and efficient way to perform the communication?
I was thinking about MPI_Gather: each process would send its whole array A, which is made up of N(1)*N(2)*N(3) MPI_INTEGERs, and the root process should then receive each of them into a single MPI derived data type corresponding to a cube (MPI_Type_vector should be used twice recursively, am I right?). Is it possible to do so?
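To make the idea concrete, this is roughly the construction I have in mind for the receive type on the root (untested; plane_t and cube_t are just placeholder names):
INTEGER :: plane_t, cube_t, ierr
INTEGER(KIND=MPI_ADDRESS_KIND) :: lb, int_extent

! one N(1) x N(2) slice of the sub-cube, laid out inside an Ntot(1) x Ntot(2) slab of Atot
CALL MPI_TYPE_VECTOR(N(2), N(1), Ntot(1), MPI_INTEGER, plane_t, ierr)
! stack N(3) such slices; the stride between slices is a full Ntot(1)*Ntot(2) slab, given in bytes
CALL MPI_TYPE_GET_EXTENT(MPI_INTEGER, lb, int_extent, ierr)
CALL MPI_TYPE_CREATE_HVECTOR(N(3), 1, INT(Ntot(1)*Ntot(2), MPI_ADDRESS_KIND)*int_extent, &
                             plane_t, cube_t, ierr)
CALL MPI_TYPE_COMMIT(cube_t, ierr)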
Even if this works, it sounds easy to me only when the number of processes along each direction of the Cartesian communicator evenly divides the corresponding element of Ntot, that is, when the array A has the same dimensions in each process. This is the case when Ntot = (/9,9,9/).
What about the case Ntot = (/10,10,10/)? For example, with 3 processes along a direction, some local arrays would have 4 elements along it and others only 3. The MPI derived data type would then have different dimensions in different processes, so would it still be possible to use MPI_Gather?
EDIT
I do not exclude that MPI_GATHERV could be part of the solution. However, it only allows each process to send (and the root process to receive) a different amount of data, that is, a different number of MPI_INTEGERs (in the simple example). In the case I'm dealing with, however, the root process has to receive the data into the 3-dimensional array Atot. To do so, I think it could be useful to define an MPI derived data type, let's name it smallcube. In this case, each process sends its whole array A, whereas the master process receives one datum of type smallcube from each process. The point is that smallcube has different lengths along the three dimensions, depending on its position in the Cartesian grid (supposing the lengths are not evenly divided by the number of processes along the three dimensions).
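For reference, the standard Fortran binding of MPI_GATHERV looks like this (schematic, with placeholder argument names); note that recvcounts and displs are per-process, but there is only one recvtype shared by all incoming blocks, which is exactly what makes the per-process smallcube idea hard to express with this call alone:
! schematic only -- sendbuf, recvbuf, etc. are placeholders, not variables from my code
CALL MPI_GATHERV(sendbuf, sendcount, sendtype,          &
                 recvbuf, recvcounts, displs, recvtype, &
                 root, comm, ierr)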
Upvotes: 1
Views: 415
Reputation: 854
As has been mentioned in the comments, if you really do want to fetch all the data onto one processor then MPI_Type_create_subarray may be a good way of doing this. Given that I've just used MPI_Type_create_subarray in my own project, I thought I would try to provide a working example answer (note I'm being loose with error checking and the types I'm declaring).
program subarrayTest
use mpi
implicit none
integer, parameter :: n1 = 10, n2=20, n3=32
INTEGER, DIMENSION(3) :: Ntot, N, sizes, subsizes, starts
INTEGER, DIMENSION(:,:,:), ALLOCATABLE :: Atot, A
integer :: iproc, nproc, sendSubType, ierr
integer :: nl1, nl2, nl3 !Local block sizes
integer :: l1, l2, l3, u1, u2, u3 !Local upper/lower bounds
integer :: ip, sendRequest
integer, dimension(:), allocatable :: recvSubTypes, recvRequests
integer, dimension(:,:,:), allocatable :: boundsArr
!MPI Setup
call mpi_init(ierr)
call mpi_comm_size(mpi_comm_world, nproc, ierr)
call mpi_comm_rank(mpi_comm_world, iproc, ierr)
!Set grid sizes
Ntot = [n1,n2,n3]
!For simplicity I'm assuming we only split the last dimension (and it has nproc as a factor)
!although as long as you can specify l* and u* this should work (and hence nl* = 1+u*-l*)
if(mod(n3,nproc).ne.0) then
print*,"Error: n3 must have nproc as a factor."
call mpi_abort(mpi_comm_world,MPI_ERR_UNKNOWN,ierr)
endif
nl1 = n1 ; l1 = 1 ; u1=l1+nl1-1
nl2 = n2 ; l2 = 1 ; u2=l2+nl2-1
nl3 = n3/nproc ; l3 = 1+iproc*nl3 ; u3=l3+nl3-1
N = [nl1,nl2,nl3]
!Very lazy way to ensure proc 0 knows the upper and lower bounds for all procs
allocate(boundsArr(2,3,0:nproc-1))
boundsArr=0
boundsArr(:,1,iproc) = [l1, u1]
boundsArr(:,2,iproc) = [l2, u2]
boundsArr(:,3,iproc) = [l3, u3]
call mpi_allreduce(MPI_IN_PLACE,boundsArr,size(boundsArr),MPI_INTEGER, &
MPI_SUM, mpi_comm_world, ierr)
!Allocate and populate local data portion
IF (iproc == 0) THEN ! iproc is the process' rank
ALLOCATE(Atot(Ntot(1),Ntot(2),Ntot(3)))
Atot=-1 !So you can check all elements are set
END IF
ALLOCATE(A(N(1),N(2),N(3)))
A = iproc
!Now lets create the sub array types
!First do the send type
sizes=N !The size of the local array
subsizes=1+[u1,u2,u3]-[l1,l2,l3] !The amount of data in each dimension to send -- here it's the full local data array but in general it could be a small subset
starts = [0,0,0] !These are the lower bounds in each dimension where the sub array starts -- Note MPI assumes 0 indexing here.
call mpi_type_create_subarray(size(sizes),sizes, subsizes, starts, &
MPI_ORDER_FORTRAN, MPI_INTEGER, sendSubType, ierr)
call mpi_type_commit(sendSubType, ierr)
!Now on proc0 setup each receive type
if (iproc == 0) then
allocate(recvSubTypes(0:nproc-1)) !Use 0 indexing for ease
sizes = Ntot !Size of dest array
do ip=0,nproc-1
subsizes=1+boundsArr(2,:,ip)-boundsArr(1,:,ip) !Size of A being sent from proc ip
starts = boundsArr(1,:,ip) -1
call mpi_type_create_subarray(size(sizes),sizes, subsizes, starts, &
MPI_ORDER_FORTRAN, MPI_INTEGER, recvSubTypes(ip), ierr)
call mpi_type_commit(recvSubTypes(ip), ierr)
end do
end if
!Now lets use non-blocking communications to transfer data
!First post receives -- tag with source proc id
if (iproc == 0) then
allocate(recvRequests(0:nproc-1))
do ip=0,nproc-1
call mpi_irecv(Atot,1,recvSubTypes(ip),ip,ip,&
mpi_comm_world,recvRequests(ip),ierr)
end do
end if
!Now post sends
call mpi_isend(A,1,sendSubType,0,iproc,mpi_comm_world,&
sendRequest, ierr)
!Now wait on receives/sends
if(iproc == 0) call mpi_waitall(size(recvRequests),recvRequests,&
MPI_STATUSES_IGNORE,ierr)
call mpi_wait(sendRequest, MPI_STATUS_IGNORE, ierr)
if(iproc == 0) print*,Atot
call mpi_barrier(mpi_comm_world, ierr)
!Now free resources -- not shown
call mpi_finalize(ierr)
end program subarrayTest
You should be able to compile this with mpif90. You'll need to play around with this in order to set the local bounds appropriately for your case, but hopefully this will provide a helpful starting point. This doesn't assume anything about the local array sizes being the same across processors; as long as the lower and upper bounds (l* and u*) are set correctly then this should work OK. Note my code above probably doesn't follow best practice in a number of ways.
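For the full 3D Cartesian decomposition in the question, the local bounds could be computed along these lines (a sketch, untested, under the assumption that coords(:) and dims(:) describe this rank's position and the process grid, e.g. obtained via mpi_cart_coords):
integer :: d, base, rem
integer, dimension(3) :: lo, hi, dims, coords !Assumed to come from the Cartesian communicator

do d = 1, 3
   base = Ntot(d)/dims(d)
   rem  = mod(Ntot(d), dims(d))
   if (coords(d) < rem) then !The first rem ranks along this direction get one extra element
      N(d)  = base + 1
      lo(d) = coords(d)*(base+1) + 1
   else
      N(d)  = base
      lo(d) = rem*(base+1) + (coords(d)-rem)*base + 1
   end if
   hi(d) = lo(d) + N(d) - 1
end do
Here lo and hi play the role of l* and u* above: each rank builds its send type from N, and the root builds its receive types from the gathered lo/hi (with starts = lo - 1).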
Upvotes: 3