Reputation: 1020
I set up this algorithm to share data between different processors, and it has worked so far, but when I throw a much larger problem at it I witness some very strange behavior: I'm losing pieces of data between MPI_Isends and MPI_Recvs.
I present a snippet of the code below. It basically consists of three stages. First, a processor loops over all elements in a given array. Each element represents a cell in the mesh. The processor checks if the element is being used on other processors. If yes, it does a non-blocking send to that process, using the cell's unique global ID as the tag. If no, it checks the next element, and so on.
Second, the processor loops over all elements again, this time checking whether it needs to update the data in that cell. If it does, the data has already been sent out by another process, so the current process simply does a blocking receive, knowing which process owns the data and the unique global ID of that cell.
Finally, MPI_Waitall is called on the request handles that were stored in the 'req' array during the non-blocking sends.
The issue I'm having is that this entire process completes: there is no hang in the code. But some of the data being received by some of the cells just isn't correct. I check that all the data being sent is right by printing each piece of data prior to the send operation. Note that I'm sending and receiving a slice of an array; each send passes 31 elements. When I print the array on the process that received it, 3 of the 31 elements are garbage, while all the other elements are correct. The strange thing is that it is always the same three elements that are garbage: the first, the second, and the last.
I want to rule out that something is drastically wrong in my algorithm that would explain this. Or perhaps it is related to the cluster I'm working on? As I mentioned, this worked on all the other models I threw at it, using up to 31 cores. I only get this behavior when I try to throw 56 cores at the problem. If nothing pops out as wrong, can you suggest a way to test why certain pieces of a send are not making it to their destination?
send = 1 ! running index into the request array
do i = 1, num_cells
  ! skip cells with data that isn't needed by other processors
  if (.not.needed(i)) cycle
  tag = gid(i)        ! the unique ID of this cell in the entire system
  ghoster = ghosts(i) ! the processor that needs data from this cell
  call MPI_Isend(data(i,1:tot_levels),tot_levels,mpi_datatype,ghoster,tag,MPI_COMM,req(send),mpierr)
  send = send + 1
end do
sends = send-1

do i = 1, num_cells
  ! skip cells that don't need a data update
  if (.not.needed_here(i)) cycle
  tag = gid(i)
  owner_rank = owner(i) ! the processor that owns the data for this cell
  call MPI_Recv(data(i,1:tot_levels),tot_levels,mpi_datatype,owner_rank,tag,MPI_COMM,MPI_STATUS_IGNORE,mpierr)
end do

call MPI_Waitall(sends,req,MPI_STATUSES_IGNORE,mpierr)
Upvotes: 1
Views: 225
Reputation: 1020
I figured out a way to get my code to work, but I'm not entirely sure why it works, so I'm posting the solution here in the hope that somebody can comment on why this is the case and possibly offer a better solution.
As I indicated in my question and as we discussed in the comments, it appeared that pieces of data were being lost between sends/receives. The concept of the buffer is a mystery to me, but I thought that maybe there wasn't enough space to hold my Isends, allowing them to get lost before they could be received. So I swapped the MPI_Isend calls for MPI_Bsend calls. I figured out how big my buffer needed to be using MPI_Pack_size, so I knew I would have ample space for all the messages I send. I then allocated a buffer of that size and attached it with MPI_Buffer_attach. I got rid of the MPI_Waitall, since it is no longer needed, and replaced it with a call to MPI_Buffer_detach.
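Roughly, the new setup looks like the sketch below. The buffer-handling names (num_sends, msg_bytes, buf_bytes, bsend_buf) are placeholders rather than my exact code; num_sends is just a count, taken before the send loop, of how many cells this rank will actually send.

integer :: msg_bytes, buf_bytes
character, allocatable :: bsend_buf(:)

! bytes needed to pack one message of tot_levels values
call MPI_Pack_size(tot_levels, mpi_datatype, MPI_COMM, msg_bytes, mpierr)

! one packed message plus the per-message bookkeeping overhead, for every send
buf_bytes = num_sends * (msg_bytes + MPI_BSEND_OVERHEAD)
allocate(bsend_buf(buf_bytes))
call MPI_Buffer_attach(bsend_buf, buf_bytes, mpierr)

! the send loop is unchanged except that MPI_Isend becomes MPI_Bsend:
! call MPI_Bsend(data(i,1:tot_levels), tot_levels, mpi_datatype, ghoster, tag, MPI_COMM, mpierr)

! ... MPI_Recv loop exactly as before ...

! detaching blocks until every buffered message has been transmitted,
! which is why the MPI_Waitall is no longer needed
call MPI_Buffer_detach(bsend_buf, buf_bytes, mpierr)
deallocate(bsend_buf)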
The code runs without issue and arrives at identical results to the serial case. I'm able to scale the problem size up to what I tried before and it works now. So based on these results, I'd have to assume that pieces of messages were being lost due to insufficient buffer space.
I have concerns about the impact on code performance, so I did a scaling study on different problem sizes. See the image below. The x-axis gives the size of the problem (5 means the problem is 5 times bigger than size 1), and the y-axis gives the time to finish executing the program. Three lines are shown. The blue line is the program run in serial, and the green line extrapolates the size=1 case linearly; together they show that the serial execution time grows linearly with problem size. The red line is the program run in parallel, using a number of processors that matches the problem size (e.g. 2 cores for size=2, 4 cores for size=4, etc.).
You can see that the parallel execution time increases very slowly with problem size, which is expected, except for the largest case. I feel that the poor performance for the largest case is being caused by an increased amount of message buffering, which was not needed in smaller cases.
Upvotes: 0
Reputation: 9082
Is your problem that you're not receiving all of the messages? Note that just because an MPI_SEND or MPI_ISEND completes doesn't mean that the corresponding MPI_RECV was actually posted/completed. The return of the send call only means that the buffer can be reused by the sender; that data may still be buffered internally somewhere on either the sender or the receiver.
If it's critical that you know the message was actually received, you need to use a different variety of send, such as MPI_SSEND or MPI_RSEND (or their nonblocking versions if you prefer). Note that this won't actually solve your problem; it will probably just make it easier for you to figure out which messages aren't showing up.
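For example, in the snippet from the question you could temporarily swap the send for its nonblocking synchronous variant, which takes the same argument list (a sketch reusing the question's variable names):

! completes only once the matching receive has actually been started on the destination rank
call MPI_Issend(data(i,1:tot_levels), tot_levels, mpi_datatype, ghoster, tag, MPI_COMM, req(send), mpierr)

If the MPI_Waitall at the end then hangs, the requests that never complete tell you which messages were never received.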
Upvotes: 1