simon_tulia

Reputation: 397

Cannot cancel MPI requests after testing them

I am trying to build a simulation with the Boost.MPI library, but I have run into a problem with asynchronous communication between processes. There are two processes that send messages to and receive messages from each other (using isend and irecv). If I wait for all send/receive operations to complete, everything is OK. This is my working code:

boost::mpi::communicator* comm;
// Initialize MPI and etc.
...

std::vector<boost::mpi::request> sendRequests;
std::vector<boost::mpi::request> receiveRequests;

for(int i=0; i< 10; i++){
    receiveRequests.push_back(comm->irecv(0, 3000, receivedMessage));
    sendRequests.push_back(comm->isend(1, 3000, sentMessage));

    boost::mpi::wait_all(receiveRequests.begin(), receiveRequests.end());
    receiveRequests.clear();
}

However, I want to cancel a receive if it takes too long. So I test whether the communication has completed and cancel it if it has not, using the test and cancel functions. I modified my code as follows:

boost::mpi::communicator* comm;
// Initialize MPI and etc.
...

std::vector<boost::mpi::request> sendRequests;
std::vector<boost::mpi::request> receiveRequests;

for(int i=0; i< 10; i++){
    receiveRequests.push_back(comm->irecv(0, 3000, receivedMessage));
    sendRequests.push_back(comm->isend(1, 3000, sentMessage));

    vector<boost::mpi::request>::iterator it = receiveRequests.begin();
    while(it != receiveRequests.end()){
        if(!((*it).test()))
            (*it).cancel();     
        receiveRequests.erase(it);
    }
}

Now, my program crashes and I get this error after the first iteration of the loop:

terminate called after throwing an instance of 'std::length_error'
what():  vector::_M_fill_insert
terminate called after throwing an instance of 'std::bad_alloc'
what():  std::bad_alloc
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
what():  MPI_Test: Message truncated, error stack:
PMPI_Test(168)....................: MPI_Test(request=0x13bba24, flag=0x7fff081a7bd4, status=0x7fff081a7ba0) failed
MPIR_Test_impl(63)................: 
MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 3000 truncated; 670 bytes received but buffer size is 577

So, I'd like to know how to resolve this error.

Upvotes: 1

Views: 556

Answers (2)

simon_tulia

Reputation: 397

Finally, I figured it out. It was caused by a race condition between the test and cancel methods. There are hundreds of message requests at run time, so this situation occasionally occurs: a request completes after test has returned but before cancel is called, and the program can then no longer cancel it. That is why the crash occurs irregularly. So I had to change my approach and remove the call to cancel.
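For readers who still want a timeout, one possible approach (a sketch only, not necessarily what was done here) is to poll against a deadline instead of cancelling on the first failed test. The helper name poll_with_deadline and its parameters are hypothetical; it takes any callable, so it does not depend on MPI:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: calls `is_done` repeatedly until it returns true
// or `timeout` has elapsed. Returns true if the operation finished in time.
template <class Pred>
bool poll_with_deadline(Pred is_done,
                        std::chrono::milliseconds timeout,
                        std::chrono::milliseconds step = std::chrono::milliseconds(1)) {
    assert(timeout.count() >= 0);  // a negative timeout makes no sense here
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (is_done()) return true;         // completed within the time limit
        std::this_thread::sleep_for(step);  // back off briefly between polls
    }
    return is_done();  // final check: it may have completed during the last sleep
}
```

With Boost.MPI the predicate could wrap the request, for example a lambda returning whether req.test() yields a status. If the deadline passes and you call cancel, the request can still complete in between, so it seems safer to call wait afterwards and check the returned status with cancelled() rather than assume the cancellation took effect.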

Upvotes: 0

sehe

Reputation: 393593

Where does the iterator it come from? It is not declared anywhere in the code shown.

Note that push_back could reallocate and this invalidates any pending iterators.

Also note that you need to increment it conditionally: advance it only when you did not erase the element, because erase already returns the next valid iterator. The typical pattern is

 it = receiveRequests.erase(it);
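As a minimal sketch of that pattern with the conditional increment, here is a version that uses plain ints as a hypothetical stand-in for requests (a negative value plays the role of "finished"); the function name drop_finished is made up for illustration:

```cpp
#include <cassert>
#include <vector>

// Removes every "finished" element (modelled here as a negative int)
// while iterating, without ever using an invalidated iterator.
std::vector<int> drop_finished(std::vector<int> pending) {
    auto it = pending.begin();
    while (it != pending.end()) {
        if (*it < 0) {
            it = pending.erase(it);  // erase returns the next valid iterator
        } else {
            ++it;                    // advance only when nothing was erased
        }
    }
    assert(it == pending.end());     // the loop stops exactly at end()
    return pending;
}
```

Calling drop_finished({1, -2, 3, -4}) leaves the elements 1 and 3. If every element is erased unconditionally, as in the question, the else branch is simply never needed.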

Update: I see you have added information to the question. It should probably be:

vector<boost::mpi::request>::iterator it = receiveRequests.begin();
while(it != receiveRequests.end()){
    if(!((*it).test()))
        (*it).cancel();     
    it = receiveRequests.erase(it);
}

I'm not sure why you always erase every receive request, whether or not it completed. I'm assuming that's the intent.

Upvotes: 1
