Ankur Gautam

Reputation: 1422

Huge difference in MPI_Wtime() after using MPI_Barrier()?

This is the relevant part of the code:

    if (rank == 0) {
        temp = 10000;
        var = new char[temp];
        MPI_Send(&temp, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        MPI_Send(var, temp, MPI_BYTE, 1, tag, MPI_COMM_WORLD);
        //MPI_Wait(&req[0],&sta[1]);
    }
    if (rank == 1) {
        MPI_Irecv(&temp, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req[0]);
        MPI_Wait(&req[0], &sta[0]);
        var = new char[temp];
        MPI_Irecv(var, temp, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &req[1]);
        MPI_Wait(&req[0], &sta[0]);
    }
    //I am talking about this MPI_Barrier

    MPI_Barrier(MPI_COMM_WORLD);
    cout << MPI_Wtime() - t1 << endl;
    cout << "hello " << rank << " " << temp << endl;
    MPI_Finalize();
    }

1. When using MPI_Barrier: as expected, all the processes take almost the same amount of time, on the order of 0.02 seconds.

2. When not using MPI_Barrier(): the root process (the one sending the message) waits for some extra time, the value of (MPI_Wtime() - t1) varies a lot, and the time taken by the root process is on the order of 2 seconds.

If I am not mistaken, MPI_Barrier is only used to bring all the running processes to the same point. So why isn't the time around 2 seconds when I use MPI_Barrier(), since the slowest process (i.e. the root process) should set the minimum for everyone? Please explain.

Upvotes: 1

Views: 1828

Answers (3)

Hristo Iliev

Reputation: 74455

Thanks to Wesley Bland for noticing that you are waiting twice on the same request. Here is an explanation of what actually happens.

There is something called progression of asynchronous (non-blocking) operations in MPI. That is when the actual transfer happens. Progression could happen in many different ways and at many different points within the MPI library. When you post an asynchronous operation, its progression could be deferred indefinitely, even until the point that one calls MPI_Wait, MPI_Test or some call that would result in new messages being pushed to or pulled from the transmit/receive queue. That's why it is very important to call MPI_Wait or MPI_Test as quickly as possible after the initiation of a non-blocking operation.
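
As an illustration (my own sketch, not code from the question), a common pattern is to post the non-blocking receive and then poll it with MPI_Test, so the library gets regular opportunities to progress the transfer while the process does other work; the buffer, count, tag and do_other_work() below are placeholders:

    MPI_Request req;
    MPI_Status  sta;
    int done = 0;
    // Post the receive; on its own, the transfer may not make progress.
    MPI_Irecv(buf, count, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &req);
    while (!done) {
        do_other_work();               // placeholder for useful computation
        MPI_Test(&req, &done, &sta);   // each call lets MPI progress the transfer
    }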

Open MPI supports a background progression thread that takes care to progress the operations even if the condition in the previous paragraph is not met, e.g. if MPI_Wait or MPI_Test is never called on the request handle. This has to be explicitly enabled when the library is being built. It is not enabled by default since background progression increases the latency of the operations.

What happens in your case is that you are waiting on the incorrect request the second time you call MPI_Wait in the receiver, and therefore the progression of the second MPI_Irecv operation is postponed. The message is more than 40 KiB in size (10000 times 4 bytes + envelope overhead), which is above the default eager limit in Open MPI (32 KiB). Such messages are sent using the rendezvous protocol, which requires both the send and the receive operations to be posted and progressed. The receive operation doesn't get progressed, and hence the send operation in rank 0 blocks until, at some point, the clean-up routines called by MPI_Finalize in rank 1 eventually progress the receive.
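
For reference, here is a sketch of the receiver block with the second wait pointed at the second request (same variables as in your code):

    if (rank == 1) {
        MPI_Irecv(&temp, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req[0]);
        MPI_Wait(&req[0], &sta[0]);
        var = new char[temp];
        MPI_Irecv(var, temp, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &req[1]);
        MPI_Wait(&req[1], &sta[1]);    // wait on the second request, not req[0] again
    }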

When you put the call to MPI_Barrier, it leads to the progression of the outstanding receive, acting almost like an implicit call to MPI_Wait. That's why the send in rank 0 completes quickly and both processes move on in time.

Note that MPI_Irecv immediately followed by MPI_Wait is equivalent to simply calling MPI_Recv. The latter is not only simpler, but also less prone to simple typos like the one you've made.
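
For example, the receiver could be written with blocking receives only (a sketch using the question's variables):

    if (rank == 1) {
        MPI_Recv(&temp, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &sta[0]);
        var = new char[temp];
        MPI_Recv(var, temp, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &sta[1]);
    }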

Upvotes: 3

SteVwonder

Reputation: 175

In the tests that I have run, I see almost no difference in the runtimes. The main difference is that you seem to run your code once, whereas I looped over it thousands of times and then took the average; a sketch of that kind of loop follows the output. My output is below:

    With the barrier
    [0]: 1.65071e-05
    [1]: 1.66872e-05
    Without the barrier
    [0]: 1.35653e-05
    [1]: 1.30711e-05
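
For context, a rough sketch of that kind of averaged measurement (the iteration count and the run_exchange() wrapper are my own placeholders, not the actual test harness):

    const int iters = 10000;                 // assumed number of repetitions
    double t1 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        run_exchange(rank);                  // placeholder wrapping the posted send/receive code
    }
    double avg = (MPI_Wtime() - t1) / iters; // average time per iteration
    cout << "[" << rank << "]: " << avg << endl;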

So I would assume any variation you are seeing is a result of your operating system more than your program.

Also, why are you using MPI_Irecv coupled with an MPI_Wait rather than just using MPI_Recv?

Upvotes: 0

Wesley Bland

Reputation: 9072

You're waiting on the same request twice for your MPI_Irecv's. The second one is the one that would take all of the time, and since it's getting skipped, rank 0 is getting way ahead.
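
As a general pattern (a generic sketch, separate from the size-then-data dependency in the question's code), completing several outstanding requests with MPI_Waitall makes it harder to accidentally reuse one request handle; the buffers, counts and tags here are placeholders:

    MPI_Request req[2];
    MPI_Status  sta[2];
    MPI_Irecv(buf_a, count_a, MPI_BYTE, 0, tag_a, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(buf_b, count_b, MPI_BYTE, 0, tag_b, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, sta);   // completes both requests; no handle is reused by mistake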

MPI_BARRIER can be implemented such that some processes can leave the algorithm before the rest of the processes enter it. That's probably what's happening here.

Upvotes: 0
