Why doesn't MPI_Barrier release workers on different nodes at the same time?

Question

I used the following piece of code to synchronize worker processes running on different machines:

MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
printf("RANK: %d starts at %lld sec, %lld usec\n", rank, (long long)time[0].tv_sec, (long long)time[0].tv_usec);

When I run two tasks on the same node, the start times are quite close:

RANK: 0 starts at 1379381886 sec, 27296 usec
RANK: 1 starts at 1379381886 sec, 27290 usec

However, when I run two tasks across two different nodes, the start times differ more:

RANK: 0 starts at 1379381798 sec, 720113 usec
RANK: 1 starts at 1379381798 sec, 718676 usec

Is this difference reasonable, or does it imply some communication issue between the nodes?

willeM_ Van Onsem · Accepted Answer

A barrier means that the different nodes synchronize, which they do by exchanging messages. However, once a node has received a message from every other node saying that it has reached the barrier, that node continues. There is no reason for it to wait any longer before executing code: barriers are mainly used to guarantee that, for instance, all nodes have processed their data, not to synchronize nodes in time.

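One can never synchronize nodes perfectly in time. Only a clock synchronization protocol such as the Network Time Protocol (NTP), or its simplified variant SNTP, can keep the clocks approximately equal. Note that gettimeofday reads each node's local clock, so timestamps taken on different nodes also include whatever offset remains between those clocks.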

A barrier exists to guarantee that the code before the barrier has been executed on every node before any node starts executing what comes after it.

For instance let's say all nodes execute the following code:

MethodA();
MPI_Barrier(MPI_COMM_WORLD);
MethodB();

Then you can be sure that if a node executes MethodB, all other nodes have finished MethodA. However, you know nothing about how much of MethodB they have already processed.

Latency has a large influence on when each node leaves the barrier. Say, for instance, that machineA is somehow faster than machineB (we assume a world of two machines; the time difference can be caused by caching, among other things). When machineA reaches the barrier, it sends a message to machineB and waits for a message from machineB saying that machineB has reached the barrier as well. Some time later machineB reaches the barrier and sends its message to machineA. machineB, however, can immediately continue processing data, since it has already received machineA's message. machineA must wait until machineB's message arrives. Of course this message will arrive quite soon, but it causes some time difference.

Furthermore, it is not guaranteed that the message will be received correctly the first time. If machineA does not confirm that it received the message, machineB will resend it after a while, causing more delay. On a LAN, however, packet loss is very unlikely.

One can thus say that the transmission time (latency) contributes to the difference between the start times.
