Mark Mc

Reputation: 119

Inconsistent behavior from MPI_Gather

I am running an MPI C++ program locally with two processes: mpirun -np 2 <programname>. I am seeing inconsistent behavior from the MPI_Gather command. To test, I wrote a very short code snippet. When I copied it to the start of main, it worked fine; but when I copied it to other points in the code, it sometimes gave the correct result and sometimes did not. The snippet is copied below. I doubt the issue is with the snippet itself, since it sometimes works properly. Typically, when I see inconsistent behavior like this, I suspect memory corruption. However, I have run Valgrind in this case and it did not report anything amiss (although maybe I am not running Valgrind correctly for MPI - I am not experienced using Valgrind on MPI programs). What could be causing this type of inconsistent behavior, and what can I do to detect the problem?

Here is the code snippet.

double val[2] = {0, 1};
val[0] += 10.0*double(gmpirank);
val[1] += 10.0*double(gmpirank);
double recv[4];
printdebug("send", val[0], val[1]);
int err = MPI_Gather(val, 2, MPI_DOUBLE, recv, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if (gmpirank == 0) {
    printdebug("recv");
    printdebug(recv[0], recv[1]);
    printdebug(recv[2], recv[3]);
}
printdebug("finished test", err);

The printdebug function writes to a file (a separate file for each process) and separates its arguments with commas.
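For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the test. In my real program, gmpirank is a global set from MPI_Comm_rank and printdebug writes to a per-process file; the stand-ins below only mimic that behavior and are not the original definitions.

// Minimal self-contained version of the snippet.
// gmpirank and the per-process debug file are stand-ins for the
// globals/helpers used in the real program.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int gmpirank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &gmpirank);

    // Stand-in for printdebug: one debug file per process.
    char fname[64];
    std::snprintf(fname, sizeof(fname), "debug_rank%d.txt", gmpirank);
    std::FILE* dbg = std::fopen(fname, "w");

    double val[2] = {0, 1};
    val[0] += 10.0 * double(gmpirank);
    val[1] += 10.0 * double(gmpirank);
    double recv[4];

    std::fprintf(dbg, "send, %g, %g\n", val[0], val[1]);
    int err = MPI_Gather(val, 2, MPI_DOUBLE, recv, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (gmpirank == 0) {
        std::fprintf(dbg, "recv\n");
        std::fprintf(dbg, "%g, %g\n", recv[0], recv[1]);
        std::fprintf(dbg, "%g, %g\n", recv[2], recv[3]);
    }
    std::fprintf(dbg, "finished test, %d\n", err);

    std::fclose(dbg);
    MPI_Finalize();
    return 0;
}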

Process 1 prints:

send, 10, 11
finished test, 0

Sometimes, Process 0 prints:

send, 0, 1
recv
0, 1
10, 11
finished test, 0

But when I place the snippet in other sections of the code, Process 0 sometimes prints something like this:

send, 0, 1
recv
0, 1
2.9643938750474793e-322, 0
finished test, 0

Upvotes: 1

Views: 216

Answers (1)

Mark Mc

Reputation: 119

I found the solution. As suspected, the problem was memory corruption.

I made a beginner mistake when running Valgrind with MPI. I ran:

valgrind <options> mpirun -np 2 <programname>

instead of

mpirun -np 2 valgrind <options> <programname>

Thus, I was running Valgrind on mpirun itself, not on the intended program. When I ran Valgrind correctly, it identified the memory corruption in an unrelated part of the code.
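For reference, an invocation along these lines (standard Valgrind options; adjust as needed) also keeps each rank's report in its own file, since %p is expanded to the process ID:

mpirun -np 2 valgrind --leak-check=full --track-origins=yes \
    --log-file=valgrind.%p.log <programname>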

Kudos to another Stack Overflow Q&A for helping me figure this out: Using valgrind to spot error in mpi code

Upvotes: 1
