VishalYadav

Reputation: 21

Error with MPI_Bcast

I am getting an error with MPI_Bcast (I think it is an old one), and I am not sure why it is happening. The error is as follows:

An error occurred in MPI_Bcast    
on communicator MPI_COMM_WORLD  
MPI_ERR_TRUNCATE: message truncated      
MPI_ERRORS_ARE_FATAL: your MPI job will now abort

The code where it happens is:

for (int i = 0; i < nbProcs; i++){
    for (int j = firstLocalGrainRegion; j < lastLocalGrainRegion; j++){
        GrainRegion * grainRegion = microstructure->getGrainRegionAt(j);
        int grainSize = grainRegion->getBoxSize(nb);
        double * newValues;
        if (myId == i)
            newValues = grainRegion->getNewValues();
        else
            newValues = new double[grainSize]; 
        MPI_Bcast(newValues, grainSize, MPI_DOUBLE, i, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);

        if (myId != i)
            grainRegion->setNewValues(newValues);
    }   
}

Upvotes: 1

Views: 4556

Answers (1)

Hristo Iliev

Reputation: 74475

There are two possible reasons for the error.

The first one is that you have a pending previous MPI_Bcast, started somewhere before the outer loop, which did not complete, for example in a manner similar to the one in this question.
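To illustrate the mechanism, here is a contrived, self-contained example (not taken from your program) of how a pending broadcast leads to this error. Collectives on a communicator match in program order, so if rank 0 posts an extra MPI_Bcast that the other ranks never post, their next MPI_Bcast picks up the earlier, larger message and fails with MPI_ERR_TRUNCATE:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int big[100] = {0}, small_buf[10] = {0};

    if (rank == 0)
        MPI_Bcast(big, 100, MPI_INT, 0, MPI_COMM_WORLD); // the "pending" broadcast

    // The non-root ranks match this call against the 100-int broadcast above
    // and try to receive it into a 10-int buffer -> MPI_ERR_TRUNCATE.
    MPI_Bcast(small_buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}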

The second one is a possible buffer size mismatch because of grainRegion->getBoxSize(nb) returning different values in different processes. You can examine the code with a parallel debugger or just put a print statement before the broadcast, for example:

int grainSize = grainRegion->getBoxSize(nb);
printf("i=%d j=%d rank=%02d grainSize=%d\n", i, j, myId, grainSize);

With this particular output format, you should be able to simply run the output through sort and then quickly spot mismatched values. Because of the barrier, which is always synchronising (the broadcast might not necessarily be), it is hardly possible for the different calls to MPI_Bcast to interfere with one another as in the first possible case.

If your data structure is distributed and the correct value of grainSize is indeed only available at the broadcast root process, then you should first notify the other ranks of the correct size. The simplest (but not the most efficient) solution is to broadcast grainSize itself before the data. A better solution is to first perform an MPI_Allgather with the number of grain regions at each process (only if necessary), and then an MPI_Allgatherv with the sizes of each region.
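As a minimal sketch of the simpler option, reusing the loop and the GrainRegion calls from your snippet (all of those names are assumptions taken from the question), the root broadcasts grainSize first so that every rank passes the same count to the data broadcast. It still assumes, as your original loop does, that every rank iterates over the same number of regions for a given root i:

for (int i = 0; i < nbProcs; i++){
    for (int j = firstLocalGrainRegion; j < lastLocalGrainRegion; j++){
        GrainRegion * grainRegion = microstructure->getGrainRegionAt(j);

        // Only the root is guaranteed to know the correct size, so share it first.
        int grainSize = 0;
        if (myId == i)
            grainSize = grainRegion->getBoxSize(nb);
        MPI_Bcast(&grainSize, 1, MPI_INT, i, MPI_COMM_WORLD);

        double * newValues;
        if (myId == i)
            newValues = grainRegion->getNewValues();
        else
            newValues = new double[grainSize];

        // Every rank now uses the same count, so the message cannot be truncated.
        MPI_Bcast(newValues, grainSize, MPI_DOUBLE, i, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);

        if (myId != i)
            grainRegion->setNewValues(newValues);
    }
}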

Upvotes: 2

Related Questions