user1337

Reputation: 175

MPI returns incorrect results for one of the processes

I'm learning MPI now and wrote a simple C program that uses MPI_Scatter and MPI_Reduce as follows:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        int mpirank, mpisize;
        int tabsize = atoi(*(argv + 1));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

        unsigned long int sum = 0;
        int rcvsize = tabsize / mpisize;
        int *rcvbuf = malloc(rcvsize * sizeof(int));
        int *tab = malloc(tabsize * sizeof(int));
        int totalsum = 0;

        /* Rank 0 fills the whole array with ones. */
        if(mpirank == 0){
                for(int i=0; i < tabsize; i++){
                        *(tab + i) = 1;
                }
        }

        /* Every rank receives an equal chunk of tab. */
        MPI_Scatter(tab, tabsize/mpisize, MPI_INT, rcvbuf, tabsize/mpisize, MPI_INT, 0, MPI_COMM_WORLD);

        /* Each rank sums its own chunk. */
        for(int i=0; i < tabsize/mpisize; i++){
                sum += *(rcvbuf + i);
        }

        printf("%d sum = %ld %d\n", mpirank, sum, tabsize/mpisize);
        MPI_Reduce(&sum, &totalsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if(mpirank == 0){
                printf("The totalsum = %li\n", totalsum);
        }

        MPI_Finalize();

        return 0;
}

The program gives inconsistent results and I don't understand why. For example:

$ mpirun -np 4 03_array_sum 120000000
1 sum = 29868633 30000000
2 sum = 30000000 30000000
0 sum = 30000000 30000000
3 sum = 30000000 30000000
The totalsum = 119868633

Here, process 1 did not count all of the elements given to it by MPI_Scatter.
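
To rule out my own summation loop, a stripped-down check along the same lines (just a sketch with made-up names, assuming tabsize is divisible by the number of processes) would be to count, right after MPI_Scatter, how many received elements are not 1:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Diagnostic sketch: scatter an array of ones and report, per rank,
   how many received elements are NOT 1. A non-zero count means the
   data was already wrong when MPI_Scatter delivered it. */
int main(int argc, char **argv)
{
        int rank, size;
        int tabsize = atoi(*(argv + 1));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = tabsize / size;
        int *tab = NULL;
        int *rcvbuf = malloc(chunk * sizeof(int));

        if(rank == 0){
                tab = malloc(tabsize * sizeof(int));
                for(int i=0; i < tabsize; i++){
                        tab[i] = 1;
                }
        }

        MPI_Scatter(tab, chunk, MPI_INT, rcvbuf, chunk, MPI_INT, 0, MPI_COMM_WORLD);

        int bad = 0;
        for(int i=0; i < chunk; i++){
                if(rcvbuf[i] != 1) bad++;
        }

        printf("rank %d: %d of %d received elements are not 1\n", rank, bad, chunk);

        free(rcvbuf);
        free(tab);
        MPI_Finalize();
        return 0;
}

Running it the same way (e.g. mpirun -np 4 ./scatter_check 120000000, where scatter_check is just whatever the binary is called) should show whether the lost elements come from the scatter itself rather than from the summation.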

UPDATE: As @Gilles Gouaillardet wrote in the accepted answer below, I ran the code in a loop thirty times for both configurations: with $OMPI_MCA_pml left empty and with it set to "^ucx". With the variable empty, 8 out of the 30 runs gave wrong values; with it set, all runs were correct. I then ran the same test on Debian GNU/Linux 7 (wheezy) with Open MPI 1.4.5, and all runs were correct with the variable empty. It looks like something is wrong with Open MPI 4.0.4 and/or Fedora 33.

Upvotes: 2

Views: 279

Answers (1)

Gilles Gouaillardet

Reputation: 8395

I was able to reproduce the issue in the very same environment.

I do not know whether the root cause is within Open MPI or UCX.

Meanwhile, you can

mpirun --mca pml ^ucx ...

or

export OMPI_MCA_pml=^ucx
mpirun ...

or add the following line to /etc/openmpi-x86_64/openmpi-mca-params.conf:

pml = ^ucx

Upvotes: 1
