Reputation: 175
I'm learning MPI and wrote a simple C program that uses MPI_Scatter and MPI_Reduce, as follows:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int mpirank, mpisize;
    int tabsize = atoi(argv[1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

    unsigned long sum = 0;
    unsigned long totalsum = 0;
    int rcvsize = tabsize / mpisize;
    int *rcvbuf = malloc(rcvsize * sizeof(int));
    int *tab = malloc(tabsize * sizeof(int));

    /* Root fills the array with ones, so the expected total is tabsize. */
    if (mpirank == 0) {
        for (int i = 0; i < tabsize; i++) {
            tab[i] = 1;
        }
    }

    MPI_Scatter(tab, rcvsize, MPI_INT, rcvbuf, rcvsize, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < rcvsize; i++) {
        sum += rcvbuf[i];
    }
    printf("%d sum = %lu %d\n", mpirank, sum, rcvsize);

    /* The datatype must match the buffer type: sum is unsigned long,
       so use MPI_UNSIGNED_LONG rather than MPI_INT. */
    MPI_Reduce(&sum, &totalsum, 1, MPI_UNSIGNED_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (mpirank == 0) {
        printf("The totalsum = %lu\n", totalsum);
    }

    free(rcvbuf);
    free(tab);
    MPI_Finalize();
    return 0;
}
The program gives inconsistent results and I don't understand why. For example:
$ mpirun -np 4 03_array_sum 120000000
1 sum = 29868633 30000000
2 sum = 30000000 30000000
0 sum = 30000000 30000000
3 sum = 30000000 30000000
The totalsum = 119868633
Here process 1 didn't count all of the elements given to it by MPI_Scatter.
UPDATE: As @Gilles Gouaillardet suggests in the accepted answer below, I ran the code in a loop thirty times for each configuration: once with $OMPI_MCA_pml unset and once with it set to "^ucx". With the variable unset, 8 out of 30 runs gave wrong values; with it set, all runs were correct. I then ran the same test on Debian GNU/Linux 7 (wheezy) with Open MPI 1.4.5, and all runs were correct even with the variable unset. It looks like something is wrong with Open MPI 4.0.4 and/or Fedora 33.
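For reproducibility, the thirty-run loop was along these lines (a sketch; the binary name `03_array_sum` and the rank count are taken from the example above, and the `awk` field index assumes the `The totalsum = N` output format):

```shell
#!/bin/sh
# Run the example 30 times and count runs whose total is wrong.
# Repeat with OMPI_MCA_pml unset and with OMPI_MCA_pml=^ucx exported.
EXPECTED=120000000
bad=0
for i in $(seq 1 30); do
    total=$(mpirun -np 4 ./03_array_sum $EXPECTED | awk '/totalsum/ {print $4}')
    [ "$total" = "$EXPECTED" ] || bad=$((bad + 1))
done
echo "wrong runs: $bad/30"
```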
Upvotes: 2
Views: 279
Reputation: 8395
I was able to reproduce the issue in the very same environment.
I do not know whether the root cause is within Open MPI or UCX.
Meanwhile, you can
mpirun --mca pml ^ucx ...
or
export OMPI_MCA_pml=^ucx
mpirun ...
or add the following line to /etc/openmpi-x86_64/openmpi-mca-params.conf:
pml = ^ucx
Upvotes: 1