Reputation: 85
I get the error message "An error occurred in MPI_Gather" when I try to gather arrays of type double with more than 750 elements into an array that represents a matrix. The gathered arrays are supposed to represent columns of the matrix, and since the matrix is stored row-major (rows contiguous in memory), I defined a derived datatype describing a column vector and called MPI_Gather like this:
for (i = 0; i < k; i++) {
    MPI_Gather(&Q_vector[i*m], m, MPI_DOUBLE, &Q[i*size], 1, vector_m, 0, MPI_COMM_WORLD);
}
where k is the number of vectors, m is the length of each vector (the number of rows in the matrix), size is the number of processes and vector_m is the derived datatype that is constructed like this:
MPI_Type_vector(m, 1, n, MPI_DOUBLE, &vector_m_type);
MPI_Type_create_resized(vector_m_type, 0, sizeof(double), &vector_m);
MPI_Type_commit(&vector_m);
where n is the number of columns in the matrix.
This works fine as long as m <= 750. If, for example, m = 751 (751 elements of type double), the error occurs; it does not depend on the value of n. I have since changed the algorithm so that the columns of the matrix are stored consecutively in memory, which avoids the derived datatype and thereby the error, but I'm still curious why this happens.
Computer specs:
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
RAM: 8 GB
OS: Windows 10 Home 64-bit
Compiler: gcc 6.4.0
I use Cygwin.
This error message is sometimes printed:
An error occurred in MPI_Gather reported by process [52635822596882433,77309411328] on communicator MPI_COMM_WORLD
MPI_ERR_IN_STATUS: error code in status
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, and potentially your MPI job)
Minimal working example code to reproduce the error:
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int n, m, size, rank, i, j, k;
    double *Q, *Q_vector;
    MPI_Datatype vector_m_type, vector_m;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    m = atoi(argv[1]);
    n = atoi(argv[2]);

    if (rank == 0) {
        Q = (double *)malloc(m*n*sizeof(double));
        for (i = 0; i < m; i++) {
            for (j = 0; j < n; j++) {
                Q[i*n+j] = drand48()*10;
            }
        }
    }

    // k = number of (column) vectors per process
    k = n/size;
    Q_vector = (double *)malloc(k*m*sizeof(double));

    MPI_Type_vector(m, 1, n, MPI_DOUBLE, &vector_m_type);
    MPI_Type_create_resized(vector_m_type, 0, sizeof(double), &vector_m);
    MPI_Type_commit(&vector_m);

    for (i = 0; i < k; i++) {
        MPI_Scatter(&Q[i*size], 1, vector_m, &Q_vector[i*m], m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }

    for (i = 0; i < k; i++) {
        MPI_Gather(&Q_vector[i*m], m, MPI_DOUBLE, &Q[i*size], 1, vector_m, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        printf("Success!\n");
        free(Q);
    }
    free(Q_vector);

    MPI_Finalize();
}
Compiled and run like this:
mpicc -o test MPI_Type_vector_test.c -lmpi -lm
mpirun -np 8 ./test 751 750
Upvotes: 4
Views: 673
Reputation: 8395
This is a known issue in Open MPI that occurs when a collective operation uses datatypes with matching type signatures but different type maps on the two sides (e.g. a single derived vector element on one side and several elementary elements on the other).
The easiest way to work around this issue is to disable the coll/tuned module:
mpirun --mca coll ^tuned -np 8 ./test 751 750
Another option is to rewrite your code and use another derived datatype that describes a row (instead of using m elements).
Upvotes: 5