Reputation: 37
I am trying to write a C++ program using MPI, in which each rank sends a matrix to rank 0. When the matrix size is relatively small, the code works perfectly. However, when the matrix size becomes big, the code starts to give a strange error that only happens when I use a specific number of CPUs.
If you feel the full code is too long, please skip directly to the minimal example below.
To avoid overlooking anything, I give the full source code here:
#include <iostream>
#include <mpi.h>
#include <cmath>
int world_size;
int world_rank;
MPI_Comm comm;
int m, m_small, m_small2;
int index(int row, int column)
{
return m * row + column;
}
int index3(int row, int column)
{
return m_small2 * row + column;
}
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
m = atoi(argv[1]); //Size
int ndims = 2;
int *dims = new int[ndims];
int *period = new int[ndims];
int *coords = new int[ndims];
for (int i=0; i<ndims; i++) dims[i] = 0;
for (int i=0; i<ndims; i++) period[i] = 0;
for (int i=0; i<ndims; i++) coords[i] = 0;
MPI_Dims_create(world_size, ndims, dims);
MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, period, 0, &comm);
MPI_Cart_coords(comm, world_rank, ndims, coords);
double *a, *a_2;
if (0 == world_rank) {
a = new double [m*m];
for (int i=0; i<m; i++) {
for (int j=0; j<m; j++) {
a[index(i,j)] = 0;
}
}
}
/*m_small is along the vertical direction, m_small2 is along the horizontal direction*/
//The upper cells take the remainder of the total lattice points along the vertical direction divided by the cell count along that direction
if (0 == coords[0]){
m_small = int(m / dims[0]) + m % dims[0];
}
else m_small = int(m / dims[0]);
//The left cells take the remainder of the total lattice points along the horizontal direction divided by the cell count along that direction
if (0 == coords[1]) {
m_small2 = int(m / dims[1]) + m % dims[1];
}
else m_small2 = int(m / dims[1]);
double *a_small = new double [m_small * m_small2];
/*Initialization of matrix*/
for (int i=0; i<m_small; i++) {
for (int j=0; j<m_small2; j++) {
a_small[index3(i,j)] = 2.5 ;
}
}
if (0 == world_rank) {
a_2 = new double[m_small*m_small2];
for (int i=0; i<m_small; i++) {
for (int j=0; j<m_small2; j++) {
a_2[index3(i,j)] = 0;
}
}
}
int loc[2];
int m1_rec, m2_rec;
MPI_Request send_req;
MPI_Isend(coords, 2, MPI_INT, 0, 1, MPI_COMM_WORLD, &send_req);
//This Isend may have a problem!
MPI_Isend(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req);
if (0 == world_rank) {
for (int i = 0; i < world_size; i++) {
MPI_Recv(loc, 2, MPI_INT, i, 1, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
/*Determine the size of matrix for receiving the information*/
if (0 == loc[0]) {
m1_rec = int(m / dims[0]) + m % dims[0];
} else {
m1_rec = int(m / dims[0]);
}
if (0 == loc[1]) {
m2_rec = int(m / dims[1]) + m % dims[1];
} else {
m2_rec = int(m / dims[1]);
}
//This receive may have a problem!
MPI_Recv(a_2, m1_rec * m2_rec, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
}
}
delete[] a_small;
if (0 == world_rank) {
delete[] a;
delete[] a_2;
}
delete[] dims;
delete[] period;
delete[] coords;
MPI_Finalize();
return 0;
}
Basically, the code reads an input value m and then constructs a big matrix of size m x m. MPI creates a 2D topology according to the number of CPUs, which divides the big matrix into sub-matrices. The size of each sub-matrix is m_small x m_small2. There should be no problem in these steps.
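To make the decomposition concrete, here is a minimal, self-contained sketch that reproduces only the size arithmetic above (the helper local_extent is just for illustration, it is not part of my program):
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
// Local block size along one direction: the first cell in that direction
// additionally takes the remainder of the division, as in the code above.
static int local_extent(int m, int cells, int coord)
{
    return m / cells + (coord == 0 ? m % cells : 0);
}
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int m = (argc > 1) ? std::atoi(argv[1]) : 183;
    int dims[2] = {0, 0}, period[2] = {0, 0}, coords[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, period, 0, &cart);
    MPI_Cart_coords(cart, rank, 2, coords);
    int rows = local_extent(m, dims[0], coords[0]);
    int cols = local_extent(m, dims[1], coords[1]);
    std::printf("rank %d at (%d,%d): %d x %d block, %d doubles\n",
                rank, coords[0], coords[1], rows, cols, rows * cols);
    MPI_Finalize();
    return 0;
}
For mpirun -np 2 ./a.out 183, MPI_Dims_create typically returns dims = {2, 1}, so the two blocks are 92 x 183 and 91 x 183; note that 91 * 183 * 8 = 133224 bytes, which matches the "expected 133224" in the error output quoted further down.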
The problem happens when I send the sub-matrix from each rank to rank 0 using MPI_Isend(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req); and receive it with MPI_Recv(a_2, m1_rec * m2_rec, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);.
For example, when I run the code with the command mpirun -np 2 ./a.out 183, I get the following error:
Read -1, expected 133224, errno = 14
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x7fb23b485010
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node dx1-500-24164 exited on signal 11 (Segmentation fault).
Strangely, if I change the number of CPUs or decrease the value of the input argument, the problem goes away. Also, if I just comment out the MPI_Isend/MPI_Recv calls, there is no problem either.
So I am really wondering how to solve this problem.
Edit 1: Here is a minimal example to reproduce the problem. When the matrix size is small, there is no problem, but the problem appears when you increase the matrix size (at least for me):
#include <iostream>
#include <mpi.h>
#include <cmath>
int world_size;
int world_rank;
MPI_Comm comm;
int m, m_small, m_small2;
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
m = atoi(argv[1]); //Size
double *a_2;
//Please increase the size of m_small and m_small2 and wait for the problem to happen
m_small = 100;
m_small2 = 200;
double *a_small = new double [m_small * m_small2];
if (0 == world_rank) {
a_2 = new double[m_small*m_small2];
}
MPI_Request send_req;
MPI_Isend(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req);
if (0 == world_rank) {
for (int i = 0; i < world_size; i++) {
MPI_Recv(a_2, m_small*m_small2, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
}
}
delete[] a_small;
if (0 == world_rank) {
delete[] a_2;
}
MPI_Finalize();
return 0;
}
Command to run: mpirun -np 2 ./a.out 183
(The input argument is not actually used by the code this time.)
Upvotes: 0
Views: 345
Reputation: 1055
The problem is in the line
MPI_Isend(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req);
MPI_Isend is a non-blocking send (which you pair with a blocking MPI_Recv), so when it returns, the library may still be using a_small until you wait for the send to complete, e.g. with MPI_Wait(&send_req, MPI_STATUS_IGNORE); only then are you free to reuse a_small. As written, you delete a_small while it may still be in use by the non-blocking send, which likely causes an access to freed memory and can lead to the segfault and crash. Try a blocking send like this:
MPI_Send(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
This returns only once a_small can be reused (including by deleting it), though the data may not yet have been received by the receivers at that point; it may instead be held in an internal temporary buffer.
Upvotes: 2