Reputation: 47
I have a problem with synchronization of values shared through an MPI shared-memory window. The reason for using shared memory is that the data structure is too large to keep a copy on every process, but the calculation of its elements needs to be distributed. So the idea is to have only one copy of the structure per node.
Here is a simplified version of the code, containing the minimal subset that should describe the problem. I skip the part where I do the synchronization between nodes.
I have two problems:
I've tried active target synchronization (MPI_Win_fence()) as well, but the same problems occur. Since I don't have much experience with this, it could be that I'm simply using the wrong approach.
MPI_Comm nodecomm;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                    MPI_INFO_NULL, &nodecomm);
MPI_Comm_size(nodecomm, &nodesize);
MPI_Comm_rank(nodecomm, &noderank);

// only noderank 0 allocates the shared array; the other ranks allocate size 0
int local_xx_size = 0;
if (noderank == 0){
    local_xx_size = xx_size;
}

MPI_Win win_xx;
MPI_Aint winsize;
int windisp;
double *xx, *local_xx;
MPI_Win_allocate_shared(local_xx_size*sizeof(double), sizeof(double),
                        MPI_INFO_NULL, nodecomm, &local_xx, &win_xx);

// everyone else queries a pointer to noderank 0's segment
xx = local_xx;
if (noderank != 0){
    MPI_Win_shared_query(win_xx, 0, &winsize, &windisp, &xx);
}
//init xx
if (noderank == 0){
    MPI_Win_lock_all(0, win_xx);
    for (i = 0; i < xx_size; i++){
        xx[i] = 0.0;
    }
    MPI_Win_unlock_all(win_xx);
}
MPI_Barrier(nodecomm);
long counter = 0;
for (i = 0; i < largeNum; i++) {
    //some calculations
    for (j = 0; j < xx_size; j++) {
        //calculate res
        MPI_Win_lock_all(0, win_xx);
        xx[counter] += res; //update value
        MPI_Win_unlock_all(win_xx);
    }
}
MPI_Barrier(nodecomm);

//use xx (sync data from all the nodes)
MPI_Win_free(&win_xx);
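For reference, the fence-based (active target) variant I mentioned looked roughly like this, with fences delimiting the whole update phase instead of the per-update lock/unlock (sketch only, not the exact code):

MPI_Win_fence(0, win_xx);
for (i = 0; i < largeNum; i++) {
    //some calculations
    for (j = 0; j < xx_size; j++) {
        //calculate res
        xx[counter] += res; //update value, still not atomic between processes
    }
}
MPI_Win_fence(0, win_xx);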
I would appreciate any help and suggestions regarding these problems.
Upvotes: 1
Views: 787
Reputation: 5662
MPI lock/unlock do not themselves make updates atomic.
You shouldn't use lock/unlock more than necessary anyway. Use flush instead. Only lock and unlock the window when you allocate and free it.
You can get atomicity using the MPI accumulate functions (Accumulate, Get_accumulate, Fetch_and_op, Compare_and_swap) or, in the case of shared memory only, the atomic primitives provided by your compiler. Since the latter is a bit tricky with C11/C++11, because they require special atomic types, the example below uses intrinsics that are supported by most if not all common compilers.
I do not know if this is correct. It merely demonstrates the concepts noted above.
MPI_Comm nodecomm;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                    MPI_INFO_NULL, &nodecomm);
MPI_Comm_size(nodecomm, &nodesize);
MPI_Comm_rank(nodecomm, &noderank);

int local_xx_size = 0;
if (noderank == 0){
    local_xx_size = xx_size;
}

MPI_Win win_xx;
MPI_Aint winsize;
int windisp;
double *xx, *local_xx;
MPI_Win_allocate_shared(local_xx_size*sizeof(double), sizeof(double),
                        MPI_INFO_NULL, nodecomm, &local_xx, &win_xx);

// lock the window once, right after allocation; flush inside the loop instead
MPI_Win_lock_all(0, win_xx);

xx = local_xx;
if (noderank != 0){
    MPI_Win_shared_query(win_xx, 0, &winsize, &windisp, &xx);
}
//init xx
if (noderank == 0){
    for (i = 0; i < xx_size; i++){
        xx[i] = 0.0;
    }
}
MPI_Barrier(nodecomm);
long counter = 0;
for (i = 0; i < largeNum; i++) {
    //some calculations
    for (j = 0; j < xx_size; j++) {
        //calculate res
        // xx[counter] += res; //update value
#ifdef USE_RMA_ATOMICS
        // only noderank 0 allocated memory, so the whole array lives in
        // rank 0's segment of the window: target rank 0, displacement = index
        int target = 0;
        MPI_Aint disp = counter;
        MPI_Accumulate(&res, 1, MPI_DOUBLE, target, disp, 1, MPI_DOUBLE,
                       MPI_SUM, win_xx);
        MPI_Win_flush(target, win_xx);
#else
        // note: these builtins only work on integer types, so this branch
        // assumes an integer element type; for double data use the RMA
        // accumulate branch above (or a compare-and-swap loop)
# ifdef USE_NEWER_INTRINSICS // GCC, Clang, Intel support this AFAIK
        __atomic_fetch_add(&xx[counter], res, __ATOMIC_RELAXED);
# else // GCC, Clang, Intel, IBM support this AFAIK
        __sync_fetch_and_add(&xx[counter], res);
# endif
#endif
    }
}
MPI_Barrier(nodecomm);
MPI_Win_unlock_all(win_xx);
MPI_Win_free(&win_xx);
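If the old value is also needed, MPI_Fetch_and_op (one of the accumulate-family calls named above) does the same atomic update and returns the previous contents in one call. A minimal sketch of the inner update, assuming res and xx are double and, as above, that the whole array lives in noderank 0's window segment; I have not tested it:

double old;
MPI_Fetch_and_op(&res, &old, MPI_DOUBLE, 0 /*target rank*/,
                 (MPI_Aint) counter /*displacement*/, MPI_SUM, win_xx);
MPI_Win_flush(0, win_xx);
// old now holds xx[counter] as it was before res was added

Fetch_and_op operates on a single element of a predefined datatype, so MPI_DOUBLE with MPI_SUM is fine here.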
Upvotes: 1