Gianluca Brilli

Reputation: 69

MPI Deadlock with collective functions

I'm writing a simple program in C using MPI. It is meant to do the following:

I have a group of processes that run an iterative loop. At the end of each iteration, all processes in the communicator must call two collective functions (MPI_Allreduce and MPI_Bcast). The first determines the rank of the process that generated the minimum value of num.val, and the second broadcasts from that rank (num_min.idx_v) to all processes in the communicator MPI_COMM_WORLD.

The problem is that I don't know whether a given process will already have finalized before it reaches the collective calls. In each iteration, every process terminates with probability 1/10, which simulates the behaviour of the real program I'm implementing. As soon as the first process terminates, the remaining ones deadlock.

This is the code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

typedef struct double_int{
    double val;
    int idx_v;
}double_int;

int main(int argc, char **argv)
{
    int n = 10;
    int max_it = 4000;
    int proc_id, n_proc;
    double *x = (double *)malloc(n*sizeof(double));

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &n_proc);
    MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);

    srand(proc_id);

    double_int num_min;
    double_int num;

    int k;
    for(k = 0; k < max_it; k++){

        num.idx_v = proc_id;
        num.val = rand()/(double)RAND_MAX;

        if((rand() % 10) == 0){

            printf("iter %d: proc %d terminated\n", k, proc_id);

            MPI_Finalize();
            exit(EXIT_SUCCESS);
        }

        MPI_Allreduce(&num, &num_min, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
        MPI_Bcast(x, n, MPI_DOUBLE, num_min.idx_v, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    exit(EXIT_SUCCESS);
}

Perhaps I should create a new group and a new communicator before calling MPI_Finalize in the if statement? How should I solve this?

Upvotes: 3

Views: 537

Answers (1)

Christian Sarofeen

Reputation: 2250

If you have control over a process before it terminates, have it send a flag, using a non-blocking send, to a rank that cannot terminate early (let's call it the root rank). Then, instead of a blocking MPI_Allreduce, every rank that is still running sends its value to the root rank.

The root rank could post non-blocking receives from each rank for either the flag or the value; every rank has to send one or the other. Once all ranks are accounted for, the root rank can do the reduction itself, remove the exited ranks from further communication, and broadcast the result to the ranks that remain.
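
To make the root-side step concrete, here is a minimal sketch of the bookkeeping only, in plain C with no MPI calls. It assumes the root has already collected one report per active rank (for example with one receive per rank); the rank_report type, the active array, and the function name reduce_active_minloc are hypothetical names made up for this illustration and are not part of MPI. Ties go to the lowest rank, which matches what MPI_MINLOC does.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-iteration report sent by one rank to the root. */
typedef struct {
    bool exiting;   /* true: this rank is about to terminate */
    double val;     /* candidate value; ignored when exiting is true */
} rank_report;

/* Same layout as the pair used with MPI_DOUBLE_INT in the question. */
typedef struct {
    double val;
    int idx_v;
} double_int;

/*
 * Root-side step for one iteration (illustrative, not an MPI routine):
 *   - ranks that announced an exit are marked inactive in active[];
 *   - the minimum value and its rank are taken over the remaining ranks.
 * reports[r] is only read for ranks with active[r] == true, so stale
 * entries of ranks that left earlier are never looked at.
 * Returns idx_v == -1 if no active rank supplied a value.
 */
double_int reduce_active_minloc(const rank_report *reports, bool *active,
                                int n_proc)
{
    assert(reports != NULL && active != NULL);

    double_int best = { DBL_MAX, -1 };
    for (int r = 0; r < n_proc; r++) {
        if (!active[r])
            continue;               /* left in an earlier iteration */
        if (reports[r].exiting) {
            active[r] = false;      /* drop it from future iterations */
            continue;
        }
        /* Strict '<' keeps the lowest rank on ties. */
        if (best.idx_v == -1 || reports[r].val < best.val) {
            best.val = reports[r].val;
            best.idx_v = r;
        }
    }
    return best;
}
```

After this step the root would send the result only to ranks whose active[] entry is still true, using point-to-point sends rather than MPI_Bcast on MPI_COMM_WORLD, since a collective on that communicator would again wait for the ranks that have left.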

If your ranks exit without notice, I am not sure what options you have.

Upvotes: 0
