user-2147482637

Reputation: 2163

MPI Bcast or Scatter to specific ranks

I have an array of data. What I was trying to do is this:

Use rank 0 to bcast data to 50 nodes. Each node has 1 MPI process on it, with 16 cores available to that process. Each MPI process then calls Python multiprocessing, does some calculations, and saves the data that was calculated with multiprocessing. The MPI process then changes some variable and runs multiprocessing again, and so on.

So the nodes do not need to communicate with each other besides the initial startup in which they all receive some data.
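Roughly, what I have now looks like the sketch below (stripped down; the calc function, the dummy data, and the settings loop are just stand-ins for my real code):

from mpi4py import MPI
from multiprocessing import Pool

def calc(args):                  # stand-in for my real per-chunk computation
    chunk, setting = args
    return sum(chunk) * setting

def main():
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # rank 0 builds the array, then every node (one MPI process each) gets a copy
    data = [[i, i + 1.0] for i in range(100)] if rank == 0 else None   # dummy data
    data = comm.bcast(data, root=0)

    with Pool(16) as pool:                    # 16 cores per node
        for setting in (1.0, 2.0, 3.0):       # stand-in for the variable I keep changing
            results = pool.map(calc, [(chunk, setting) for chunk in data])
            # ... here each MPI process saves "results", then loops again

if __name__ == "__main__":
    main()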

The multiprocessing is not working out so well, so now I want to use MPI for everything.

How can I (or is it even possible to) use an array of integers that refers to MPI ranks for a bcast or scatter? For example, with ranks 1-1000 and 12 cores per node, I want to bcast the data to every 12th rank, and then have each of those ranks scatter the data to the next 12 ranks, i.e. the ranks on its own node.

This requires the first bcast to communicate with totalrank/12 processes; each of those ranks would then be responsible for sending data to the ranks on the same node, gathering the results, saving them, and then sending more data to the ranks on the same node.
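In mpi4py terms, something along these lines is what I am trying to figure out (a sketch only; it assumes ranks are laid out 12 per node in consecutive blocks and that the total size is a multiple of 12, and I am not sure this is the right way to do it):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()           # assumed to be a multiple of 12 here

# the 12 consecutive ranks that I assume share a node get their own communicator
node_comm = comm.Split(rank // 12, rank)
node_rank = node_comm.Get_rank()

# a communicator holding only every 12th rank (one "master" per node)
masters_group = comm.Get_group().Incl(list(range(0, size, 12)))
masters_comm = comm.Create(masters_group)    # MPI.COMM_NULL on all other ranks

# dummy data: one small chunk per rank in the whole job
data = [[i] * 3 for i in range(size)] if rank == 0 else None

# step 1: rank 0 -> the node masters (the full array)
if masters_comm != MPI.COMM_NULL:
    data = masters_comm.bcast(data, root=0)

# step 2: each master scatters the 12 chunks belonging to its node
node_data = data[rank:rank + 12] if node_rank == 0 else None
my_chunk = node_comm.scatter(node_data, root=0)
print(rank, my_chunk)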

Upvotes: 0

Views: 1325

Answers (1)

Gilles

Reputation: 9489

I don't know enough mpi4py to give you a code sample with it, but here is what a solution could look like in C++. I'm sure you can easily infer the Python code from it.

#include <mpi.h>
#include <iostream>
#include <cstdlib> /// for abs
#include <zlib.h>  /// for crc32

using namespace std;

int main( int argc, char *argv[] ) {

    MPI_Init( &argc, &argv );
    // get size and rank
    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    // get the compute node name
    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name( name, &len );

    // get a unique positive int from each node name
    // using crc32 from zlib (just a possible solution)
    uLong crc = crc32( 0L, Z_NULL, 0 );
    int color = crc32( crc, ( const unsigned char* )name, len );
    color = abs( color );

    // split the communicator into processes of the same node
    MPI_Comm nodeComm;
    MPI_Comm_split( MPI_COMM_WORLD, color, rank, &nodeComm );

    // get the rank on the node
    int nodeRank;
    MPI_Comm_rank( nodeComm, &nodeRank );

    // create comms of processes of the same local ranks
    MPI_Comm peersComm;
    MPI_Comm_split( MPI_COMM_WORLD, nodeRank, rank, &peersComm );

    // now, masters are all the processes of nodeRank 0
    // they can communicate among them with the peersComm
    // and with their local slaves with the nodeComm
    int worktoDo = 0;
    if ( rank == 0 ) worktoDo = 1000;
    cout << "Initially [" << rank << "] on node "
         << name << " has " << worktoDo << endl;
    MPI_Bcast( &worktoDo, 1, MPI_INT, 0, peersComm );
    cout << "After first Bcast [" << rank << "] on node "
         << name << " has " << worktoDo << endl;
    if ( nodeRank == 0 ) worktoDo += rank;
    MPI_Bcast( &worktoDo, 1, MPI_INT, 0, nodeComm );
    cout << "After second Bcast [" << rank << "] on node "
         << name << " has " << worktoDo << endl;

    // cleaning up
    MPI_Comm_free( &peersComm );
    MPI_Comm_free( &nodeComm );

    MPI_Finalize();
    return 0;
}

As you can see, you first create communicators with the processes on the same node. Then you create peer communicators with all processes of the same local rank on each node. From there, your master process of global rank 0 will send data to the local masters, and they will distribute the work on the node they are responsible for.
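If you want to do the same thing directly from mpi4py, a rough translation of the code above could look like this (untested, since as I said I don't really use mpi4py; take it as a sketch only):

from mpi4py import MPI
import zlib

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# a unique non-negative int per node name (crc32 again, masked so it
# fits in a signed int, as a split color must)
name = MPI.Get_processor_name()
color = zlib.crc32(name.encode()) & 0x7FFFFFFF

# split into communicators of processes on the same node
node_comm = comm.Split(color, rank)
node_rank = node_comm.Get_rank()

# communicators of processes having the same local rank on every node
peers_comm = comm.Split(node_rank, rank)

work_to_do = 1000 if rank == 0 else 0
print("Initially [%d] on node %s has %d" % (rank, name, work_to_do))

# first Bcast: global rank 0 -> the other node masters (via peers_comm)
work_to_do = peers_comm.bcast(work_to_do, root=0)
print("After first Bcast [%d] on node %s has %d" % (rank, name, work_to_do))

# second Bcast: each master -> the processes on its own node
if node_rank == 0:
    work_to_do += rank
work_to_do = node_comm.bcast(work_to_do, root=0)
print("After second Bcast [%d] on node %s has %d" % (rank, name, work_to_do))

# cleaning up
peers_comm.Free()
node_comm.Free()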

Upvotes: 4
