Reputation: 415
Recently I have been working on implementing parallel LBM methods. I found that building an indexed MPI datatype (via MPI_Type_indexed) before the streaming step might actually introduce additional communication overhead. For example, when I use MPI_Type_indexed to define a new MPI datatype that picks certain parts out of the domain, and those parts are arbitrarily distributed or the block count is relatively small, it seems to cause extra overhead.
Therefore, I would like to ask experienced parallel programmers whether my understanding is correct.
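Here is a minimal sketch of what I mean; the block lengths and displacements are just placeholders for illustration, not my real domain decomposition:

```c
/* Sketch: an indexed datatype that picks non-contiguous blocks of
 * distribution values out of a 1-D array f[]. Run with 2 processes.
 * Block lengths and displacements are made up for illustration. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double f[100];
    for (int i = 0; i < 100; ++i)
        f[i] = rank * 100.0 + i;

    /* Three scattered blocks of the domain to be communicated. */
    int blocklengths[3]  = { 4, 4, 4 };
    int displacements[3] = { 0, 40, 80 };

    MPI_Datatype halo_type;
    MPI_Type_indexed(3, blocklengths, displacements, MPI_DOUBLE, &halo_type);
    MPI_Type_commit(&halo_type);

    /* Exchange the scattered blocks between rank 0 and rank 1. */
    if (rank == 0)
        MPI_Send(f, 1, halo_type, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(f, 1, halo_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&halo_type);
    MPI_Finalize();
    return 0;
}
```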
Upvotes: 1
Views: 114
Reputation: 74375
The answer to your question is, as usual: it depends. It depends on whether the network supports gathered reads (for outgoing messages) and scattered writes (for incoming messages), and on whether heterogeneous support is enabled.
When the data is non-contiguous in memory and the network does not support gathered reads from the main memory, the data has to be packed before being sent, therefore an additional copy has to be made. The same applies to unpacking the data into non-contiguous regions when the network does not support scattered writes into the main memory.
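To give a rough idea of what that software packing phase amounts to, here is a sketch using explicit MPI_Pack; the three-block layout and buffer size are assumptions for illustration, and a real MPI implementation does the equivalent internally with its own optimised routines:

```c
/* Rough sketch of software packing: the non-contiguous blocks are copied
 * into a contiguous scratch buffer before the send (the extra copy). */
#include <mpi.h>

void send_packed(const double *f, int dest, MPI_Comm comm)
{
    int blocklen = 4, nblocks = 3;
    int displs[3] = { 0, 40, 80 };   /* assumed layout */

    char buffer[4096];
    int position = 0;

    /* The additional copy: gather each scattered block into the buffer. */
    for (int b = 0; b < nblocks; ++b)
        MPI_Pack(f + displs[b], blocklen, MPI_DOUBLE,
                 buffer, sizeof(buffer), &position, comm);

    /* Only the contiguous packed buffer actually goes onto the wire. */
    MPI_Send(buffer, position, MPI_PACKED, dest, 0, comm);
}
```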
When heterogeneous support is enabled, all primitive data items have to be converted into an intermediate machine-independent representation. The converted data goes into an intermediate buffer that is later sent over the network.
To elaborate on the comment from Jonathan Dursi: MPI datatypes do not incur network communication overhead by themselves. The overhead comes from the data packing and unpacking that takes place before or after the communication operation. MPI datatypes are basically recipes for how data should be read from or written to memory when constructing or deconstructing a message. With network hardware that understands gathered reads and scattered writes, and an MPI implementation that can program that hardware appropriately, it could be possible to translate the instructions inside an MPI datatype into a set of read or write vectors and then have the network adapter do the heavy lifting of packing and unpacking. If the network does not support such operations, or the MPI implementation does not know how to offload them to the hardware, the packing has to happen in software, and it usually involves an intermediate buffer. That's where the overhead comes from.
As Jonathan Dursi has already noted, the datatype packer/unpacker routines in MPI are extremely optimised and usually do their job as efficiently as possible (simply take a look at the Open MPI source code to see how far it goes in tuning for best cache utilisation). Therefore, if your algorithm requires an indexed datatype or any other kind of datatype with holes between the data items, just construct the appropriate MPI datatype and use it.
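For instance, a strided column of a row-major 2-D array is a typical "datatype with holes". A sketch of describing it with MPI_Type_vector and letting the library (or capable hardware) handle the gather/scatter, with no manual packing loop in user code, could look like this; the 8x8 array size and the two ranks involved are assumptions:

```c
/* An MPI_Type_vector describing one column of an N-by-N row-major array. */
#include <mpi.h>

#define N 8

void exchange_column(double a[N][N], int rank, MPI_Comm comm)
{
    MPI_Datatype column_type;

    /* N blocks of 1 double, each N doubles apart: one column of a[][]. */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    if (rank == 0)
        MPI_Send(&a[0][N - 1], 1, column_type, 1, 0, comm);  /* last column */
    else if (rank == 1)
        MPI_Recv(&a[0][0], 1, column_type, 0, 0, comm,       /* first column */
                 MPI_STATUS_IGNORE);

    MPI_Type_free(&column_type);
}
```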
An example of a network interconnect that supports such operations is InfiniBand. Each send or receive work request is provided with a list of so-called Scatter/Gather Elements (SGEs). I haven't dived deep enough into the different MPI implementations to know whether they are able to utilise SGEs in order to skip the software packing phase. This probably won't work very well with a huge number of scattered data items anyway.
Also notice that no packing or unpacking is necessary for contiguous datatypes with zero padding between the data elements and for arrays of such datatypes. In that case the whole memory block is sent as-is to the other process (unless the system is heterogeneous).
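As an illustration of the contiguous case, a struct of three doubles has no internal padding, so an array of such structs can be described with MPI_Type_contiguous and sent without any packing; the struct layout here is an assumption, not taken from your code:

```c
/* Contiguous case: no packing needed, the memory block is sent as-is
 * (on a homogeneous system). */
#include <mpi.h>

typedef struct {
    double rho, ux, uy;   /* densely packed, no padding between members */
} cell_t;

void send_cells(const cell_t *cells, int ncells, int dest, MPI_Comm comm)
{
    MPI_Datatype cell_type;

    MPI_Type_contiguous(3, MPI_DOUBLE, &cell_type);
    MPI_Type_commit(&cell_type);

    /* The whole array is one contiguous block in memory. */
    MPI_Send(cells, ncells, cell_type, dest, 0, comm);

    MPI_Type_free(&cell_type);
}
```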
Upvotes: 3