Reputation: 320
I have been doing some work with MPI4py and numpy arrays, and I recently came across the performance increase obtained by using Scatterv(). I have written a function that inspects the data type of an input object and, if it is a numeric numpy array, performs the scattering with Scatterv(); otherwise it falls back to the generic comm.scatter().
The code looks like this:
import numpy as np
from mpi4py import MPI
import cProfile
from line_profiler import LineProfiler
def ScatterV(object, comm, root = 0):
    """Scatter `object` from `root`: Scatterv for numeric numpy arrays, comm.scatter otherwise."""
    rank = comm.Get_rank()
    optimize_scatter, object_type = np.zeros(1), None

    if rank == root:
        if isinstance(object, np.ndarray):
            if object.dtype in [np.float64, np.float32, np.float16, np.float,
                                np.int, np.int8, np.int16, np.int32, np.int64]:
                optimize_scatter = 1
                object_type = object.dtype
            else: optimize_scatter, object_type = 0, None
        else: optimize_scatter, object_type = 0, None

        optimize_scatter = np.array(optimize_scatter, dtype=np.float64).ravel()

    comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    object_type = comm.bcast(object_type, root=root)

    if int(optimize_scatter) == 1:
        if rank == root:
            # split the rows as evenly as possible among the ranks
            displs = [int(i)*object.shape[1] for i in
                      np.linspace(0, object.shape[0], comm.size + 1)]
            counts = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
            lens = [int((displs[i+1] - displs[i])/(object.shape[1]))
                    for i in range(len(displs)-1)]
            displs = displs[:-1]
            shape = object.shape
            object = object.ravel().astype(np.float64, copy=False)
        else:
            object, counts, displs, shape, lens = None, None, None, None, None

        counts = comm.bcast(counts, root=root)
        displs = comm.bcast(displs, root=root)
        lens = comm.bcast(lens, root=root)
        shape = list(comm.bcast(shape, root=root))
        shape[0] = lens[rank]
        shape = tuple(shape)

        x = np.zeros(counts[rank])
        comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
        return np.reshape(x, (-1,) + shape[1:]).astype(object_type, copy=False)
    else:
        return comm.scatter(object, root=root)
comm = MPI.COMM_WORLD
size, rank = comm.Get_size(), comm.Get_rank()
if rank == 0:
    arra = (np.random.rand(10000000, 10) * 100).astype(np.float64, copy=False)
else: arra = None
lp = LineProfiler()
lp_wrapper = lp(ScatterV)
lp_wrapper(arra, comm)
if rank == 4: lp.print_stats()
pr = cProfile.Profile()
pr.enable()
f2 = ScatterV(arra, comm)
pr.disable()
if rank == 4: pr.print_stats()
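For completeness, and separate from the timing runs above, this is roughly how the wrapper is meant to be used for the two paths, reusing the comm, size and rank defined in the script (a sketch only; the non-numpy case has to be pre-chunked into comm.size pieces, because the fallback comm.scatter() expects one item per rank):

# numeric path: a 2-D float64 array is split row-wise with Scatterv
if rank == 0:
    small = np.arange(20.0).reshape(10, 2)
else:
    small = None
local_rows = ScatterV(small, comm)       # each rank receives a block of rows

# generic path: any other object must already be a list with one chunk per rank
if rank == 0:
    chunks = [list(range(i, i + 3)) for i in range(size)]
else:
    chunks = None
local_chunk = ScatterV(chunks, comm)     # falls back to comm.scatter()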
The analysis with LineProfiler yields the following results [cut to show only the conflicting lines]:
Timer unit: 1e-06 s
Total time: 2.05001 s
File: /media/SETH_DATA/SETH_Alex/BigMPI4py/prueba.py
Function: ScatterV at line 26
Line # Hits Time Per Hit % Time Line Contents
==============================================================
...
41 1 1708453.0 1708453.0 83.3 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
42 1 148.0 148.0 0.0 object_type = comm.bcast(object_type, root=root)
...
76 1 264.0 264.0 0.0 counts = comm.bcast(counts, root=root)
77 1 16.0 16.0 0.0 displs = comm.bcast(displs, root=root)
78 1 14.0 14.0 0.0 lens = comm.bcast(lens, root=root)
79 1 9.0 9.0 0.0 shape = list(comm.bcast(shape, root=root))
...
86 1 340971.0 340971.0 16.6 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
The analysis with cProfile yields the following results:
17 function calls in 0.462 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.127 0.127 0.127 0.127 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1 0.335 0.335 0.335 0.335 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5 0.000 0.000 0.000 0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
In both cases, the Bcast method consumes a lot of time compared with the Scatterv method. Even more strikingly, with LineProfiler the Bcast method is 5 times slower than the Scatterv method, which seems completely incoherent to me since Bcast is only broadcasting a single-element array.
If I swap lines 41 and 42, these are the results:
LineProfiler
41 1 1666718.0 1666718.0 83.0 object_type = comm.bcast(object_type, root=root)
42 1 47.0 47.0 0.0 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
87 1 341728.0 341728.0 17.0 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
cProfile
1 0.000 0.000 0.000 0.000 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1 0.339 0.339 0.339 0.339 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5 0.129 0.026 0.129 0.026 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
If I vary the size of the array to be scattered, the timings of Scatterv and Bcast also vary, at the same rate. For instance, if I increase the number of rows tenfold (to 100000000), the results are:
LineProfiler
41 1 16304301.0 16304301.0 82.8 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
42 1 235.0 235.0 0.0 object_type = comm.bcast(object_type, root=root)
87 1 3393658.0 3393658.0 17.2 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
cProfile
1 1.348 1.348 1.348 1.348 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1 4.517 4.517 4.517 4.517 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5 0.000 0.000 0.000 0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
If, instead of selecting the results for rank 4, I select them for any rank > 1, the same thing happens. However, for rank = 0 the results differ:
LineProfiler
41 1 186.0 186.0 0.0 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
42 1 244.0 244.0 0.0 object_type = comm.bcast(object_type, root=root)
87 1 4722349.0 4722349.0 100.0 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
cProfile
1 0.000 0.000 0.000 0.000 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
1 5.921 5.921 5.921 5.921 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
5 0.000 0.000 0.000 0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
In this case, the Bcast method takes a similar amount of time to the rest of the bcast calls.
I have also tried using bcast and scatter instead of Bcast on line 41, which yields the same results.
Given that, I think the increased time consumption is being attributed erroneously to the first broadcast alone, which means that both profilers report misleading timings for parallel code. I suspect that the internal machinery of these profilers is not designed to handle parallel functions, but I am posting this question to find out whether someone has experienced similar results.
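To make sure this is not an artifact of either profiler, the collective can be timed directly with MPI.Wtime() in a minimal standalone reproduction; here the time.sleep() stands in for the root-only work (in my script, generating the 10000000 x 10 random array), and all names are illustrative:

import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(1, dtype=np.float64)
if rank == 0:
    time.sleep(2.0)                    # stands in for the root-only array creation
    buf[0] = 1.0

t0 = MPI.Wtime()
comm.Bcast([buf, 1, MPI.DOUBLE], root=0)   # non-root ranks block here until root arrives
t1 = MPI.Wtime()

comm.Barrier()                             # now every rank is synchronized
t2 = MPI.Wtime()
comm.Bcast([buf, 1, MPI.DOUBLE], root=0)   # same call, measured without the wait
t3 = MPI.Wtime()

print(f"rank {rank}: first Bcast {t1 - t0:.3f} s, second Bcast {t3 - t2:.6f} s")

On the non-root ranks the first Bcast reports roughly the 2 s of the sleep, while the second one is nearly instantaneous, which is the same pattern both profilers show for my function.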
Upvotes: 0
Views: 348
Reputation: 320
In response to Gilles Gouaillardet, I have added comm.Barrier() calls before and after each broadcast, and most of the time is now absorbed by those comm.Barrier() calls.
Here is an example with LineProfiler:
Timer unit: 1e-06 s
Total time: 2.17248 s
File: /media/SETH_DATA/SETH_Alex/BigMPI4py/prueba.py
Function: ScatterV at line 26
Line # Hits Time Per Hit % Time Line Contents
==============================================================
26 def ScatterV(object, comm, root = 0):
27 1 7.0 7.0 0.0 optimize_scatter, object_type = np.zeros(1), None
28
29 1 2.0 2.0 0.0 if rank == root:
30 if isinstance(object, np.ndarray):
31 if object.dtype in [np.float64, np.float32, np.float16, np.float,
32 np.int, np.int8, np.int16, np.int32, np.int64]:
33 optimize_scatter = 1
34 object_type = object.dtype
35
36 else: optimize_scatter, object_type = 0, None
37 else: optimize_scatter, object_type = 0, None
38
39 optimize_scatter = np.array(optimize_scatter, dtype=np.float64).ravel()
40
41 1 1677662.0 1677662.0 77.2 comm.Barrier()
42 1 76.0 76.0 0.0 comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
43 1 345.0 345.0 0.0 comm.Barrier()
44 1 111.0 111.0 0.0 object_type = comm.bcast(object_type, root=root)
45 1 166.0 166.0 0.0 comm.Barrier()
46
47
48
49 1 7.0 7.0 0.0 if int(optimize_scatter) == 1:
50
51 1 2.0 2.0 0.0 if rank == root:
52 if object.ndim > 1:
53 displs = [int(i)*object.shape[1] for i in
54 np.linspace(0, object.shape[0], comm.size + 1)]
55 else:
56 displs = [int(i) for i in np.linspace(0, object.shape[0], comm.size + 1)]
57
58 counts = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
59
60 if object.ndim > 1:
61 lens = [int((displs[i+1] - displs[i])/(object.shape[1]))
62 for i in range(len(displs)-1)]
63 else:
64 lens = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
65
66 displs = displs[:-1]
67
68
69 shape = object.shape
70
71
72
73 if object.ndim > 1:
74 object = object.ravel().astype(np.float64, copy=False)
75
76
77 else:
78 1 2.0 2.0 0.0 object, counts, displs, shape, lens = None, None, None, None, None
79
80 1 295.0 295.0 0.0 counts = comm.bcast(counts, root=root)
81 1 66.0 66.0 0.0 displs = comm.bcast(displs, root=root)
82 1 6.0 6.0 0.0 lens = comm.bcast(lens, root=root)
83 1 9.0 9.0 0.0 shape = list(comm.bcast(shape, root=root))
84
85 1 2.0 2.0 0.0 shape[0] = lens[rank]
86 1 3.0 3.0 0.0 shape = tuple(shape)
87
88 1 33.0 33.0 0.0 x = np.zeros(counts[rank])
89
90 1 76.0 76.0 0.0 comm.Barrier()
91 1 351187.0 351187.0 16.2 comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
92 1 142352.0 142352.0 6.6 comm.Barrier()
93
94 1 5.0 5.0 0.0 if len(shape) > 1:
95 1 66.0 66.0 0.0 return np.reshape(x, (-1,) + shape[1:]).astype(object_type, copy=False)
96 else:
97 return x.view(object_type)
98
99
100 else:
101 return comm.scatter(object, root=root)
77.2% of the time is spent on the first comm.Barrier() call, so I can safely assume that neither broadcast is actually taking such an overwhelming amount of time; the barrier simply absorbs the wait for the root rank to catch up. I will consider adding comm.Barrier() calls before collectives in future profiling runs.
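For those future runs, the pattern can be packaged in a small helper that separates the wait caused by load imbalance from the cost of the collective itself. This is just a sketch of the idea; the helper name is mine and not part of the original code:

from mpi4py import MPI

def timed_collective(comm, collective, label=""):
    """Barrier first so the measured time excludes waiting for slower ranks,
    then time the collective itself. Returns (wait_seconds, collective_seconds)."""
    t0 = MPI.Wtime()
    comm.Barrier()                    # absorbs imbalance from earlier, rank-local work
    t1 = MPI.Wtime()
    collective()                      # the actual MPI call, e.g. a Bcast or Scatterv
    t2 = MPI.Wtime()
    if label:
        print(f"rank {comm.Get_rank()} {label}: waited {t1 - t0:.3f} s, "
              f"collective {t2 - t1:.6f} s")
    return t1 - t0, t2 - t1

# example: time the small broadcast from ScatterV
# timed_collective(comm, lambda: comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=0), "Bcast")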
Upvotes: 0