Tor

Reputation: 757

Poor performance from numba.prange

I'm trying to put together a simple example to illustrate the benefits of using numba.prange for myself and some colleagues, but I'm unable to get a decent speedup. I coded up a simple 1D diffusion solver, which essentially loops over a long array, combining elements i+1, i, and i-1, and writing the results into element i of a second array. This should be a pretty perfect use case for a parallel for loop, similar to OpenMP in Fortran or C.

My complete example is included below:

import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def diffusion(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 10000000)
    # Initial condition
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    # Temporary work array
    C_ = np.zeros_like(C)
    # Loop over time (normal for-loop)
    for j in range(Nt):
        # Loop over array elements (space, parallel for-loop)
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C[:] = C_
    return C

# Run once to just-in-time compile
C = diffusion(1)

# Check timing
%timeit C = diffusion(100)

When running with parallel=False, this takes about 2 seconds, and with parallel=True it takes about 1.5 seconds. I'm running on a MacBook Pro with 4 physical cores, and Activity Monitor reports around 100% CPU usage without parallelisation and around 700% with it.

I would have expected something closer to a factor 4 speedup. Am I doing something wrong?

Upvotes: 3

Views: 469

Answers (1)

Jérôme Richard

Reputation: 50308

The poor scalability certainly comes from saturation of the RAM, which is shared by all cores on a desktop machine. Indeed, your code is memory-bound, and memory throughput is quite limited on modern machines compared to the computing power of CPUs (or GPUs). As a result, 1 or 2 cores are generally enough to saturate the RAM on most desktop machines (a few more cores are needed on computing servers).
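A quick way to convince yourself of this is to time a bare parallel copy, which does almost no arithmetic and should therefore show the same poor scaling if memory really is the bottleneck. A minimal sketch under that assumption (copy_kernel is just an illustrative name, and timings are machine dependent):

import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def copy_kernel(dst, src):
    # Pure memory traffic: one read and one write per element
    for i in prange(src.size):
        dst[i] = src[i]

a = np.random.rand(10_000_000)
b = np.empty_like(a)

# Run once to just-in-time compile
copy_kernel(b, a)

%timeit copy_kernel(b, a)  # compare against a parallel=False build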

On a 10-core Intel Xeon processor with 40~43 GiB/s of RAM bandwidth, the code takes 1.32 seconds in parallel and 2.56 seconds sequentially. This means it is only 2 times faster with 10 cores. That being said, the parallel loop reads the full C array once per time step and also reads+writes the full C_ array once per time step (x86 processors need to read the written memory by default due to the write-allocate cache policy). The C[:] = C_ copy does the same thing. This means (2*3)*(8*10e6)*100/1024**3 = 44.7 GiB of RAM is read/written during only 1.32 seconds in parallel, resulting in a 33.9 GiB/s memory throughput, which reaches 80% of the RAM bandwidth (very good for this use-case).
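For reference, here is that traffic estimate spelled out in plain Python (the factor 2 is the two passes per time step, stencil loop plus copy, and the factor 3 is the three full-array traversals each pass causes):

array_bytes = 8 * 10_000_000  # one float64 array of 1e7 elements
traffic_gib = 2 * 3 * array_bytes * 100 / 1024**3
print(traffic_gib)         # ~44.7 GiB moved over 100 time steps
print(traffic_gib / 1.32)  # ~33.9 GiB/s effective memory throughput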

To speed up this code, you need to read/write less data from/to RAM and compute as much data as possible in cache. The first thing to do is to use a double-buffering approach with two views so as to avoid the very expensive copy (see the sketch below). Another optimization is to try to compute multiple time steps in parallel at the same time. This is theoretically possible using a complex trapezoidal tiling strategy, but it is very tricky to implement in practice, especially in Numba; high-performance stencil libraries should do that for you. Such optimizations should improve not only the sequential execution but also the scalability of the resulting parallel code.
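As a rough illustration of the double-buffering idea, here is a sketch of the original function with the two arrays swapped by reference at the end of each time step instead of copied (diffusion_db is a made-up name and the variant is not benchmarked):

import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def diffusion_db(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 10000000)
    # Initial condition
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    # Second buffer; copy so the (never-updated) boundary values match
    C_ = C.copy()
    for j in range(Nt):
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        # Swap the two references instead of copying 80 MB per step
        C, C_ = C_, C
    return C

The swap is O(1) where the C[:] = C_ copy was O(n), which removes half of the RAM traffic per time step.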

Upvotes: 3
