Reputation: 757
I'm trying to put together a simple example to illustrate the benefits of using numba.prange for myself and some colleagues, but I'm unable to get a decent speedup. I coded up a simple 1D diffusion solver, which essentially loops over a long array, combining elements i+1, i, and i-1, and writing the results into element i of a second array. This should be a pretty perfect use case for a parallel for loop, similar to OpenMP in Fortran or C.
My complete example is included below:
import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def diffusion(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 10000000)
    # Initial condition
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    # Temporary work array
    C_ = np.zeros_like(C)
    # Loop over time (normal for-loop)
    for j in range(Nt):
        # Loop over array elements (space, parallel for-loop)
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C[:] = C_
    return C
# Run once to just-in-time compile
C = diffusion(1)
# Check timing
%timeit C = diffusion(100)
When running with parallel=False, this takes about 2 seconds, and with parallel=True it takes about 1.5 seconds. I'm running on a MacBook Pro with 4 physical cores, and Activity Monitor reports 100% CPU usage without parallelisation and around 700% with it.
I would have expected something closer to a factor-of-4 speedup. Am I doing something wrong?
Upvotes: 3
Views: 469
Reputation: 50308
The poor scalability almost certainly comes from saturation of the RAM, which is shared by all cores on a desktop machine. Indeed, your code is memory-bound, and memory throughput is quite limited on modern machines compared to the computing power of CPUs (or GPUs). As a result, 1 or 2 cores are generally enough to saturate the RAM on most desktop machines (a few more cores are needed on computing servers).
On a 10-core Intel Xeon processor with 40~43 GiB/s of RAM bandwidth, the code takes 1.32 seconds in parallel and 2.56 seconds in sequential, i.e. only about 2 times faster with 10 cores. That being said, the parallel loop reads the full C array once per time step and also reads+writes the full C_ array once per time step (x86 processors need to read the written memory by default due to the write-allocate cache policy). The C[:] = C_ copy does the same thing. This means (2*3)*(8*10e6)*100/1024**3 = 44.7 GiB of RAM are read/written during only 1.32 seconds in parallel, resulting in a 33.9 GiB/s memory throughput, reaching about 80% of the RAM bandwidth (very good for this use-case).
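For reference, here is a minimal sketch of that back-of-the-envelope calculation (the element count, float64 size, 100 time steps, and the 1.32 s parallel runtime are the figures assumed above):

    # Rough memory-traffic estimate for the benchmark above
    n_elements = 10_000_000   # array length used in the question
    bytes_per_elem = 8        # float64
    passes_per_step = 2 * 3   # stencil loop: read C, read+write C_; copy: read C_, read+write C
    n_steps = 100             # time steps in the timed run
    runtime_s = 1.32          # measured parallel runtime

    traffic_gib = passes_per_step * bytes_per_elem * n_elements * n_steps / 1024**3
    throughput = traffic_gib / runtime_s
    print(traffic_gib, throughput)  # ~44.7 GiB moved, ~33.9 GiB/s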
To speed up this code, you need to read/write less data from/to RAM and compute as much as possible in cache. The first thing to do is to use a double-buffering approach with two views so as to avoid the very expensive copy. Another optimization is to try to do multiple time steps in parallel at the same time. This is theoretically possible using a complex trapezoidal tiling strategy, but it is very tricky to implement in practice, especially in Numba. High-performance stencil libraries should do that for you. Such optimizations should not only improve the sequential execution but also the scalability of the resulting parallel code, as sketched below.
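As an illustration, here is a minimal sketch of the double-buffering idea applied to the question's code (the function name and the choice to initialise the work array with a copy, so the boundary values stay fixed instead of being zeroed, are my own):

    import numpy as np
    from numba import jit, prange

    @jit(nopython=True, parallel=True)
    def diffusion_double_buffered(Nt):
        alpha = 0.49
        x = np.linspace(0, 1, 10000000)
        # Initial condition
        C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
        # Second buffer: start from a copy so the untouched boundary values are consistent
        C_ = C.copy()
        for j in range(Nt):
            for i in prange(1, len(C)-1):
                C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
            # Swap the roles of the two buffers instead of copying C_ back into C
            C, C_ = C_, C
        return C

Swapping the two references removes the read+write pass of the C[:] = C_ copy each time step, which should roughly halve the memory traffic, though the remaining stencil loop is still memory-bound.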
Upvotes: 3