ABZANMASTER

Reputation: 105

Why numba parallel is slower when using more threads in some cases?

I would like to post another example of a Numba threading problem that I do not know how to solve. It is a very simple task: computing the inverses of a batch of block matrices in parallel.

import numpy as np
from numpy import empty, float64, ascontiguousarray
from numpy.linalg import inv
from numba import njit, prange, set_num_threads

@njit(parallel=True)
def tester1(A, m, n):
    # Invert each m x m block of A in parallel
    inv_Lambda_p = empty((n, m, m), dtype=float64)
    for i in prange(n):
        inv_Lambda_p[i] = inv(ascontiguousarray(A[i]))
    return inv_Lambda_p

A = np.random.rand(1000000, 10, 10)
m = 10
n = 1000000
inv_a = tester1(A, m, n)
for i in range(16):
    print("Thread ", i+1)
    set_num_threads(i+1)
    %timeit tester1(A, m, n)

Results

Thread  1
1.72 s ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  2
932 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  3
680 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  4
573 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  5
555 ms ± 8.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  6
530 ms ± 4.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  7
511 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  8
492 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  9
478 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  10
471 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  11
498 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  12
451 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  13
442 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  14
431 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  15
421 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  16
442 ms ± 4.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why are the times with more threads sometimes slower than with fewer? For instance, compare 15 and 16 threads. Is this normal, and why? My CPU has 8 cores / 16 threads at 3.7 GHz with a 16 MB cache, and the machine has 16 GB of RAM. The CPU is also cooled by a liquid cooling system.

Upvotes: 1

Views: 513

Answers (1)

Jérôme Richard
Jérôme Richard

Reputation: 50308

This appears to be due to allocations that do not scale in parallel. One solution is to rewrite the computation using Gauss-Jordan elimination, taking care not to allocate too many arrays. Here is an implementation:

import numpy as np
import numba as nb
from numba import set_num_threads

@nb.njit('(float64[:,::1], float64[:,::1])', fastmath=True)
def gaussJordan(mat, out):
    assert mat.shape[0] == mat.shape[1]
    n = mat.shape[0]

    tmpMat = np.zeros((n, 2*n))

    # Copy the input matrix
    for i in range(n):
        for j in range(n):
            tmpMat[i, j] = mat[i, j]
 
    # Extends the matrix
    for i in range(n):
        tmpMat[i, n+i] = 1.0
 
    # Pivot
    for i in range(n - 1, 0, -1):
        if tmpMat[i - 1, 0] < tmpMat[i, 0]:
            for j in range(2*n):
                temp = tmpMat[i, j]
                tmpMat[i, j] = tmpMat[i - 1, j]
                tmpMat[i - 1, j] = temp

    # Actual elimination
    for i in range(n):
        for j in range(n):
            if j != i:
                temp = tmpMat[j, i] / tmpMat[i, i]
                for k in range(2 * n):
                    tmpMat[j, k] -= tmpMat[i, k] * temp
 
    # Rescaling and copy back to the output matrix
    for i in range(n):
        temp = 1.0 / tmpMat[i, i]
        for j in range(n):
            out[i, j] = tmpMat[i, n+j] * temp

@nb.njit('(float64[:, :, :],)', parallel=True, fastmath=True)
def fastInv(mat):
    assert mat.shape[1] == mat.shape[2]
    n, m = mat.shape[0:2]
    mat2 = np.ascontiguousarray(mat)
    inv_Lambda_p = np.empty((n, m, m), dtype=np.float64)
    for i in nb.prange(n):
        gaussJordan(mat2[i], inv_Lambda_p[i])
    return inv_Lambda_p

A=np.random.rand(1000000,10,10)
m=10
n=1000000
inv_a=fastInv(A)
for i in range(6):
    print("Thread ",i+1)
    set_num_threads(i+1)
    %timeit fastInv(A)

The code performs only one allocation per iteration, which should be enough for your use-case. The np.ascontiguousarray call is done once, so as to avoid many allocations in case the input array is not contiguous.
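As a small illustration of that point (my addition, not part of the original answer): if A happens to be a non-contiguous view, fastInv still works, because the single np.ascontiguousarray call copies it once before the parallel loop rather than once per iteration.

A_view = np.random.rand(1000, 20, 20)[:, ::2, ::2]   # 1000 non-contiguous 10x10 matrices
print(A_view.flags['C_CONTIGUOUS'])                   # False
inv_view = fastInv(A_view)                            # one contiguous copy, then parallel inversion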

Here are results on my 6-core machine:

OLD VERSION:

Thread  1
2.26 s ± 6.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  2
1.34 s ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  3
1.07 s ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  4
945 ms ± 7.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  5
865 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  6
832 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


NEW VERSION:

Thread  1
2.4 s ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  2
1.26 s ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  3
868 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  4
667 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  5
551 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread  6
487 ms ± 6.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that while the sequential version is a bit slower, the parallel implementation is 4.92 times faster, which is pretty good. In fact, Intel processors like mine (i5-9600KF) reduce the frequency a bit when many cores are running (especially when packed SIMD AVX2 instructions are used): in parallel, the frequency switches from 4.5 GHz to 4.3 GHz, so the optimal scaling is 6 / (4.5 / 4.3) = 5.73. I guess the scaling could be a bit better without the allocation in each iteration, but this is hard to improve in Numba (there is an optimization in Numba to allocate the array only once per thread when the code is inlined, but the code is surprisingly much slower when this is done).
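For reference, here is that scaling estimate written out as a tiny computation (my addition), using the numbers quoted above:

cores = 6
f_one_core, f_all_cores = 4.5, 4.3            # GHz: turbo frequency with 1 core vs all cores active
optimal_speedup = cores * f_all_cores / f_one_core
print(optimal_speedup)                         # ~5.73
print(2.4 / 0.487)                             # ~4.93: measured 1-thread vs 6-thread speedup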

Note that the Gauss-Jordan elimination may not be as accurate as the NumPy method (which AFAIK uses an LU factorization in this case), but it should be fine with 10x10 matrices (as long as the input values in A are also reasonable).
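If accuracy matters for your data, a quick sanity check like the following (my addition, not from the answer) compares fastInv against np.linalg.inv on a small batch:

A_check = np.random.rand(1000, 10, 10)
max_err = np.max(np.abs(fastInv(A_check) - np.linalg.inv(A_check)))
print(max_err)   # typically very small for well-conditioned 10x10 matrices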

Finally, note that since your processor has 8 cores and 16 hardware threads, the computation cannot be more than about 8 times faster for heavily compute-bound operations like this one. Hardware threads just improve core usage by keeping the execution units more saturated, but they are not actual cores. They can only provide an additional speed-up when the computing units are not already saturated, which should not be the case if the code is efficiently implemented. Hardware threads are typically useful to speed up latency-bound codes operating on the main memory, for example.
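In practice, this means you can simply cap the Numba thread count at your physical core count for a compute-bound kernel like this one. A minimal sketch (my addition, assuming the 8-core CPU described above):

from numba import set_num_threads

physical_cores = 8              # assumption: 8 physical cores / 16 hardware threads, as stated above
set_num_threads(physical_cores)
inv_a = fastInv(A)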

Upvotes: 2
