Reputation: 105
I would like to post another example of a Numba threading problem that I do not know how to solve. It's a very simple problem that computes the inverses of block matrices in parallel.
import numpy as np
from numpy import empty, float64, ascontiguousarray
from numpy.linalg import inv
from numba import njit, prange, set_num_threads

@njit(parallel=True)
def tester1(A, m, n):
    inv_Lambda_p = empty((n, m, m), dtype=float64)
    for i in prange(n):
        inv_Lambda_p[i] = inv(ascontiguousarray(A[i]))
    return inv_Lambda_p

A = np.random.rand(1000000, 10, 10)
m = 10
n = 1000000
inv_a = tester1(A, m, n)

for i in range(16):
    print("Thread ", i + 1)
    set_num_threads(i + 1)
    %timeit tester1(A, m, n)
Results
Thread 1
1.72 s ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 2
932 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 3
680 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 4
573 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 5
555 ms ± 8.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 6
530 ms ± 4.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 7
511 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 8
492 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 9
478 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 10
471 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 11
498 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 12
451 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 13
442 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 14
431 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 15
421 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 16
442 ms ± 4.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why are the timings with more threads sometimes slower than with fewer? For instance, look at 15 vs. 16 threads. Is this normal, and why? My CPU has 8 cores / 16 threads at 3.7 GHz with 16 MB of cache, and I have 16 GB of RAM. The CPU is also cooled by a liquid (hydro) cooling system.
Upvotes: 1
Views: 513
Reputation: 50308
This appears to be due to allocations that do not scale in parallel. One solution is to rewrite this using a Gauss-Jordan elimination, taking care not to allocate too many arrays. Here is an implementation:
import numpy as np
import numba as nb
from numba import set_num_threads

@nb.njit('(float64[:,::1], float64[:,::1])', fastmath=True)
def gaussJordan(mat, out):
    assert mat.shape[0] == mat.shape[1]
    n = mat.shape[0]
    tmpMat = np.zeros((n, 2*n))

    # Copy the input matrix
    for i in range(n):
        for j in range(n):
            tmpMat[i, j] = mat[i, j]

    # Extend the matrix with the identity
    for i in range(n):
        tmpMat[i, n+i] = 1.0

    # Pivot
    for i in range(n - 1, 0, -1):
        if tmpMat[i - 1, 0] < tmpMat[i, 0]:
            for j in range(2*n):
                temp = tmpMat[i, j]
                tmpMat[i, j] = tmpMat[i - 1, j]
                tmpMat[i - 1, j] = temp

    # Actual elimination
    for i in range(n):
        for j in range(n):
            if j != i:
                temp = tmpMat[j, i] / tmpMat[i, i]
                for k in range(2 * n):
                    tmpMat[j, k] -= tmpMat[i, k] * temp

    # Rescale and copy back to the output matrix
    for i in range(n):
        temp = 1.0 / tmpMat[i, i]
        for j in range(n):
            out[i, j] = tmpMat[i, n+j] * temp

@nb.njit('(float64[:, :, :],)', parallel=True, fastmath=True)
def fastInv(mat):
    assert mat.shape[1] == mat.shape[2]
    n, m = mat.shape[0:2]
    mat2 = np.ascontiguousarray(mat)
    inv_Lambda_p = np.empty((n, m, m), dtype=np.float64)
    for i in nb.prange(n):
        gaussJordan(mat2[i], inv_Lambda_p[i])
    return inv_Lambda_p

A = np.random.rand(1000000, 10, 10)
m = 10
n = 1000000
inv_a = fastInv(A)

for i in range(6):
    print("Thread ", i + 1)
    set_num_threads(i + 1)
    %timeit fastInv(A)
The code performs only one allocation per iteration, which should be enough for your use case. The np.ascontiguousarray call is done once, up front, so as to avoid many allocations in case the input array is not contiguous.
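As a quick sanity check (a minimal sketch, not part of the method itself), you can verify whether that copy will actually happen: if the input is already C-contiguous, np.ascontiguousarray returns the array itself and no extra allocation is made.

import numpy as np

A = np.random.rand(1000, 10, 10)            # freshly created NumPy arrays are C-contiguous
print(A.flags['C_CONTIGUOUS'])              # True
print(np.ascontiguousarray(A) is A)         # True: no copy, no extra allocation in this case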
Here are results on my 6-core machine:
OLD VERSION:
Thread 1
2.26 s ± 6.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 2
1.34 s ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 3
1.07 s ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 4
945 ms ± 7.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 5
865 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 6
832 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NEW VERSION:
Thread 1
2.4 s ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 2
1.26 s ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 3
868 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 4
667 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 5
551 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thread 6
487 ms ± 6.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that while the sequential version is a bit slower, the parallel implementation is 4.92 times faster, which is pretty good. In fact, Intel processors like mine (i5-9600KF) reduce the frequency a bit when many cores are running (especially when packed SIMD AVX2 instructions are used): in parallel, the frequency drops from 4.5 GHz to 4.3 GHz, so the best achievable scaling is 6 / (4.5 / 4.3) ≈ 5.73. I guess the scaling could be a bit better without the allocation in each iteration, but this is hard to improve in Numba (there is a Numba optimization that allocates the array once per thread when the code is inlined, but the code is surprisingly much slower when this is done).
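For completeness, here is that arithmetic spelled out against the NEW VERSION timings above (1 thread vs. 6 threads); it is only a back-of-the-envelope check:

ideal_scaling = 6 * 4.3 / 4.5        # 6 cores corrected for the frequency drop -> ~5.73
measured_scaling = 2.4 / 0.487       # 1-thread time / 6-thread time -> ~4.93
print(ideal_scaling, measured_scaling)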
Note that the Gauss-Jordan elimination may not be as accurate as the NumPy method (which AFAIK uses an LU factorization in this case), but it should be fine for 10x10 matrices (as long as the input values in A are also reasonable).
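If accuracy is a concern, a minimal check (a sketch, with an arbitrarily chosen tolerance) is to compare the Gauss-Jordan results against numpy.linalg.inv on a sample of the blocks:

import numpy as np

sample = A[:1000]                            # a subset of the 10x10 blocks
ref = np.linalg.inv(sample)                  # NumPy's batched inverse as reference
approx = fastInv(sample)
print(np.max(np.abs(approx - ref)))          # largest absolute deviation
print(np.allclose(approx, ref, atol=1e-8))   # tolerance is an assumption, adjust as needed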
Finally, note that since your processor has 8 cores and 16 threads, the computation cannot be sped up by more than roughly 8x for compute-heavy operations like this one. Hardware threads (SMT) just improve core usage by keeping the execution units busier, but they are not actual cores. They can only provide an additional speed-up when the computing units are not already saturated, which should not be the case if the code is efficiently implemented. Hardware threads are typically useful to speed up latency-bound codes operating on main memory, for example.
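In practice, you can simply cap the Numba thread pool at the number of physical cores (8 on your machine); this is a minimal sketch using Numba's threading API:

from numba import set_num_threads, get_num_threads

set_num_threads(8)          # one thread per physical core, ignoring SMT
print(get_num_threads())    # 8
%timeit fastInv(A)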
Upvotes: 2