Reputation: 133
I'm trying to parallelize a simple Python program using numba. It consists of two functions. The first one is the simple power method:
import numba
import numpy as np

@numba.jit(nopython=True)
def power_method(A, v):
    u = v.copy()
    for i in range(3 * 10**3):
        u = A @ u
        u /= np.linalg.norm(u)
    return u
The second function iterates over the grid and runs power_method for the various vectors v:
@numba.jit(nopython=True, parallel=True)
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]
    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)
    for i1 in range(sz):
        v = np.empty(n, dtype=np.float64)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                v[0] = v1
                v[1] = v2
                v[2] = v3
                u = power_method(A, v)
                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = u.copy()
    return results
Then I run it with
n = 3
A = np.random.randn(n, n)
iterate_grid(A, 5.0, 20)
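(For reference, a minimal sketch of how the wall times below can be measured; the warm-up call with a small grid and the use of the time module are my additions, so that JIT compilation is excluded from the measurement:)

import time

iterate_grid(A, 5.0, 2)                  # warm-up: triggers JIT compilation
start = time.perf_counter()
iterate_grid(A, 5.0, 20)                 # the actual measured run
print(f"wall time: {time.perf_counter() - start:.2f} s")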
All iterations are independent. Moreover, the working set easily fits in cache, so I would expect that parallelizing the outer loop with prange would give approximately linear speedup.
However, the wall time for the sequential code is 6.07 s, while for the parallel code
@numba.jit(nopython=True, parallel=True)
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]
    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)
    for i1 in numba.prange(sz):
        v = np.empty(n, dtype=np.float64)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                v[0] = v1
                v[1] = v2
                v[2] = v3
                u = power_method(A, v)
                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = u.copy()
    return results
the wall time is 7.79 s. That is, parallelization slows down the code in this case.
Moreover, as I can see from iterate_grid.parallel_diagnostics(level=4), numba fuses tmp = np.linspace(-scale, scale, sz) with the for i1 in numba.prange(sz): loop, which is incorrect, because I need all the values of tmp in the inner loops.
Parallelizing the power method is not the task I'm actually trying to solve; it's just a small example where numba does not behave as I expect. Can you explain this behavior of numba and advise how I can parallelize the iterations over the grid to get linear speedup?
Thank you in advance for your help!
Upvotes: 3
Views: 309
Reputation: 50308
Correctness prevails over performance. Thus, you need to check that the results are actually correct before trying to measure the performance of the implementation. And it turns out the results of the sequential and parallel implementations are different (you can use np.allclose to check that, or plot the array differences). Since the parallel code is basically the same except for the use of prange, a race condition is very likely the cause. Running the program multiple times and checking whether the results differ is a good way to test for the probable presence of a race condition (note that the absence of differences does not mean there are no race conditions, and other problems can also cause non-deterministic results). A practical test on my machine gives very different results.
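For example, a quick consistency check along these lines (a sketch; iterate_grid_seq and iterate_grid_par are hypothetical names for the two compiled variants above, since the question redefines iterate_grid):

A = np.random.randn(3, 3)
ref = iterate_grid_seq(A, 5.0, 20)   # sequential (range) version
out = iterate_grid_par(A, 5.0, 20)   # parallel (prange) version

print(np.allclose(ref, out))         # False signals a correctness problem
print(np.abs(ref - out).max())       # size of the worst discrepancy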
The thing is, your code looks completely correct! Generally, this means that the problem is due to undefined behaviour or compiler/runtime bugs. A deeper analysis shows that the problem does not appear if tmp = np.linspace(-scale, scale, sz) is moved inside the parallel loop. This is certainly due to a bug in Numba.
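A sketch of that workaround (it sidesteps the fusion rather than fixing the underlying bug): recompute the linspace inside the prange body so that Numba cannot fuse it with the parallel loop:

@numba.jit(nopython=True, parallel=True)
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]
    results = np.empty((sz**3, n))
    for i1 in numba.prange(sz):
        tmp = np.linspace(-scale, scale, sz)  # moved inside the parallel loop
        v = np.empty(n, dtype=np.float64)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                v[0] = v1
                v[1] = v2
                v[2] = v3
                u = power_method(A, v)
                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = u.copy()
    return results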
Assuming the results were correct, the performance would be strongly dependent on the power_method function, which is not efficient. Indeed, neither Numba nor Numpy is optimized for very small arrays. Numba does not often optimize out array allocations, except in some specific cases (like explicit allocations in a prange-based loop with detected invariants, or similar cases, as pointed out by @max9111): arrays are allocated on the heap using a quite slow allocation method. On top of that, allocations do not scale, preventing any parallel speed-up. An optimization is to do the operations manually using 3 scalar variables. Here is an optimized implementation:
@numba.njit
def fast_power_method(A, v1, v2, v3):
    # Unpacking
    # Note: there is no need for a copy here
    u1, u2, u3 = v1, v2, v3
    A11, A12, A13 = A[0]
    A21, A22, A23 = A[1]
    A31, A32, A33 = A[2]
    for i in range(3_000):
        # Optimized matrix multiplication
        t1 = u1 * A11 + u2 * A12 + u3 * A13
        t2 = u1 * A21 + u2 * A22 + u3 * A23
        t3 = u1 * A31 + u2 * A32 + u3 * A33
        # Renormalization
        # Note: multiplications are faster than divisions
        norm = np.sqrt(t1**2 + t2**2 + t3**2)
        inv_norm = 1.0 / norm
        u1 = t1 * inv_norm
        u2 = t2 * inv_norm
        u3 = t3 * inv_norm
    return u1, u2, u3
@numba.njit
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]
    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)
    for i1 in range(sz):
        # Note: the per-iteration np.empty allocation of v is no longer needed
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                u1, u2, u3 = fast_power_method(A, v1, v2, v3)
                idx = i1 * sz**2 + i2 * sz + i3
                results[idx, 0] = u1
                results[idx, 1] = u2
                results[idx, 2] = u3
    return results
This implementation takes 0.4 seconds, while the initial version takes about 10 seconds. Thus, it is 25 times faster while still being sequential. Results are slightly different due to floating-point rounding and non-associativity. Note that using prange with this version is 5 times faster on my 6-core machine, so it scales well, but the results are still wrong for the same reason. This shows at least that the allocations were certainly the bottleneck.
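For reference, a sketch of that parallel variant: it combines the prange loop with the linspace-inside-the-loop workaround from earlier (combining the two is my assumption about how to also avoid the fusion bug; iterate_grid_parallel is a hypothetical name):

@numba.njit(parallel=True)
def iterate_grid_parallel(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]
    results = np.empty((sz**3, n))
    for i1 in numba.prange(sz):
        tmp = np.linspace(-scale, scale, sz)  # inside the loop: see the fusion workaround above
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                u1, u2, u3 = fast_power_method(A, v1, v2, v3)
                idx = i1 * sz**2 + i2 * sz + i3
                results[idx, 0] = u1
                results[idx, 1] = u2
                results[idx, 2] = u3
    return results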
The rules of thumb: always check results first, and remember that micro-optimizations can yield a larger speed-up than using multiple cores (especially for heavily numerical code).
UPDATE: I filed a Numba bug, available here.
Upvotes: 2