Stanislav Morozov

Reputation: 133

Numba slows down the loop with independent iterations

I'm trying to parallelize a simple Python program using Numba. It consists of two functions. The first one is a simple power method:

import numba
import numpy as np

@numba.jit(nopython=True)
def power_method(A, v):
    # 3000 iterations of u <- A u / ||A u||
    u = v.copy()
    for i in range(3 * 10**3):
        u = A @ u
        u /= np.linalg.norm(u)
    return u

The second function iterates the vector v over a grid and runs power_method for each grid vector v:

@numba.jit(nopython=True)
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]

    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)
    
    for i1 in range(sz):
        v = np.empty(n, dtype=np.float64)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                v[0] = v1
                v[1] = v2
                v[2] = v3

                u = power_method(A, v)

                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = u.copy()
    
    return results

Then I run it with

n = 3
A = np.random.randn(n, n)
iterate_grid(A, 5.0, 20)

All iterations are independent. Moreover, the working set easily fits in cache, so I would expect that parallelizing the outer loop with prange would give roughly linear speedup.

However, the wall time for the sequential code is 6.07 s, while for the parallel code

@numba.jit(nopython=True, parallel=True)
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]

    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)
    
    for i1 in numba.prange(sz):
        v = np.empty(n, dtype=np.float64)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                v[0] = v1
                v[1] = v2
                v[2] = v3

                u = power_method(A, v)

                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = u.copy()
    
    return results

the wall time is 7.79 s. That is, parallelization slows down the code in this case.

Moreover, as I can see from iterate_grid.parallel_diagnostics(level=4), Numba fuses tmp = np.linspace(-scale, scale, sz) with the for i1 in numba.prange(sz) loop, which is incorrect, because I need all values of tmp in the inner loops.

Parallelizing the power method is not the task I'm actually trying to solve; it's just a small example where Numba does not behave as I expect. Can you explain this behavior of Numba and advise me how to parallelize the grid iterations to get linear speedup?

Thank you in advance for your help!

Upvotes: 3

Views: 309

Answers (1)

Jérôme Richard

Reputation: 50308

Correctness prevails over performance. Thus, you need to check that the results are actually correct before trying to measure the performance of an implementation. And it turns out that the results of the sequential and parallel implementations are different (you can use np.allclose to check that, or plot the differences between the arrays). Since the parallel code is basically the same except for the use of prange, a race condition is very likely the cause. Running the program multiple times and checking whether the results differ is a good way to test for a probable race condition (note that the absence of differences does not prove there is no race condition, and other problems can also cause non-deterministic results). A practical test on my machine gives very different results.
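For instance, a minimal check along these lines exposes the mismatch (the names iterate_grid_seq and iterate_grid_par are mine, standing for the range- and prange-based variants above, which share a name in the question):

# Hypothetical correctness check: compare the outputs of the
# sequential (range) and parallel (prange) variants.
res_seq = iterate_grid_seq(A, 5.0, 20)
res_par = iterate_grid_par(A, 5.0, 20)
print(np.allclose(res_seq, res_par))    # False here: results differ
print(np.abs(res_seq - res_par).max())  # magnitude of the mismatch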

The thing is, your code looks completely correct! Generally this means the problem is due to undefined behaviour or a compiler/runtime bug. A deeper analysis shows that the problem does not appear if tmp = np.linspace(-scale, scale, sz) is moved inside the parallel loop. This is certainly due to a bug in Numba.
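A minimal sketch of that workaround (the function name iterate_grid_workaround is mine, and this only sidesteps the bug rather than fixing it):

@numba.jit(nopython=True, parallel=True)
def iterate_grid_workaround(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]

    results = np.empty((sz**3, n))

    for i1 in numba.prange(sz):
        # Recomputed inside the loop so the parallel transform
        # cannot fuse it with the prange loop.
        tmp = np.linspace(-scale, scale, sz)
        v = np.empty(n, dtype=np.float64)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                v[0] = v1
                v[1] = v2
                v[2] = v3

                u = power_method(A, v)

                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = u.copy()

    return results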

Assuming the results were correct, the performance would still be strongly dependent on the power_method function, which is not efficient. Indeed, neither Numba nor NumPy is optimized for very small arrays. Numba rarely optimizes array allocations away, except in some specific cases (like explicit allocations in a prange-based loop with detected invariants, or similar cases, as pointed out by @max9111): arrays are allocated on the heap using a rather slow allocation method. On top of that, allocations do not scale, preventing any parallel speedup. One optimization is to perform the operations manually using 3 scalar variables. Here is an optimized implementation:

@numba.njit
def fast_power_method(A, v1, v2, v3):
    # Unpacking
    # Note: there is no need for a copy here
    u1, u2, u3 = v1, v2, v3
    A11, A12, A13 = A[0]
    A21, A22, A23 = A[1]
    A31, A32, A33 = A[2]

    for i in range(3_000):
        # Optimized matrix multiplication
        t1 = u1 * A11 + u2 * A12 + u3 * A13
        t2 = u1 * A21 + u2 * A22 + u3 * A23
        t3 = u1 * A31 + u2 * A32 + u3 * A33

        # Renormalization
        # Note: multiplications are faster than divisions
        norm = np.sqrt(t1**2 + t2**2 + t3**2)
        inv_norm = 1.0 / norm
        u1 = t1 * inv_norm
        u2 = t2 * inv_norm
        u3 = t3 * inv_norm

    return u1, u2, u3

@numba.njit
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]

    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)

    for i1 in range(sz):
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                u1, u2, u3 = fast_power_method(A, v1, v2, v3)

                idx = i1 * sz**2 + i2 * sz + i3
                results[idx, 0] = u1
                results[idx, 1] = u2
                results[idx, 2] = u3
    
    return results

This implementation takes 0.4 seconds, while the initial version takes about 10 seconds. Thus, it is 25 times faster while still being sequential. Results are slightly different due to floating-point rounding and non-associativity. Note that using prange with this version is 5 times faster on my 6-core machine, so it scales well, but the results are still wrong for the same reason as before. This shows, at least, that allocations were certainly the bottleneck.
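For reference, a parallel variant might look like the sketch below, combining the optimized kernel with the linspace workaround described above (the function name iterate_grid_parallel is mine):

@numba.njit(parallel=True)
def iterate_grid_parallel(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]

    results = np.empty((sz**3, n))

    for i1 in numba.prange(sz):
        # linspace is recomputed per iteration to avoid the fusion bug
        tmp = np.linspace(-scale, scale, sz)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                u1, u2, u3 = fast_power_method(A, v1, v2, v3)

                idx = i1 * sz**2 + i2 * sz + i3
                results[idx, 0] = u1
                results[idx, 1] = u2
                results[idx, 2] = u3

    return results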

The rules of thumb: always check the results first, and remember that micro-optimizations can yield a bigger speedup than using multiple cores (especially for heavily numerical code).

UPDATE: I filed a Numba bug, available here.

Upvotes: 2
