Numba in nonpython mode is much slower than pure python (no print statements or specified numpy functions)

Question

I have recently discovered that Numba may work much slower than pure python even in non-python mode with the parrallel=True option enabled.
Important: If you don't deal with Voronoi diagrams please continue reading, my question doesn't relate to them directly.
Currently, I am working on a problem where I have energy associated with the Voronoi diagram's edges and cells areas. The scipy Vornoi returns an array containing couples of points (vor.ridge_points) associated with each Vornoi edge. For my code, I want to have the ability to get index of the edge when providing indexes of the associated points, so I define a kind of adjacency matrix, but instead of ones and zeros, it has zeros and indexes of edges.
It turns out that pure python when performing cycles over numpy arrays turns to be 10 times faster than numba. Here is a toy example (i just randomly generated arrays, for the same number of edges and points as in my simulation).
My guess that it has something to do with memory allocation. Any take on the subject would be apprectiated (the reason why is it so much slower or a better way to get edge number from numbers of points) :)

# %%
from numba.np.ufunc import parallel
import numpy as np
from numba import njit
from numba import prange

# %% generating array that models array og ridges
points_number = 8802
ridges_number = 26379

np.random.seed(123)
ridge_points = np.random.randint(points_number, size=(ridges_number, 2))

# %% symmetric matrix containing indexes of all edges
# in space [original_point_1, original_point_2]
ridge_points = np.array(ridge_points, dtype=np.int32)


@njit(parallel=True, cache=True)
def jit_edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)

    for i in prange(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i

    return matrix


e_matrix_op = jit_edges_matrix_op(ridge_points, ridges_number)
# %% the same but not jitted


def edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)

    for i in range(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i

    return matrix


e_matrix_op = edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
jit_edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
edges_matrix_op(ridge_points, ridges_number)

UPDATE

Indeed parallelization is not working properly here, so I run tests with parallel=False. Here are the results
630 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=True

553 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=False

66.5 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) - pure python

UPDATE 2

Thanks to max9111 sharing a link https://github.com/numba/numba/issues/7259 There seems to be an issue with allocating large arrays with zeros (np.zeros) The issue has been reported a couple of weeks ago, and the link contains some workaround examples.

I tried allocating np.empty()

29.7 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=True

44.7 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=False

60.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- pure python

And as you can see parallelized numba works the best, so this task is parallizable and overhead is not that big

Numba in nonpython mode is much slower than pure python (no print statements or specified numpy functions)

UPDATE

UPDATE 2

Answers (1)

Related Questions