Pavlin

Reputation: 5528

Cython nogil with ThreadPoolExecutor not giving speedups

I was under the assumption that if I wrote my code in Cython using the nogil directive, it would bypass the GIL, and I could then use a ThreadPoolExecutor to run on multiple cores. Or, more likely, I've messed something up in the implementation, but I can't seem to figure out what.

I've written a simple n-body simulation using the Barnes-Hut algorithm, and would like to do the lookup in parallel:

# cython: boundscheck=False
# cython: wraparound=False
...

def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces

    forces = np.zeros_like(query_point, dtype=np.float64)
    estimate_forces_multiple(self.root, query_point.data, forces, self.theta)

    return np.array(forces, dtype=np.float64)


cdef void estimate_forces_multiple(...) nogil:
    for i in range(len(query_points)):
        ...
        estimate_forces(cell, query_point, forces, theta)

And I call the code like this:

data = np.random.uniform(0, 100, (1000000, 2))

executor = ThreadPoolExecutor(max_workers=max_workers)

quad_tree = QuadTree(data)

# Split the query points across the workers; each chunk is looked up
# against the same shared tree, and the per-chunk results are stacked.
chunks = np.array_split(data, max_workers)
forces = executor.map(quad_tree.estimate_forces, chunks)
forces = np.vstack(list(forces))

I've omitted a lot of code to keep the relevant parts clear. My understanding is that increasing max_workers should use multiple cores and provide a substantial speedup; however, this does not seem to be the case:

> time python barnes_hut.py --max-workers 1
python barnes_hut.py  9.35s user 0.61s system 106% cpu 9.332 total

> time python barnes_hut.py --max-workers 2
python barnes_hut.py  9.05s user 0.64s system 107% cpu 9.048 total

> time python barnes_hut.py --max-workers 4
python barnes_hut.py  9.08s user 0.64s system 107% cpu 9.035 total

> time python barnes_hut.py --max-workers 8
python barnes_hut.py  9.12s user 0.71s system 108% cpu 9.098 total

Building the quad tree takes less than 1s, so the majority of the time is spent in estimate_forces_multiple, yet clearly I get no speedup from using multiple threads. Looking at top, it doesn't appear to be using multiple cores either (the ~107% CPU reported by time suggests the same).

My guess is that I must have missed something quite crucial, but I can't really figure out what.

Upvotes: 4

Views: 655

Answers (1)

Pavlin

Reputation: 5528

I was missing the crucial part that actually releases the GIL. Marking a cdef function nogil only declares that it is safe to call without the GIL; the lock still has to be released explicitly with a with nogil: block:

def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces

    forces = np.zeros_like(query_point, dtype=np.float64)
    # HERE: take a typed memoryview of the data and explicitly release the GIL
    cdef np.float64_t[:, :] query_points = query_point.data
    with nogil:
        estimate_forces_multiple(self.root, query_points, forces, self.theta)

    return np.array(forces, dtype=np.float64)
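To make the distinction concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual simulation code): the nogil annotation on a cdef function only promises that the function touches no Python objects, while the with nogil: block is what actually drops the lock so other threads can run in parallel.

# minimal sketch, assuming a standard Cython build
cdef double busy_sum(double[:] values) nogil:
    # `nogil` only declares this function safe to run without the GIL;
    # calling it does not release the GIL by itself.
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(values.shape[0]):
        total += values[i]
    return total

def busy_sum_released(double[:] values):
    cdef double total
    with nogil:  # this is what actually releases the lock
        total = busy_sum(values)
    return total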

I've also found that the UNIX time command doesn't measure what I wanted for multithreaded programs and reported the same numbers regardless of worker count (I guess it was reporting total CPU time rather than wall-clock time?). Using Python's timeit provided the expected results:

max_workers=1: 91.2366s
max_workers=2: 36.7975s
max_workers=4: 30.1390s
max_workers=8: 24.0240s
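A minimal sketch of one way to do this wall-clock timing (timeit.default_timer is an alias for time.perf_counter; the barnes_hut import and the max_workers value are placeholders, assuming the QuadTree class from the question):

import timeit
from concurrent.futures import ThreadPoolExecutor

import numpy as np

from barnes_hut import QuadTree  # placeholder: the compiled Cython module

max_workers = 4  # placeholder; the script reads this from --max-workers

data = np.random.uniform(0, 100, (1000000, 2))
quad_tree = QuadTree(data)
chunks = np.array_split(data, max_workers)

start = timeit.default_timer()  # wall clock, unlike time's summed CPU figures
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    forces = np.vstack(list(executor.map(quad_tree.estimate_forces, chunks)))
print(f"max_workers={max_workers}: {timeit.default_timer() - start:.4f}s")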

Upvotes: 4
