Reputation: 5528
I was under the assumption that writing my code in Cython with the nogil directive would indeed release the GIL, so I could use a ThreadPoolExecutor to take advantage of multiple cores. More likely, I've messed something up in the implementation, but I can't seem to figure out what.
I've written a simple n-body simulation using the Barnes-Hut algorithm, and would like to do the lookup in parallel:
# cython: boundscheck=False
# cython: wraparound=False
...

def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces
    forces = np.zeros_like(query_point, dtype=np.float64)
    estimate_forces_multiple(self.root, query_point.data, forces, self.theta)
    return np.array(forces, dtype=np.float64)

cdef void estimate_forces_multiple(...) nogil:
    for i in range(len(query_points)):
        ...
        estimate_forces(cell, query_point, forces, theta)
And I call the code like this:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

data = np.random.uniform(0, 100, (1000000, 2))
executor = ThreadPoolExecutor(max_workers=max_workers)
quad_tree = QuadTree(data)
chunks = np.array_split(data, max_workers)
forces = executor.map(quad_tree.estimate_forces, chunks)
forces = np.vstack(list(forces))
I've omitted a lot of code to make the part in question clearer. My understanding is that increasing max_workers should use multiple cores and provide a substantial speedup; however, this does not seem to be the case:
> time python barnes_hut.py --max-workers 1
python barnes_hut.py 9.35s user 0.61s system 106% cpu 9.332 total
> time python barnes_hut.py --max-workers 2
python barnes_hut.py 9.05s user 0.64s system 107% cpu 9.048 total
> time python barnes_hut.py --max-workers 4
python barnes_hut.py 9.08s user 0.64s system 107% cpu 9.035 total
> time python barnes_hut.py --max-workers 8
python barnes_hut.py 9.12s user 0.71s system 108% cpu 9.098 total
Building the quad tree takes less than 1 s, so the majority of the time is spent in estimate_forces_multiple, but clearly I get no speedup from using multiple threads. Looking at top, it doesn't appear to use multiple cores either.
My guess is that I must have missed something quite crucial, but I can't really figure out what.
Upvotes: 4
Views: 655
Reputation: 5528
I was missing the crucial part that actually releases the GIL:
def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces
    forces = np.zeros_like(query_point, dtype=np.float64)
    # HERE: take a typed memoryview and enter an explicit nogil block
    cdef DTYPE_t[:, :] query_points = query_point.data
    with nogil:
        estimate_forces_multiple(self.root, query_points, forces, self.theta)
    return np.array(forces, dtype=np.float64)
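The key point is that declaring a function nogil only promises that it can run without the GIL; nothing is actually released until execution enters a with nogil: block. A minimal sketch of the difference (the names here are illustrative, not from my actual code):

# cython: language_level=3

cdef double heavy_loop(double[:] xs) nogil:
    # "nogil" in the signature only declares that this function does not
    # require the GIL; calling it from regular def code still holds the GIL.
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(xs.shape[0]):
        total += xs[i] * xs[i]
    return total

def run(double[:] xs):
    cdef double result
    # The GIL is released only for the duration of this block, so other
    # threads can execute heavy_loop (or Python code) concurrently.
    with nogil:
        result = heavy_loop(xs)
    return result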
I've also found that the UNIX time command doesn't do what I wanted for multithreaded programs and reported the same numbers regardless of the worker count (I guess it reported CPU time?). Using Python's timeit provided the expected results:
max_workers=1: 91.2366s
max_workers=2: 36.7975s
max_workers=4: 30.1390s
max_workers=8: 24.0240s
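Roughly, the measurement is just wall-clock timing around the executor.map call, e.g. with timeit.default_timer; a sketch reusing the names from the question, not necessarily the exact harness I ran:

import timeit

start = timeit.default_timer()
forces = np.vstack(list(executor.map(quad_tree.estimate_forces, chunks)))
elapsed = timeit.default_timer() - start
print(f"max_workers={max_workers}: {elapsed:.4f}s")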
Upvotes: 4