Reputation: 21
I am new to OpenMP and using it to parallelize a for-loop (to be accurate, I am using prange in Cython).
However, the operations are very uneven and, as a result, quite a few threads sit idle until one block of the for-loop completes.
I wanted to know whether there is a way to access the idle threads so that I can use them to parallelize the bottleneck operations.
Upvotes: 2
Views: 139
Reputation: 34337
This question boils down to the problem of optimal task scheduling, which is quite hard in the general case, so one usually falls back to heuristics.
OpenMP offers different heuristics for scheduling, which can be chosen via the schedule-argument to prange (documentation).
Let's look at the following example:
%%cython -c=/openmp --link-args=/openmp

cdef double calc(int n) nogil:
    cdef double d = 0.0
    cdef int i
    for i in range(n):
        d += 0.1 * i * n
    return d

def single_sum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in range(n):
        sum += calc(i)
    return sum
The evaluation of calc takes O(n), because an IEEE 754-compliant compiler is not able to optimize the for-loop into a closed form.
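To see why the load is uneven, note that one call calc(i) performs i inner iterations, so single_sum(N) does roughly N²/2 units of work in total, and the upper half of the index range carries about three quarters of it. A minimal pure-Python sketch of this cost model (an illustration, not part of the Cython code above):

```python
# Cost model: one call calc(i) does i inner iterations.
def work(lo, hi):
    """Total inner iterations for indices lo..hi-1."""
    return sum(range(lo, hi))

N = 4 * 10**4
total = work(0, N)
upper = work(N // 2, N)
print(round(upper / total, 3))  # upper half of indices carries ~0.75 of the work
```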
Now let's replace range with prange:
...
from cython.parallel import prange

def default_psum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2):
        sum += calc(i)
    return sum
I have chosen to limit the number of threads to 2, to make the effect more dramatic. Now, comparing the running times we see:
N=4*10**4
%timeit single_sum(N) #no parallelization
# 991 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit default_psum(N) #parallelization
# 751 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Not as much improvement as we would like (ideally a speed-up of 2)!
Which schedule is chosen when none is set explicitly is an implementation detail of the OpenMP provider, but most probably it will be "static" without a defined chunksize. In that case the range is halved: one thread gets the first, fast half, while the other gets the second half, where almost all of the work must be done - so a large part of the work is not parallelized in the end.
A strategy that achieves a better balance is to give i=0 to the first thread, i=1 to the second, i=2 again to the first, and so on. This can be achieved for the "static" schedule by setting chunksize to 1:
def static_psum1(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2, schedule="static", chunksize=1):
        sum += calc(i)
    return sum
With this we almost reach the maximal possible speed-up of 2:
%timeit static_psum1(N)
# 511 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Choosing the best schedule is a trade-off between scheduling overhead (not very high in the example above) and work balance - and the best trade-off can only be found after analyzing the problem (and the hardware!) at hand.
Here are some timings for the above example for different scheduling strategies and different numbers of threads:
(schedule, chunksize)   2 threads  8 threads
single-threaded 991 ms 991 ms
(default) 751 ms 265 ms
static 757 ms 274 ms
static,1 511 ms 197 ms
static,10 512 ms 166 ms
dynamic,1 509 ms 158 ms
dynamic,10 509 ms 156 ms
guided 508 ms 158 ms
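Why the dynamic schedule balances so well here can be sketched with a small greedy simulation: whichever thread becomes idle first grabs the next chunk. This is a simplified model of OpenMP's dynamic scheduling, not its actual implementation:

```python
import heapq

def makespan(chunks, num_threads):
    """Simulate dynamic scheduling: the earliest-free thread takes the next chunk."""
    finish = [0] * num_threads          # current finish time per thread
    heapq.heapify(finish)
    for cost in chunks:
        t = heapq.heappop(finish)       # idle thread grabs the chunk
        heapq.heappush(finish, t + cost)
    return max(finish)

N = 4 * 10**4
costs = list(range(N))                  # cost model: index i costs i units
ideal = sum(costs) / 2                  # perfect two-thread balance

# dynamic with chunksize=10: consecutive blocks of 10 indices
chunks = [sum(costs[k:k + 10]) for k in range(0, N, 10)]
print(round(makespan(chunks, 2) / ideal, 4))  # close to 1.0 -> near-perfect balance
```

The imbalance of the greedy assignment is bounded by the cost of the largest chunk, which is tiny compared to the total work - that is why dynamic,10 lands so close to the ideal 2x speed-up in the table above.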
Trying different schedules only makes sense when there is at least a theoretical possibility of achieving a good balance.
If there is a single task which takes 90% of the running time, then no matter which scheduling strategy is used, the performance cannot be improved. In that case the big task itself should be parallelized; sadly, Cython's support for OpenMP is somewhat lacking (see for example this SO-post), so it may be better to code in pure C and wrap the resulting functionality with Cython.
Upvotes: 2