agmnu

Reputation: 21

Can idle threads in OpenMP/Cython be used to parallelize the remaining section of the work block?

I am new to OpenMP and using it to parallelize a for-loop (to be accurate, I am using prange in Cython).

However, the operations are very uneven in cost, and as a result quite a few threads sit idle until the slowest block of the for-loop is completed.

I wanted to know whether there is a way to access the idle threads so that I can use them to parallelize the bottleneck operations.

Upvotes: 2

Views: 139

Answers (1)

ead

Reputation: 34337

This question boils down to the question of perfect scheduling of tasks, which is quite hard in the general case, so usually one falls back on heuristics.

OpenMP offers different heuristics for scheduling, which can be chosen via the schedule argument to prange (documentation).

Let's look at the following example:

%%cython -c=/openmp --link-args=/openmp
cdef double calc(int n) nogil:
    # the cost of one work item grows linearly with n
    cdef double d=0.0
    cdef int i
    for i in range(n):
        d+=0.1*i*n
    return d

def single_sum(int n):
    # serial baseline
    cdef int i
    cdef double sum = 0.0
    for i in range(n):
        sum += calc(i)
    return sum
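
(A note on the build flags: /openmp is the MSVC syntax on Windows; with a gcc/clang toolchain the same cell would typically be compiled as, e.g.:)

%%cython --compile-args=-fopenmp --link-args=-fopenmp
# same cell body as above (calc and single_sum)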

The evaluation of calc takes O(n), because an IEEE 754-compliant compiler is not able to optimize the for-loop away.

Now let's replace range with prange:

...
from cython.parallel import prange   
def default_psum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2):
        sum += calc(i)
    return sum

I have chosen to limit the number of threads to 2 in order to make the effect more dramatic. Comparing the running times, we see:

N=4*10**4
%timeit single_sum(N) #no parallelization
# 991 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit default_psum(N) #parallelization
# 751 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not as much improvement as we would like (ideally a speed-up of 2)!

Which schedule is chosen when it is not set explicitly is an implementation detail of the OpenMP provider - but most probably it will be "static" without a defined chunksize. In this case the range is simply halved: one thread gets the first, fast half, while the other gets the second half, where almost all of the work has to be done - so a large part of the work isn't parallelized at all in the end.
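
For comparison, the default behaviour can essentially be reproduced by requesting "static" explicitly without a chunksize - a minimal sketch (the name static_psum is just for illustration; the timings table below lists this variant as "static"):

# same %%cython cell as calc above
from cython.parallel import prange

def static_psum(int n):
    cdef int i
    cdef double sum = 0.0
    # "static" without chunksize: the range is split into one contiguous
    # block per thread, so the slow second half lands on a single thread
    for i in prange(n, nogil=True, num_threads=2, schedule="static"):
        sum += calc(i)
    return sum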

A better strategy to achieve balance is to give i=0 to the first thread, i=1 to the second, i=2 again to the first, and so on. For the "static" schedule this can be achieved by setting chunksize to 1:

def static_psum1(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2, schedule="static", chunksize=1):
        sum += calc(i)
    return sum

We almost reach the maximum possible speed-up of 2:

%timeit static_psum1(N)
# 511 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Choosing the best schedule is a trade-off between scheduling overhead (not very high in the example above) and work balance - and the best trade-off can only be found after analyzing the problem (and the hardware!) at hand.

Here are some timings for the above example with different scheduling strategies and different numbers of threads:

(schedule,chunksize)     2 threads            8 threads
single-threaded            991 ms               991 ms
(default)                  751 ms               265 ms
static                     757 ms               274 ms
static,1                   511 ms               197 ms
static,10                  512 ms               166 ms
dynamic,1                  509 ms               158 ms
dynamic,10                 509 ms               156 ms
guided                     508 ms               158 ms
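
The dynamic and guided rows correspond to variants along the lines of the following sketch (the name dynamic_psum and the chunk size of 10 are just for illustration); with these schedules the chunks are handed out to the threads at runtime, which is what evens out the load in this example:

# same %%cython cell as calc above
from cython.parallel import prange

def dynamic_psum(int n):
    cdef int i
    cdef double sum = 0.0
    # chunks of 10 iterations are assigned to whichever thread is free
    for i in prange(n, nogil=True, num_threads=2,
                    schedule="dynamic", chunksize=10):
        sum += calc(i)
    return sum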

Trying different schedules only makes sense when there is at least a theoretical possibility of achieving a good balance.

If there is one task which takes 90% of the running time, then no matter which scheduling strategy is used, it will not be possible to improve the performance much. In that case the big task itself should be parallelized; sadly, Cython's support for OpenMP is somewhat lacking (see for example this SO post), so it is possibly better to code it in pure C and then wrap the resulting functionality with Cython.
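
For completeness: in the simplest case, where the dominant task is itself just a big reducible loop (like calc above), it can still be parallelized directly in Cython by moving prange to its inner loop - a minimal sketch, assuming the outer loop stays serial (nested prange and OpenMP tasks are exactly the parts that are poorly supported):

# same %%cython cell as before
from cython.parallel import prange

def parallel_calc(int n):
    # parallelize the inner loop of the expensive task itself
    cdef double d = 0.0
    cdef int i
    for i in prange(n, nogil=True, num_threads=2):
        d += 0.1*i*n
    return d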

Upvotes: 2
