math

Reputation: 8848

Is there a way to control the partitioning of an OpenMP parallel for construct?

I use OpenMP (OMP) to parallelize a for loop. However, it seems that OMP partitions my loop into equally sized intervals, e.g. for

for( int i = 0; i < n; ++i ) {
 ...
}

there are NUM_THREADS blocks, each of size n/NUM_THREADS. Unfortunately, I use this to parallelize a sweep over a triangular matrix, so the last blocks have much more work to do than the first ones. What I really want to ask is how to perform load balancing in such a scenario. I could imagine that something like if (i % NUM_THREADS == THREAD_NUMBER) would be fine (in other words, each consecutive iteration is assigned to a different thread). I know this is not optimal, as cache locality would suffer, but is there a way to control the loop partitioning with OMP?

Upvotes: 3

Views: 1926

Answers (2)

Z boson

Reputation: 33679

I think schedule(guided) is the right choice here. Your statement that the last blocks have more work to do is the opposite of what I would expect, but it depends on how you write the loop. Normally I would sweep over a triangular matrix like this:

#pragma omp parallel for schedule(guided)
for(int i=0; i<n-1; i++) {
    for(int j=i+1; j<n; j++) {
        //M[i][j]
    }
}

Let's choose n=101 and look at some schedulers. Assume there are four threads

If you use the schedule(static) which is normally the default (but it does not have to be).

Thread one   i =  0-24, j =  1-100, j range = 100
Thread two   i = 25-49, j = 26-100, j range = 75
Thread three i = 50-74, j = 51-100, j range = 50
Thread four  i = 75-99, j = 76-100, j range = 25

So the fourth thread only goes over j 25 times compared to 100 times for the first thread. The load is not balanced. If we switch to schedule(guided) we get:

Thread one   i =  0-24, j =  1-100, j range = 100
Thread two   i = 25-44, j = 26-100, j range = 75
Thread three i = 45-59, j = 46-100, j range = 55
Thread four  i = 60-69, j = 61-100, j range = 40
Thread one   i = 70-76, j = 71-100
...

Now the fourth thread's first row runs over j 40 times, compared to 100 for thread one. That's still not perfectly balanced, but it's a lot better, and since the chunks keep shrinking as the scheduler hands out further iterations, the balance improves as the loop progresses.

Upvotes: 2

Henkersmann

Reputation: 1220

There is a clause called schedule that can be added to your #pragma omp for construct.

With it you can specify how the chunks (what you would call partitions) are distributed over the threads.

A description of the scheduling variants can be found here. For your purpose, dynamic or guided fits best.

With dynamic, each thread gets a chunk of the same fixed size (much smaller than the total iteration count) and requests another chunk when it finishes.

With guided, nearly the same happens, but the chunk size decreases during execution, so threads first receive large chunks and later smaller ones with each request.

Upvotes: 2
