Reputation: 8848
I use OpenMP (OMP) to parallelize a for loop. However, it seems that OMP partitions my loop into chunks of equal size, e.g. for
for( int i = 0; i < n; ++i ) {
...
}
there are NUM_THREADS
blocks, each of size n/NUM_THREADS
. Unfortunately, I use this loop to sweep over a triangular matrix, so the last blocks have much more work to do than the first ones. What I really want to ask is how to perform load balancing in such a scenario. I could imagine that something like if ( i % THREAD_NUMBER == 0 ) ...
would work (in other words, each iteration of the loop is assigned to a different thread in round-robin fashion). I know this is not optimal, as it hurts cache locality, but is there a way to control the loop partitioning with OMP?
Upvotes: 3
Views: 1926
Reputation: 33679
I think schedule(guided)
is the right choice here. Your statement that the last blocks have more work to do is the opposite of what I would expect, but it depends on how you're doing the loop. Normally I would run over a triangular matrix something like this:
#pragma omp parallel for schedule(guided)
for (int i = 0; i < n - 1; i++) {
    for (int j = i + 1; j < n; j++) {
        // M[i][j]
    }
}
Let's choose n=101 and look at some schedules, assuming there are four threads.
If you use schedule(static)
, which is normally the default (but does not have to be), you get:
Thread one i = 0-24, j = 1-100, j range = 100
Thread two i = 25-49, j = 26-100, j range = 75
Thread three i = 50-74, j = 51-100, j range = 50
Thread four i = 75-99, j = 76-100, j range = 25
So the fourth thread only goes over j 25 times compared to 100 times for the first thread. The load is not balanced. If we switch to schedule(guided)
we get:
Thread one i = 0-24, j = 1-100, j range = 100
Thread two i = 25-44, j = 26-100, j range = 75
Thread three i = 45-59, j = 46-100, j range = 55
Thread four i = 60-69, j = 61-100, j range = 40
Thread one i = 70-76, j = 71-100
...
In its first chunk, the fourth thread runs over j at most 40 times compared to 100 for thread one. That's still not evenly balanced, but it's a lot better, and because the chunks shrink as the scheduler hands out further iterations, the load converges toward balance.
Upvotes: 2
Reputation: 1220
There is a clause called schedule
that can be added to your #pragma omp for
construct. With it you can specify how the chunks (what you would call one partition) are distributed over the threads.
A description of the scheduling variants can be found here. For your purpose, dynamic
or guided
would fit best.
With dynamic
each thread gets a chunk of iterations (much smaller than the total iteration count) and requests another chunk when it is finished.
With guided
nearly the same happens, but the chunk size decreases during execution, so you first get a large number of iterations and later fewer iterations per request.
Upvotes: 2