Reputation: 134
I'm writing a program that should run in both serial and parallel versions. Once I got it to actually do what it is supposed to do, I started trying to parallelize it with OpenMP (compulsory).
The thing is, I can't find documentation or references on when to use which #pragma, so I am doing my best with guessing and testing. But testing is not going well with nested loops.
How would you parallelize a series of nested loops like these:
for(int i = 0; i < 3; ++i){
    for(int j = 0; j < HEIGHT; ++j){
        for(int k = 0; k < WIDTH; ++k){
            switch(i){
            case 0:
                matrix[j][k].a = matrix[j][k] * someValue1;
                break;
            case 1:
                matrix[j][k].b = matrix[j][k] * someValue2;
                break;
            case 2:
                matrix[j][k].c = matrix[j][k] * someValue3;
                break;
            }
        }
    }
}
I know that OpenMP is not always good for nested loops but any help is welcome.
[UPDATE]:
So far I've tried unrolling the loop over i. It boosts performance, but am I adding unnecessary overhead here? Am I reusing threads? I tried getting the IDs of the threads used in each for loop but couldn't get it right.
#pragma omp parallel
{
    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }

    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }

    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }
}
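In case it helps, this is the kind of thing I was trying in order to see the thread IDs (just a sketch with omp_get_thread_num(); it needs omp.h and stdio.h):

#include <omp.h>
#include <stdio.h>

#pragma omp parallel
{
    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            // trying to see which thread runs which iteration
            printf("thread %d at (%d, %d)\n", omp_get_thread_num(), j, k);
        }
    }
    // ...same for the other two loops...
}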
[UPDATE 2]
Apart from unrolling the loop, I have tried parallelizing the outer loop (a worse performance boost than unrolling) and collapsing the two inner loops (more or less the same performance boost as unrolling). These are the times I am getting.
What do you think is the safest option? I mean, which should generally be the best for most systems, not only my computer?
Upvotes: 0
Views: 4427
Reputation: 15144
You probably want to parallelize this example with simd so the compiler can vectorize, and collapse the two loops, because you use j and k only in the expression matrix[j][k] and because there are no dependencies on any other element of the matrix. If nothing modifies someValue1, etc., they should be uniform. Time your loop to be sure those changes really do improve your speed.
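A sketch of what that could look like for one of the three passes (assuming the multiplication from the question compiles for your matrix element type):

// Vectorize and collapse the two loops into one iteration space.
// j and k appear only in matrix[j][k], so the collapse is safe.
#pragma omp parallel for simd collapse(2)
for (int j = 0; j < HEIGHT; ++j) {
    for (int k = 0; k < WIDTH; ++k) {
        matrix[j][k].a = matrix[j][k] * someValue1;
    }
}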
Upvotes: 1
Reputation: 78314
Here's another option, one which recognises that distributing the iterations of the outermost loop when there are only 3 of them might lead to very poor load balancing:

// i == 0
#pragma omp parallel for
for (int j = 0; j < HEIGHT; ++j) {
    for (int k = 0; k < WIDTH; ++k) {
        ...
    }
}

// i == 1
#pragma omp parallel for
for (int j = 0; j < HEIGHT; ++j) {
    for (int k = 0; k < WIDTH; ++k) {
        ...
    }
}

// i == 2
#pragma omp parallel for
for (int j = 0; j < HEIGHT; ++j) {
    for (int k = 0; k < WIDTH; ++k) {
        ...
    }
}
Warning -- check the syntax yourself, this is no more than a sketch of manual loop unrolling.
Try combining this with collapsing the j and k loops.
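For instance, the i == 0 block would become something like this (same caveat as above, check the syntax yourself):

// i == 0, with the j and k loops collapsed into one iteration space
#pragma omp parallel for collapse(2)
for (int j = 0; j < HEIGHT; ++j) {
    for (int k = 0; k < WIDTH; ++k) {
        ...
    }
}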
Oh, and don't complain about code duplication, you've told us you're being scored partly on performance improvements.
Upvotes: 1
Reputation: 26286
The problem with OpenMP is that it's very high-level, meaning that you can't access low-level functionality such as spawning a thread and then reusing it. So let me make it clear what you can and can't do:
Assuming you don't need any mutex to protect against race conditions, here are your options:
You parallelize your outermost loop: that will use 3 threads, and it's the most peaceful solution you're gonna have.
You parallelize the first inner loop: then you'll have a performance boost only if the overhead of spawning the threads is much smaller than the effort required to perform the innermost loop over WIDTH elements.
You parallelize the innermost loop: this is the worst solution in the world, because you'll respawn the threads 3*HEIGHT times. Never do that!
You don't use OpenMP at all and use something low-level, such as std::thread, where you can create your own thread pool and push all the operations you want to do into a queue (see the sketch below).
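A minimal sketch of that last idea, assuming C++11 (the ThreadPool and enqueue names are made up for illustration, not a library API):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal pool: n worker threads pop tasks from a shared queue.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // the same threads are reused for every task
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

You could then enqueue, say, one row of the matrix per task; the threads are created once and reused for the whole computation.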
Hope this helps to put things in perspective.
Upvotes: 1