Reputation: 29
This code is slower with OpenMP. Without OpenMP I get about 10s. With OpenMP I get about 40s. What is happening? Thank you very much, friends!
for (i = 2; i < (nnoib-2); ++i) {
    #pragma omp parallel for
    for (j = 2; j < (nnojb-2); ++j) {
        C[i][j] = absi[i]*absj[j]*
                  (2.0f*B[i][j] + absi[i]*absj[j]*
                   (VEL[i][j]*VEL[i][j]*fat*
                    (16.0f*(B[i][j-1]+B[i][j+1]+B[i-1][j]+B[i+1][j])
                     -1.0f*(B[i][j-2]+B[i][j+2]+B[i-2][j]+B[i+2][j])
                     -60.0f*B[i][j]
                    )-A[i][j]));
        c2 = (abs(C[i][j]) > Amax[i][j]);
        if (c2) {
            Amax[i][j] = abs(C[i][j]);
            Ttra[i][j] = t;
        }
    }
}
Upvotes: 2
Views: 868
Reputation: 932
Just because you're using OpenMP doesn't mean your program will run faster. A couple of things could be happening here:
There is a cost associated with spawning each thread, and if you spawn a thread to do only a small amount of computation, spawning the thread will take more time than the computation itself.
By default, OpenMP will spawn the maximum number of threads supported by your CPU. With CPUs that support 2 or more threads per core, the threads will be competing for each core's resources. You can check how many threads will be spawned by default with omp_get_max_threads() (omp_get_num_threads() only reports the size of the current team, so it returns 1 outside a parallel region). I recommend trying to run your code with half that value, set via omp_set_num_threads().
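For instance, here is a minimal sketch of querying the default and halving it (just an illustration of the two calls; the right number depends on your machine):
#include <stdio.h>
#include <omp.h>

int main(void) {
    int max_threads = omp_get_max_threads();   /* default team size for the next parallel region */
    printf("default threads: %d\n", max_threads);

    /* on a CPU with 2 hardware threads per core, try one thread per physical core */
    omp_set_num_threads(max_threads > 1 ? max_threads / 2 : 1);

    #pragma omp parallel
    {
        #pragma omp single
        printf("team size now: %d\n", omp_get_num_threads());
    }
    return 0;
}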
Did you confirm the results are the same with and without OpenMP? There is a data race on the shared variables j and c2. You should declare them private to each thread:
#pragma omp parallel for private(j,c2)
I wanted to add another thing: before attempting any parallelization, you should make sure the code is already optimized. Depending on your compiler, the compiler flags and the complexity of the expressions, the compiler may or may not do these optimizations for you:
// avoid computing nnoib-2 on every iteration
int t_nnoib = nnoib - 2;
for (i = 2; i < t_nnoib; ++i) {
    // avoid computing nnojb-2 on every j iteration
    int t_nnojb = nnojb - 2;
    // avoid loading absi[i] on every j iteration
    // (float, not int: absi appears to hold floats, and int would truncate the value)
    float t_absi = absi[i];
    for (j = 2; j < t_nnojb; ++j) {
        C[i][j] = t_absi * absj[j] *
                  (2.0f*B[i][j] + t_absi * absj[j] *
                   (VEL[i][j] * VEL[i][j] * fat *
                    (16.0f * (B[i][j-1] + B[i][j+1] + B[i-1][j] + B[i+1][j])
                     -1.0f * (B[i][j-2] + B[i][j+2] + B[i-2][j] + B[i+2][j])
                     -60.0f * B[i][j]
                    ) - A[i][j]));
        // c2 is a useless variable: test the condition directly
        // (fabsf from <math.h>, assuming C holds floats; abs would convert to int)
        if (fabsf(C[i][j]) > Amax[i][j]) {
            Amax[i][j] = fabsf(C[i][j]);
            Ttra[i][j] = t;
        }
    }
}
It may not seem like much, but it can have a huge impact on your code. The compiler will try to place local variables in registers (which have much faster access times). Keep in mind that you can't apply this technique indefinitely, since you have a limited number of registers, and abusing it will cause your code to suffer from register spilling.
In the case of the array absi, you also avoid having the system keep a piece of that array in cache during the execution of the j loop. The general idea of this technique is to move to the outer loop any array access that doesn't depend on the inner loop's variable.
Upvotes: 3
Reputation: 38108
In addition to the costs mentioned by Cristiano, your choice to parallelize over the j loop rather than over the i loop poses a risk of false sharing in the three arrays being assigned, C, Amax and Ttra. Essentially, when one thread writes to an element of one of those arrays, contiguous elements on the same cache line are also loaded into that core's cache. When another core then goes to write its own values to different entries, it has to pull the line from the other cache, with multiple cores potentially playing 'tug-of-war'.
The solution to this is to parallelize the outer loop over i instead of the inner loop over j. Conveniently, that also drastically reduces the costs mentioned in Cristiano's answer, since the thread spawns and work assignments will only happen once, rather than on every iteration of the i loop. You'll still need to privatize j and c2, or simply inline the value of c2 in the subsequent if and eliminate the variable (as described in your comment). For better efficiency, using a locally declared variable instead of j would mean not having to access a thread-private variable.
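A rough sketch of that combination (assuming the arrays hold floats, that <omp.h> and <math.h> are included, and using locally declared loop variables so they are private by default; adapt to your actual types):
#pragma omp parallel for
for (int ii = 2; ii < nnoib - 2; ++ii) {        /* threads split the outer loop once */
    for (int jj = 2; jj < nnojb - 2; ++jj) {    /* locally declared, hence private */
        C[ii][jj] = absi[ii]*absj[jj]*
                    (2.0f*B[ii][jj] + absi[ii]*absj[jj]*
                     (VEL[ii][jj]*VEL[ii][jj]*fat*
                      (16.0f*(B[ii][jj-1]+B[ii][jj+1]+B[ii-1][jj]+B[ii+1][jj])
                       -1.0f*(B[ii][jj-2]+B[ii][jj+2]+B[ii-2][jj]+B[ii+2][jj])
                       -60.0f*B[ii][jj]
                      ) - A[ii][jj]));
        if (fabsf(C[ii][jj]) > Amax[ii][jj]) {  /* c2 inlined and removed */
            Amax[ii][jj] = fabsf(C[ii][jj]);
            Ttra[ii][jj] = t;
        }
    }
}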
Just as a (rather important) check: is this loop nest in fact the portion of your program that you've measured to be taking the bulk of the time? Did adding the OpenMP pragma change its time from just under 10s to just under 40s?
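If you haven't isolated that measurement yet, OpenMP's wall-clock timer is a quick way to time just this loop nest (a sketch; needs <stdio.h> and <omp.h>):
double t0 = omp_get_wtime();   /* wall-clock time before the loop nest */
/* ... the loop nest from above ... */
double t1 = omp_get_wtime();
printf("loop nest: %.3f s\n", t1 - t0);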
Upvotes: 2