pritam_coder

Reputation: 13

OpenMP bottleneck issue

I was trying to observe basic OpenMP-based parallelism with the following code:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <time.h>

int main() {
  long i;
  long x[] = {0, 0, 0, 0};
  omp_set_num_threads(4);
  clock_t time = clock();
  #pragma omp parallel for
  for (i = 0; i < 100000000; i++) {
    x[omp_get_thread_num()]++;   // each thread increments its own slot of x
  }
  double time_taken = (double)(clock() - time) / CLOCKS_PER_SEC;
  printf("%ld %ld %ld %ld %lf\n", x[0], x[1], x[2], x[3], time_taken);
}

I am using a quad-core i5 processor. I have tested with four different thread counts and obtained the following results:

Set: omp_set_num_threads(1);
Out: 100000000 0 0 0 0.203921

Set: omp_set_num_threads(2);
Out: 50000000 50000000 0 0 0.826322

Set: omp_set_num_threads(3);
Out: 33333334 33333333 33333333 0 1.448936

Set: omp_set_num_threads(4);
Out: 25000000 25000000 25000000 25000000 1.919655

The x array values are correct, but the time surprisingly increases with the number of threads. I cannot find any explanation for this behaviour. Is the omp_get_thread_num() function somehow atomic in nature? Or is there something else I am missing?

Compiled with: gcc -o test test.c -fopenmp

UPDATE

As per the suggestion in the accepted answer, I have modified the code as follows:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

int main() {
  long i, t_id, fact = 1096;
  long x[fact * 4];
  x[0] = x[fact] = x[2 * fact] = x[3 * fact] = 0;
  omp_set_num_threads(4);
  double time = omp_get_wtime();
  #pragma omp parallel for private(t_id)
  for (i = 0; i < 100000000; i++) {
    t_id = omp_get_thread_num();
    x[t_id * fact]++;   // counters are fact longs (~8 KB with 8-byte longs) apart, well beyond one cache line
  }
  double time_taken = omp_get_wtime() - time;
  printf("%ld %ld %ld %ld %lf\n", x[0], x[fact], x[2 * fact], x[3 * fact], time_taken);
}

Now the results make sense:

Set: omp_set_num_threads(1)
Out: 100000000 0 0 0 0.250205

Set: omp_set_num_threads(2)
Out: 50000000 50000000 0 0 0.154980

Set: omp_set_num_threads(3)
Out: 33333334 33333333 33333333 0 0.078874

Set: omp_set_num_threads(4)
Out: 25000000 25000000 25000000 25000000 0.061155

So it was indeed about the cache line size (false sharing), as explained in the accepted answer. Have a look there for the details.

Upvotes: 1

Views: 133

Answers (2)

Zulan

Reputation: 22670

The reasons for your observation, as noted by gha.st, are false sharing and the properties of the clock function.

For this reason, x[omp_get_thread_num()] is an anti-pattern. Sure, you can leverage your new knowledge by adding a stride in memory. But this also encodes hardware-specific properties (i.e. the cache line size) into your data structures. It can lead to nasty code that is difficult to understand and still has poor performance portability.

The idiomatic solution is to use one of the following:

  • If you are only interested in an aggregate, use a reduction clause, i.e.:

    long x = 0;
    #pragma omp parallel for reduction(+:x)
    for(i=0;i<100000000;i++){
        x++;
    }
    // total sum is now in x
    
  • If you need individual values within the thread, just use a private variable, preferably made private implicitly by scope. Or, if you need particular initialization from outside the construct, use firstprivate.

    #pragma omp parallel
    {
        long local_x = 0; // implicitly private by scope!
        #pragma omp for
        for(i=0;i<100000000;i++) {
            local_x++;
        }
        // can now do something with the sum of the current thread
    }
    
  • And if you need the per-thread results outside the parallel region, you can use the second form and write each thread's result exactly once:

    #pragma omp parallel
    {
        long local_x = 0; // implicitly private by scope!
        #pragma omp for
        for(i=0;i<100000000;i++) {
            local_x++;
        }
        x[omp_get_thread_num()] = local_x;
    }
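
Putting the first variant together into a complete program, it might look like the sketch below (a sketch only, reusing the loop bounds and thread count from the question and timing with omp_get_wtime() rather than clock()):

    #include <stdio.h>
    #include <omp.h>

    int main() {
        long i;
        long x = 0;

        omp_set_num_threads(4);
        double start = omp_get_wtime();  // wall-clock time, unlike clock()

        // each thread accumulates into its own private copy of x;
        // the copies are combined once when the loop ends
        #pragma omp parallel for reduction(+:x)
        for (i = 0; i < 100000000; i++) {
            x++;
        }

        double time_taken = omp_get_wtime() - start;
        printf("%ld %lf\n", x, time_taken);
    }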
    

That's not to say that you never need to design a data structure with false sharing in mind. But it's not as common as you might think.

Upvotes: 2

danielschemmel

Reputation: 11126

Note that the four integers you are operating on lie very close together in memory, probably on the same cache line. Since cache lines are loaded into the CPU cache in one go, each thread needs to ensure that it has the latest version of that cache line. Since all threads want to modify (and not just read) that one cache line, they are constantly invalidating one another's copies. Welcome to false sharing!

To solve this problem, ensure that the integers are (physically) far enough apart from one another, e.g., by allocating structures that fill (at least) one full cache line for each thread to work with.
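
One way to do that (a sketch only; the 64-byte cache line size is an assumption, not something the language or OpenMP guarantees) is a per-thread struct padded to a full cache line:

    #include <stdio.h>
    #include <omp.h>

    // one counter per thread, padded so that each counter
    // occupies its own cache line (64 bytes assumed here)
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];
    };

    int main() {
        struct padded_counter x[4] = {{0}};

        omp_set_num_threads(4);
        double start = omp_get_wtime();

        long i;
        #pragma omp parallel for
        for (i = 0; i < 100000000; i++) {
            x[omp_get_thread_num()].value++;   // each thread touches only its own cache line
        }

        printf("%ld %ld %ld %ld %lf\n",
               x[0].value, x[1].value, x[2].value, x[3].value,
               omp_get_wtime() - start);
    }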

When I executed your sample program using 4 threads on one of my machines, I got the following result:

25000000 25000000 25000000 25000000 5.049694

When I modified the program so that the array has 4096 elements and used elements 0, 1024, 2048 and 3072 (which ensures enough distance), it ran a lot faster:

25000000 25000000 25000000 25000000 1.617231

Note that you are measuring the processor time used by the whole process, not the wall-clock time. Even so, without false sharing that time should not increase significantly as threads are added, but rather stay more or less constant (there is some additional synchronization involved, but it should not usually be on the order of a 10x increase). In fact, the performance boost shown above also translates into wall-clock time (from about 1.25 seconds down to about 500 ms).
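
If you want to observe both clocks at once, a small sketch like the following (not your exact program; it splits the same 100000000 iterations across threads using only private counters) shows the difference: clock() sums the CPU time of all threads in the process and so stays near the serial time, while omp_get_wtime() reports wall-clock time and shrinks as threads are added:

    #include <stdio.h>
    #include <time.h>
    #include <omp.h>

    int main() {
        long total = 0;

        clock_t cpu_start = clock();          // CPU time of the whole process
        double wall_start = omp_get_wtime();  // wall-clock time

        #pragma omp parallel num_threads(4)
        {
            long local = 0;                   // private counter, so no false sharing
            #pragma omp for
            for (long i = 0; i < 100000000; i++) {
                local++;
            }
            #pragma omp atomic
            total += local;                   // one contended write per thread
        }

        double cpu_time  = (double)(clock() - cpu_start) / CLOCKS_PER_SEC;
        double wall_time = omp_get_wtime() - wall_start;

        // with the work split over 4 threads, cpu_time stays near the serial
        // time while wall_time shrinks to roughly a quarter of it
        printf("total=%ld cpu=%lf wall=%lf\n", total, cpu_time, wall_time);
    }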

Upvotes: 2
