pritam_coder

Reputation: 13

OpenMP bottleneck issue

I was trying to observe basic OpenMP-based parallelism with the following code:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <time.h>

int main() {
  long i;
  long x[] = {0, 0, 0, 0};
  omp_set_num_threads(4);
  clock_t time = clock();
  #pragma omp parallel for
  for (i = 0; i < 100000000; i++) {
    x[omp_get_thread_num()]++;   // each thread increments its own slot of x
  }
  double time_taken = (double)(clock() - time) / CLOCKS_PER_SEC;
  printf("%ld %ld %ld %ld %lf\n", x[0], x[1], x[2], x[3], time_taken);
}

I am using a quad-core i5 processor. I have tested with four different thread counts and obtained the following results:

Set: omp_set_num_threads(1);
Out: 100000000 0 0 0 0.203921

Set: omp_set_num_threads(2);
Out: 50000000 50000000 0 0 0.826322

Set: omp_set_num_threads(3);
Out: 33333334 33333333 33333333 0 1.448936

Set: omp_set_num_threads(4);
Out: 25000000 25000000 25000000 25000000 1.919655

The x array values are correct, but the time surprisingly increases with the number of threads. I cannot find any explanation for this behaviour. Is the omp_get_thread_num() function somehow atomic in nature? Or is there something else I am missing?

Compiled with: gcc -o test test.c -fopenmp

UPDATE

As per the suggestion in the accepted answer, I have modified the code as follows:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

int main() {
  long i, t_id, fact = 1096;
  long x[fact * 4];
  x[0] = x[fact] = x[2 * fact] = x[3 * fact] = 0;
  omp_set_num_threads(4);
  double time = omp_get_wtime();
  #pragma omp parallel for private(t_id)
  for (i = 0; i < 100000000; i++) {
    t_id = omp_get_thread_num();
    x[t_id * fact]++;   // counters are fact longs (~8 KB with 8-byte longs) apart, well beyond one cache line
  }
  double time_taken = omp_get_wtime() - time;
  printf("%ld %ld %ld %ld %lf\n", x[0], x[fact], x[2 * fact], x[3 * fact], time_taken);
}

Now the results make sense:

Set: omp_set_num_threads(1)
Out: 100000000 0 0 0 0.250205

Set: omp_set_num_threads(2)
Out: 50000000 50000000 0 0 0.154980

Set: omp_set_num_threads(3)
Out: 33333334 33333333 33333333 0 0.078874

Set: omp_set_num_threads(4)
Out: 25000000 25000000 25000000 25000000 0.061155

So it was indeed about the cache line size (false sharing), as explained in the accepted answer. Have a look there for the details.

Upvotes: 1

Views: 133

Answers (2)

Zulan

Reputation: 22670

The reasons for your observation, as noted by gha.st, are false sharing and the properties of the clock function.

For this reason, x[omp_get_thread_num()] is an anti-pattern. Sure, you can leverage your new knowledge by adding a stride in memory. But this also encodes hardware-specific properties (i.e. the cache line size) into your data structures. It can lead to nasty code that is difficult to understand and still has poor performance portability.

The idiomatic solution is to use one of the following:

  • If you are only interested in an aggregate, use a reduction clause, i.e.:

    long x = 0;
    #pragma omp parallel for reduction(+:x)
    for(i=0;i<100000000;i++){
        x++;
    }
    // total sum is now in x
    
  • If you need individual values within the thread, just use a private variable, preferably made private implicitly by scope. Or, if you need particular initialization from outside the construct, use firstprivate.

    #pragma omp parallel
    {
        long local_x = 0; // implicitly private by scope!
        #pragma omp for
        for(i=0;i<100000000;i++) {
            local_x++;
        }
        // can now do something with the sum of the current thread
    }
    
  • And if you need the per-thread results outside the parallel region, you can use the second form and write each thread's result exactly once:

    #pragma omp parallel
    {
        long local_x = 0; // implicitly private by scope!
        #pragma omp for
        for(i=0;i<100000000;i++) {
            local_x++;
        }
        x[omp_get_thread_num()] = local_x;
    }
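
Putting the first variant together into a complete program, it might look like the sketch below (a sketch only, reusing the loop bounds and thread count from the question and timing with omp_get_wtime() rather than clock()):

    #include <stdio.h>
    #include <omp.h>

    int main() {
        long i;
        long x = 0;

        omp_set_num_threads(4);
        double start = omp_get_wtime();  // wall-clock time, unlike clock()

        // each thread accumulates into its own private copy of x;
        // the copies are combined once when the loop ends
        #pragma omp parallel for reduction(+:x)
        for (i = 0; i < 100000000; i++) {
            x++;
        }

        double time_taken = omp_get_wtime() - start;
        printf("%ld %lf\n", x, time_taken);
    }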
    

That's not to say that you never need to design a data structure with false sharing in mind. But it's not as common as you might think.

Upvotes: 2

danielschemmel

Reputation: 11126

Note that the four integers you are operating on lie very close together in memory, probably on the same cache line. Since cache lines are loaded into the CPU cache in one go, each thread needs to ensure that it has the latest version of that cache line. Since all threads want to modify (and not just read) that one cache line, they are constantly invalidating one another's copies. Welcome to false sharing!

To solve this problem, ensure that the integers are (physically) far enough apart from one another, e.g., by allocating structures that fill (at least) one full cache line for each thread to work with.
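
One way to do that (a sketch only; the 64-byte cache line size is an assumption, not something the language or OpenMP guarantees) is a per-thread struct padded to a full cache line:

    #include <stdio.h>
    #include <omp.h>

    // one counter per thread, padded so that each counter
    // occupies its own cache line (64 bytes assumed here)
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];
    };

    int main() {
        struct padded_counter x[4] = {{0}};

        omp_set_num_threads(4);
        double start = omp_get_wtime();

        long i;
        #pragma omp parallel for
        for (i = 0; i < 100000000; i++) {
            x[omp_get_thread_num()].value++;   // each thread touches only its own cache line
        }

        printf("%ld %ld %ld %ld %lf\n",
               x[0].value, x[1].value, x[2].value, x[3].value,
               omp_get_wtime() - start);
    }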

When I executed your sample program using 4 threads on one of my machines, I got the following result:

25000000 25000000 25000000 25000000 5.049694

When I modified the program so that the array has 4096 elements and used elements 0, 1024, 2048 and 3072 (which ensures enough distance), it ran a lot faster:

25000000 25000000 25000000 25000000 1.617231

Note that you are measuring the processor time used by the whole process, not the wall-clock time. Even so, without false sharing that time should not increase significantly as threads are added, but rather stay more or less constant (there is some additional synchronization involved, but it should not usually be on the order of a 10x increase). In fact, the performance boost shown above also translates into wall-clock time (from about 1.25 seconds down to about 500 ms).
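
If you want to observe both clocks at once, a small sketch like the following (not your exact program; it splits the same 100000000 iterations across threads using only private counters) shows the difference: clock() sums the CPU time of all threads in the process and so stays near the serial time, while omp_get_wtime() reports wall-clock time and shrinks as threads are added:

    #include <stdio.h>
    #include <time.h>
    #include <omp.h>

    int main() {
        long total = 0;

        clock_t cpu_start = clock();          // CPU time of the whole process
        double wall_start = omp_get_wtime();  // wall-clock time

        #pragma omp parallel num_threads(4)
        {
            long local = 0;                   // private counter, so no false sharing
            #pragma omp for
            for (long i = 0; i < 100000000; i++) {
                local++;
            }
            #pragma omp atomic
            total += local;                   // one contended write per thread
        }

        double cpu_time  = (double)(clock() - cpu_start) / CLOCKS_PER_SEC;
        double wall_time = omp_get_wtime() - wall_start;

        // with the work split over 4 threads, cpu_time stays near the serial
        // time while wall_time shrinks to roughly a quarter of it
        printf("total=%ld cpu=%lf wall=%lf\n", total, cpu_time, wall_time);
    }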

Upvotes: 2
