ferzle

Reputation: 103

OpenMP poor/confusing performance

Below is Tim Mattson's code from his series of videos on OpenMP. The only change I made was to let the number of threads go up to 24, since I have a 24-core machine. It is not performing nearly as well as it should, and I am baffled as to why (see the results below). Am I missing something here? I should mention that I am a theoretical computer scientist with experience in algorithms, but I am a bit rusty when it comes to hardware.

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main ()
{
  int i;
  double x, pi, sum = 0.0;
  double start_time, run_time;

  step = 1.0/(double) num_steps;
  for (i=1;i<=24;i++){
    sum = 0.0;
    omp_set_num_threads(i);
    start_time = omp_get_wtime();
#pragma omp parallel  
    {
#pragma omp single
      printf(" num_threads = %d",omp_get_num_threads());

#pragma omp for reduction(+:sum)
      for (i=1;i<= num_steps; i++){
          x = (i-0.5)*step;
          sum = sum + 4.0/(1.0+x*x);
      }
    }

    pi = step * sum;
    run_time = omp_get_wtime() - start_time;
    printf("\n pi is %f in %f seconds and %d threads\n",pi,run_time,i);
  }
}

I expected that with 24 cores it would be 20-24 times faster, but it is barely twice as fast. Why?! Here is the output:

 num_threads = 1
 pi is 3.141593 in 1.531695 seconds and 1 threads
 num_threads = 2
 pi is 3.141594 in 1.405237 seconds and 2 threads
 num_threads = 3
 pi is 3.141593 in 1.313049 seconds and 3 threads
 num_threads = 4
 pi is 3.141592 in 1.069563 seconds and 4 threads
 num_threads = 5
 pi is 3.141587 in 1.058272 seconds and 5 threads
 num_threads = 6
 pi is 3.141590 in 1.016013 seconds and 6 threads
 num_threads = 7
 pi is 3.141579 in 1.023723 seconds and 7 threads
 num_threads = 8
 pi is 3.141582 in 0.760994 seconds and 8 threads
 num_threads = 9
 pi is 3.141585 in 0.791577 seconds and 9 threads
 num_threads = 10
 pi is 3.141593 in 0.868043 seconds and 10 threads
 num_threads = 11
 pi is 3.141592 in 0.797610 seconds and 11 threads
 num_threads = 12
 pi is 3.141592 in 0.802422 seconds and 12 threads
 num_threads = 13
 pi is 3.141590 in 0.941856 seconds and 13 threads
 num_threads = 14
 pi is 3.141591 in 0.928252 seconds and 14 threads
 num_threads = 15
 pi is 3.141592 in 0.867834 seconds and 15 threads
 num_threads = 16
 pi is 3.141593 in 0.830614 seconds and 16 threads
 num_threads = 17
 pi is 3.141592 in 0.856769 seconds and 17 threads
 num_threads = 18
 pi is 3.141591 in 0.907325 seconds and 18 threads
 num_threads = 19
 pi is 3.141592 in 0.880962 seconds and 19 threads
 num_threads = 20
 pi is 3.141592 in 0.855475 seconds and 20 threads
 num_threads = 21
 pi is 3.141592 in 0.825202 seconds and 21 threads
 num_threads = 22
 pi is 3.141592 in 0.759689 seconds and 22 threads
 num_threads = 23
 pi is 3.141592 in 0.751121 seconds and 23 threads
 num_threads = 24
 pi is 3.141592 in 0.745476 seconds and 24 threads

So, what am I missing?

Upvotes: 0

Views: 79

Answers (2)

Makoto

Reputation: 106480

In general, with threading, there are two things to consider for speed-up:

  • The size of the task, and whether its parts are actually amenable to parallelization
  • The overhead of parallelization itself (e.g. creating threads, killing threads, etc.)

Amdahl's law gives us some context here. Let's be generous and assume that the portion of the code that benefits from parallelization (p) is 0.5, i.e. half of the code. Plugging in s = 24 for the speedup of that parallel portion (your 24 cores) gives:

S = 1 / ((1 - p) + p/s) = 1 / ((1 - 0.5) + 0.5/24) ≈ 1.92

So in theory, you're getting 1.92 times better performance, which isn't the 24x improvement you were hoping for.
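For intuition, here is a minimal sketch (my illustration, not part of the original answer) that evaluates Amdahl's law for 1 through 24 threads using the same assumed parallel fraction p = 0.5; for your real code you would have to measure that fraction rather than assume it:

#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction p of the work is
   parallelized across s threads and the remaining (1 - p) stays serial. */
static double amdahl_speedup(double p, int s)
{
    return 1.0 / ((1.0 - p) + p / (double) s);
}

int main(void)
{
    const double p = 0.5;  /* assumed parallel fraction */
    int s;
    for (s = 1; s <= 24; s++)
        printf("threads = %2d  predicted speedup = %.2f\n", s, amdahl_speedup(p, s));
    return 0;
}

With p = 0.5 the predicted speedup tops out just below 2 no matter how many threads you add, which is roughly the shape of the timings in the question.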

Some thoughts on this would be to analyze which parts of the code lend themselves better to heavy parallelization. It is also worth profiling the code without threads and comparing that against your current threaded layout.

Upvotes: 0

1201ProgramAlarm

Reputation: 32727

You have one x variable that is shared among all threads.

While the compiler will optimize its use so that you still get the correct result (by keeping the computed value for x in a register), that value is written out to memory every iteration. This will create stalls while cache lines are flushed and reloaded.

The fix is to declare x within the body of the loop where you use it (double x = (i-0.5)*step;), instead of at the top of main.
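As a minimal sketch, the timed region from the question with that one change applied (everything else unchanged) would look something like:

#pragma omp parallel
    {
#pragma omp for reduction(+:sum)
      for (i=1;i<= num_steps; i++){
          double x = (i-0.5)*step;   /* x is now local, so each thread gets its own copy */
          sum = sum + 4.0/(1.0+x*x);
      }
    }

Equivalently, you could keep x declared at the top of main and add private(x) to the parallel pragma; either way, each thread ends up with its own x and the shared write disappears.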

Upvotes: 3
