Dervin Thunk

Reputation: 20129

OpenMP parallel prefix sum speedup

Consider the code below, taken from here. For this code, I get the following execution times:

$ time ./fibomp 40
Number of threads (OpenMP v200805): 2
finonacci(40) = 102334155

real    0m3.193s
user    0m3.180s
sys     0m0.000s

$ export OMP_NUM_THREADS=1
$ time ./fibomp 40
Number of threads (OpenMP v200805): 1
finonacci(40) = 102334155

real    0m3.224s
user    0m3.216s
sys     0m0.000s

So as you can see, there's not much speedup, definitely not the 2x speedup Ruud mentions in his Tue Nov 01, 2011 1:41 am email. I'm running this on a dual-core machine (could that be it?). What am I doing wrong? (BTW, bonus points: what is the ptime command? Some SPARC/Unix command?)

long comp_fib_numbers(int n)
{
  long fnm1, fnm2, fn;
  if ( n == 0 || n == 1 ) return(n);

  // In case the sequence gets too short, execute the serial version
  if ( n < 20 )
  {
     return(comp_fib_numbers(n-1)+comp_fib_numbers(n-2));
  }
  else
  {
     #pragma omp task shared(fnm1)
       fnm1 = comp_fib_numbers(n-1);
     #pragma omp task shared(fnm2)
       fnm2 = comp_fib_numbers(n-2);
     #pragma omp taskwait
       fn = fnm1 + fnm2;
       return(fn);
  }
}

Upvotes: 1

Views: 2987

Answers (1)

Hristo Iliev

Reputation: 74405

First of all, just to be sure: since you state that htop shows only a single core being used, make sure that you have enabled OpenMP support in your compiler. The option is -fopenmp for GCC, -xopenmp for the Sun/Oracle compilers and -openmp for the Intel compilers.
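When OpenMP support is not enabled, the pragmas are simply ignored and the program silently runs serially. As a quick sanity check, here is a minimal sketch (my own, not from the linked code) that verifies at run time that OpenMP is active, assuming GCC built with -fopenmp:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
   /* _OPENMP is only defined when the compiler was invoked with its
      OpenMP switch (e.g. gcc -fopenmp); 200805 corresponds to OpenMP 3.0 */
   printf("OpenMP v%d, max threads = %d\n", _OPENMP, omp_get_max_threads());
#else
   printf("compiled without OpenMP support\n");
#endif
   return 0;
}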

Second of all, n = 20 might be too low a cut-off for the parallel implementation. A shameless plug: see this course material from a workshop on OpenMP that a colleague of mine gave some months ago. Several parallel versions with tasking are discussed there, starting from slide 20.

Third, ptime is a Solaris command. It is not specific to SPARC, since it is also available in the x86 version. Many process-related Solaris commands have the p prefix in their names. Note that in your case time is more likely the built-in that Bash provides rather than the standalone binary.

Fourth, and maybe the real answer to your question: you are missing a parallel region in your code, so the task directives don't create any parallelism at all :) You should rewrite your code as follows:

long comp_fib_numbers(int n)
{
   long fnm1, fnm2, fn;
   if ( n == 0 || n == 1 ) return(n);

   // In case the sequence gets too short, execute the serial version
   if ( n < 20 )
   {
      return(comp_fib_numbers(n-1)+comp_fib_numbers(n-2));
   }
   else
   {
      #pragma omp parallel  // <--- You are missing this one parallel region
      {
         #pragma omp single
         {
            #pragma omp task shared(fnm1)
            fnm1 = comp_fib_numbers(n-1);
            #pragma omp task shared(fnm2)
            fnm2 = comp_fib_numbers(n-2);
         }
         #pragma omp taskwait
      }

      fn = fnm1 + fnm2;
      return(fn);
   }

}

You could make the code even terser by using the if clause to switch off the parallel region:

long comp_fib_numbers(int n)
{
   long fnm1, fnm2, fn;
   if ( n == 0 || n == 1 ) return(n);

   #pragma omp parallel if(n >= 20)
   {
      #pragma omp single
      {
         #pragma omp task shared(fnm1)
         fnm1 = comp_fib_numbers(n-1);
         #pragma omp task shared(fnm2)
         fnm2 = comp_fib_numbers(n-2);
      }
      #pragma omp taskwait
   }

   fn = fnm1 + fnm2;
   return(fn);
}

If n happens to be less than 20, the parallel region executes single-threaded. Since compilers usually outline parallel regions into separate functions, there would still be the overhead of an additional function call even for the inactive region, unless the compiler chooses to produce duplicate code paths. That's why it is recommended that the serial implementation be extracted into a function of its own:

long comp_fib_numbers_serial(int n)
{
   if ( n == 0 || n == 1 ) return(n);

   return (comp_fib_numbers_serial(n-1) + comp_fib_numbers_serial(n-2));
}

long comp_fib_numbers(int n)
{
   long fnm1, fnm2, fn;
   if ( n < 20 ) return comp_fib_numbers_serial(n);

   #pragma omp parallel
   {
      #pragma omp single
      {
         #pragma omp task shared(fnm1)
         fnm1 = comp_fib_numbers(n-1);
         #pragma omp task shared(fnm2)
         fnm2 = comp_fib_numbers(n-2);
      }
      #pragma omp taskwait
   }

   fn = fnm1 + fnm2;
   return(fn);
}

Edit: Now that I've looked at the code that you have linked to, I can see that the call to comp_fib_numbers is embedded in a parallel region, so just disregard my comment about the missing parallel region; I will leave it here for completeness. Try tweaking the value at which the switch between the parallel and the serial version occurs; on modern processors it might be quite high, and the example that you have seen is quite old. Also make sure that no dynamic teams are used, either by setting the environment variable OMP_DYNAMIC to false (or FALSE) or by calling omp_set_dynamic(0); somewhere before the parallel region.
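For example, here is a minimal sketch of how that tuning could be driven from the caller side; the CUTOFF macro and the main() function are my own illustration (assumptions, not part of the linked code):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Hypothetical compile-time knob: rebuild with e.g. -DCUTOFF=25 to retune
#ifndef CUTOFF
#define CUTOFF 20
#endif

long comp_fib_numbers(int n)
{
   long fnm1, fnm2;
   if ( n == 0 || n == 1 ) return(n);

   // Below the cut-off, fall back to the plain serial recursion
   if ( n < CUTOFF )
      return comp_fib_numbers(n-1) + comp_fib_numbers(n-2);

   #pragma omp task shared(fnm1)
   fnm1 = comp_fib_numbers(n-1);
   #pragma omp task shared(fnm2)
   fnm2 = comp_fib_numbers(n-2);
   #pragma omp taskwait

   return fnm1 + fnm2;
}

int main(int argc, char **argv)
{
   int n = (argc > 1) ? atoi(argv[1]) : 40;
   long fn = 0;

   omp_set_dynamic(0);   // same effect as OMP_DYNAMIC=false

   #pragma omp parallel shared(fn)
   {
      // Only one thread creates the top-level tasks; the others execute them
      #pragma omp single
      fn = comp_fib_numbers(n);
   }

   printf("fibonacci(%d) = %ld\n", n, fn);
   return 0;
}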

You haven't stated what your compiler is, but mind that OpenMP 3.0 is supported by GCC since version 4.4, by the Intel compilers since version 11.0, by the Sun/Oracle compilers since version I_dont_know, and not at all by the Visual C/C++ compilers.
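If in doubt, a small compile-time guard (again my own addition, not from the linked code) can refuse to build the tasking code with a compiler that lacks OpenMP 3.0 support:

// Tasking requires OpenMP 3.0, whose _OPENMP date macro is 200805
#if !defined(_OPENMP) || _OPENMP < 200805
#error "This code needs a compiler with OpenMP 3.0 (tasking) support"
#endif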

Observed speedup on a quad-socket Intel Xeon X7350 system (an old pre-Nehalem system with a front-side bus):

$ time OMP_NUM_THREADS=1 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=1 ./fib.x 40  1.86s user 0.00s system 99% cpu 1.866 total
$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40  1.96s user 0.00s system 169% cpu 1.161 total

With the cut-off set to 25 (seems to be the optimal value for the X7350):

$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40  1.95s user 0.00s system 169% cpu 1.153 total

With the cut-off set to 25 and a separate function for the serial implementation:

$ time OMP_NUM_THREADS=2 ./fib.x 40
finonacci(40) = 102334155
OMP_NUM_THREADS=2 ./fib.x 40  1.52s user 0.00s system 171% cpu 0.889 total

Note how the user time decreases by some 400 ms; this comes from the overhead that the separate serial function removes.

These timings were measured with the code from the site that you have linked to. The compiler used was GCC 4.4.6 on a 64-bit Scientific Linux 6.2 system.

Upvotes: 4
