Mobius88

Reputation: 111

No speedup with OpenMP

I am working with OpenMP, trying to obtain an algorithm with near-linear speedup. Unfortunately, I noticed that I could not get the desired speedup.

So, in order to understand the error in my code, I wrote another, simpler program, just to double-check that the speedup was in principle obtainable on my hardware.

This is the toy example I wrote:

#include <omp.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cstring>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>

int main () {
      int number_of_threads = 1;
      int n = 600;
      int m = 50;
      int N = n/number_of_threads;
      int time_limit = 600;
      double total_clock = omp_get_wtime();
      int time_flag = 0;

      #pragma omp parallel num_threads(number_of_threads)
       {
          int thread_id = omp_get_thread_num();
          int iteration_number_local = 0;
          double *C = new double[n]; std::fill(C, C+n, 3.0);
          double *D = new double[n]; std::fill(D, D+n, 3.0);
          double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

          while (time_flag == 0){
                for (int i = 0; i < N; i++)                     
                    for(int z = 0; z < m; z++)
                        for(int x = 0; x < n; x++)
                            for(int c = 0; c < n; c++){
                                CD[c] = C[z]*D[x];
                                C[z] = CD[c] + D[x];
                            }
                iteration_number_local++;
                if ((omp_get_wtime() - total_clock) >= time_limit) 
                    time_flag = 1; 
           }
       #pragma omp critical
       std::cout << "I am " << thread_id << " and I got " << iteration_number_local << " iterations." << std::endl;
       }
    }

I want to highlight again that this code is only a toy example to try to see the speedup: the first for loop becomes shorter as the number of parallel threads increases (since N decreases).

However, when I go from 1 to 2-4 threads, the number of iterations doubles as expected; but this is not the case with 8-10-20 threads: the number of iterations does not increase linearly with the number of threads.

Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?

Results

Running the code above I got the following results.

1 thread: 23 iterations.

20 threads: 397-401 iterations per thread (instead of 420-460).

Upvotes: 2

Views: 2999

Answers (3)

Zulan

Reputation: 22650

Your measurement methodology is wrong, especially for a small number of iterations.

1 thread: 3 iterations.

3 reported iterations actually means that 2 iterations finished in less than 120 s. The third one took longer. The time of 1 iteration is between 40 and 60 s.

2 threads: 5 iterations per thread (instead of 6).

4 iterations finished in less than 120 s. The time of 1 iteration is between 24 and 30 s.

20 threads: 40-44 iterations per thread (instead of 60).

40 iterations finished in less than 120 s. The time of 1 iteration is between 2.9 and 3 s.

As you can see your results actually do not contradict linear speedup.

It would be much simpler and more accurate to execute and time a single outer loop, and you would likely see almost perfect linear speedup.

Some reasons (non-exhaustive) why you might not see linear speedup:

  1. Memory-bound performance. Not the case in your toy example with n = 1000. More generally: contention for a shared resource (main memory, caches, I/O).
  2. Synchronization between threads (e.g. critical sections). Not the case in your toy example.
  3. Load imbalance between threads. Not the case in your toy example.
  4. Turbo mode will use lower frequencies when all cores are utilized. This can happen in your toy example.

From your toy example I would say that your approach to OpenMP can be improved by better using the high-level abstractions, e.g. #pragma omp for.

More general advice would be too broad for this format and would require more specific information about the non-toy example.

Upvotes: 1

Pierre Guilbert

Reputation: 39

You should try

   int iteration_number = 0;
   #pragma omp parallel num_threads(number_of_threads)
   {
      int thread_id = omp_get_thread_num();
      int iteration_number_local = 0;
      double *C = new double[n]; std::fill(C, C+n, 3.0);
      double *D = new double[n]; std::fill(D, D+n, 3.0);
      double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

      while (time_flag == 0){
           #pragma omp for
            for (int i = 0; i < N; i++)
                for(int z = 0; z < m; z++)
                    for(int x = 0; x < n; x++)
                        for(int c = 0; c < n; c++)
                            CD[c] = C[z]*D[x];
            iteration_number_local++;
            if ((omp_get_wtime() - total_clock) >= time_limit)
                time_flag = 1;
       }
       if(thread_id == 0)
        iteration_number = iteration_number_local;
   }
   std::cout << "Iterations = " << iteration_number << std::endl;

Upvotes: 0

Pierre Guilbert

Reputation: 39

You make some declarations inside the parallel region, which means you will allocate the memory and fill it number_of_threads times. Instead I recommend:

double *C = new double[n]; std::fill(C, C+n, 3.0);
double *D = new double[n]; std::fill(D, D+n, 3.0);
double *CD = new double[n]; std::fill(CD, CD+n, 0.0);
#pragma omp parallel firstprivate(C,D,CD) num_threads(number_of_threads)
   {
      int thread_id = omp_get_thread_num();
      int iteration_number_local = 0;  
   } 

Your hardware has a limited number of threads, which depends on the number of cores of your processor. You may have 2 or 4 cores.

A parallel region by itself doesn't speed up your code. With OpenMP you should use #pragma omp parallel for to speed up a for loop, or

#pragma omp parallel
{
  #pragma omp for
  for (...)
  {
  }
}

This notation is equivalent to #pragma omp parallel for. It will use several threads (depending on your hardware) to process the for loop faster. Be careful:

#pragma omp parallel
{
  for (...)
  {
  }
}

will execute the entire for loop once in every thread, which will not speed up your program.

Upvotes: 0
