How do I properly parallelise a for loop using OpenMP?

Question

I am testing OpenMP for C++ because my software will be heavily reliant on speed from processor parallelisation.

I am getting strange results when running the following code.

The speed-up from parallelisation is smaller than I would expect.
Without -O flags, the parallel code runs slower than the sequential code.

I am using the g++ compiler, version 7.3.0 and Ubuntu 18.04 OS on an i5-8600 CPU with 16 GB RAM.

Outputs:

Output 1

Transcript:

.../OpenMPTest$ g++ -O3 -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 2.87415 seconds.
Parallel action took: 0.99954 seconds.

Output 2

.../OpenMPTest$ g++ -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 25.7037 seconds.
Parallel action took: 68.0485 seconds.

As you can see, with 6 processors I'm only getting a ~2.9x speed-up when compiling with -O3. If I omit the -O flags, the parallel version runs much slower than the sequential one, even though it still uses all 6 processors at 100% utilisation (checked with htop).

Why is this? Also, what can I do to achieve the full 6x increase in performance?

Source code:

#include <array>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <omp.h>

int main() {

    using namespace std::chrono;

    const int big_number = 1000000000;
    std::array<double, 6> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };

    // Sequential

    high_resolution_clock::time_point start_linear = high_resolution_clock::now();

    for(int i = 0; i < 6; i++) {
        for(int j = 0; j < big_number; j++) {
            array[i]++;
        }   
    }

    high_resolution_clock::time_point end_linear = high_resolution_clock::now();

    // Parallel 

    high_resolution_clock::time_point start_parallel = high_resolution_clock::now();

    array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < 6; i++) {
            for(int j = 0; j < big_number; j++) {
                array[i]++;
            }   
        }
    }

    high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

    // Stats.

    std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;

    duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
    std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
    std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    return EXIT_SUCCESS;
}

user338371 · Accepted Answer

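It seems your code is affected by false sharing. The six doubles in your array occupy only 48 bytes, so they fit within one or two 64-byte cache lines, and every increment made by one thread invalidates that line for the other threads.

Don't let different threads write to the same cache line. A better approach is to avoid sharing variables between threads altogether.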

#include <array>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <omp.h>

int main() {

  using namespace std::chrono;

  const int big_number = 1000000000;
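  // 6 slots, one per 64-byte cache line (8 doubles each).
  // The element count is reconstructed: array[i*8] below needs at least 41 elements.
  alignas(64) std::array<double, 48> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };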

  // Sequential

  high_resolution_clock::time_point start_linear = high_resolution_clock::now();

  for(int i = 0; i < 6; i++) {
    for(int j = 0; j < big_number; j++) {
      array[i]++;
    }
  }

  high_resolution_clock::time_point end_linear = high_resolution_clock::now();

  // Parallel

  high_resolution_clock::time_point start_parallel = high_resolution_clock::now();

  array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

  #pragma omp parallel
  {
    #pragma omp for
    for(int i = 0; i < 6; i++) {
      for(int j = 0; j < big_number; j++) {
        array[i*8]++;  // stride of 8 doubles (64 bytes): each counter sits on its own cache line
      }
    }
  }

  high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

  // Stats.

  std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;

  duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
  std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;

  time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
  std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;

  return EXIT_SUCCESS;
}

8 processors used.

Linear action took: 26.9021 seconds.

Parallel action took: 6.41319 seconds.

And you can read this.
