Reputation: 820
I'm trying to emplace_back into a vector in a loop with OpenMP. I took my inspiration from this post: C++ OpenMP Parallel For Loop - Alternatives to std::vector. So I wrote a test code:
// Example program
#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <chrono>
#include <cstdint>
#include <omp.h>

int main()
{
    std::cout << "Number of threads available: " << omp_get_max_threads() << std::endl;
    std::random_device dev;
    std::mt19937 gen(dev());
    std::uniform_int_distribution<unsigned> distrib(1, 5);
    {
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        std::vector<std::pair<uint32_t, uint32_t> > result;
        #pragma omp declare reduction (merge : std::vector<std::pair<uint32_t, uint32_t> > : omp_out.insert(omp_out.end(), std::make_move_iterator(omp_in.begin()), std::make_move_iterator(omp_in.end())))
        #pragma omp parallel for reduction(merge: result)
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
            {
                result.emplace_back(distrib(gen), distrib(gen));
            }
        }
        end = std::chrono::system_clock::now();
        auto elapsed_seconds = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
        std::cout << "With OpenMP: " << elapsed_seconds << "ms\n";
    }
    {
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        std::vector<std::pair<uint32_t, uint32_t> > result;
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
            {
                result.emplace_back(distrib(gen), distrib(gen));
            }
        }
        end = std::chrono::system_clock::now();
        auto elapsed_seconds = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
        std::cout << "Without OpenMP: " << elapsed_seconds << "ms\n";
    }
}
I compile this code with
g++ -o main -std=c++17 -fopenmp main.cpp
and the output is:
Number of threads available: 12
With OpenMP: 3982ms
Without OpenMP: 3887ms
Obviously, I don't get any speedup from my OpenMP implementation. Why?
Upvotes: 0
Views: 614
Reputation: 50806
The current code is invalid according to the OpenMP specification: the parallelized loop contains (mostly implicit) cross-iteration dependencies, because all threads read and mutate the same std::mt19937 generator, which is a data race. As a result, an OpenMP implementation is free to generate a fast but completely "wrong" program, or a slow "correct" one.
To get a correct implementation and a not-too-bad speedup using OpenMP, one solution is to replicate the generator/distribution in each worker (by moving the variable declarations into a #pragma omp parallel section) and to protect the (inherently sequential) emplace_back with a critical section (#pragma omp critical).
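A minimal sketch of that first approach, assuming each thread seeds its own generator from std::random_device (one possible choice among others):

#include <cstdint>
#include <random>
#include <utility>
#include <vector>
#include <omp.h>

int main()
{
    std::vector<std::pair<uint32_t, uint32_t> > result;
    #pragma omp parallel
    {
        // Each thread owns its generator/distribution, so there is no data race.
        std::random_device dev;
        std::mt19937 gen(dev());
        std::uniform_int_distribution<unsigned> distrib(1, 5);
        #pragma omp for
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
            {
                // emplace_back on a shared vector is not thread-safe: serialize it.
                #pragma omp critical
                result.emplace_back(distrib(gen), distrib(gen));
            }
        }
    }
}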
Due to possible false sharing and lock contention, the resulting parallel implementation may scale poorly. It is probably better to fill thread-private arrays and then merge the sub-arrays into one big shared vector at the end, rather than using a naive critical section in the hot loop (note, however, that this is still not ideal, since the computation will likely be limited by the bandwidth of the shared memory).
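A sketch of that second approach, with one thread-private vector per thread merged at the end (the nowait clause is optional; it only avoids an extra barrier before the merge):

#include <cstdint>
#include <iterator>
#include <random>
#include <utility>
#include <vector>
#include <omp.h>

int main()
{
    std::vector<std::pair<uint32_t, uint32_t> > result;
    #pragma omp parallel
    {
        std::random_device dev;
        std::mt19937 gen(dev());
        std::uniform_int_distribution<unsigned> distrib(1, 5);
        // Thread-private output vector: no locking in the hot loop.
        std::vector<std::pair<uint32_t, uint32_t> > local;
        #pragma omp for nowait
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
                local.emplace_back(distrib(gen), distrib(gen));
        }
        // Merge each sub-array into the shared vector, one thread at a time.
        #pragma omp critical
        result.insert(result.end(),
                      std::make_move_iterator(local.begin()),
                      std::make_move_iterator(local.end()));
    }
}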
Please note that the result can differ from the sequential implementation when a specific seed needs to be used (here it is not a problem, since the seed is extracted from std::random_device).
Upvotes: 0