Why OpenMP with random number generation is slower than serial code

Question

I'm trying to use OpenMP to add parallelism to my program.

std::random_device rd;
std::mt19937 generator(rd());
std::uniform_real_distribution<float> distribution(-0.5, 0.5);

#pragma omp parallel for
for(int i = 0; i < 100000000; i++)
{
    float x = distribution(generator);
}

I tested the code on Windows (Visual Studio 2010) and Linux (CentOS 6.5, gcc 4.9.1) on 12-core processors, and found that the parallel version was slower than the serial code.

The results are below:

g++ test.cpp -o test -std=c++11 -Ofast
time ./test
real    0m1.234s
user    0m1.229s
sys     0m0.004s

g++ test.cpp -o test -fopenmp -std=c++11 -Ofast
time ./test
real    0m1.708s
user    0m24.218s
sys     0m0.010s

Why is the OpenMP version slower than the serial code?

rubenvb · Accepted Answer

You're using one random number generator across many threads. Every call for a new random number reads and writes the generator's internal state, so the threads constantly fight over the same memory instead of running independently. Note that std::mt19937 does no locking of its own, so calling it from several threads at once is also a data race.

If you profile the code, chances are that most of the execution time is spent waiting on that shared state. This problem is called contention, and your scenario is a textbook example of how to cause it.

If you use std::thread and give each thread a separate rng, you will achieve pretty much 100% parallelization for that part of the code.

Some code to get you started using std::thread is below. Note the use of std::ref: std::thread copies its arguments, so each engine has to be wrapped to be passed by reference.

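#include <array>
  using std::array;
#include <cstddef>
  using std::size_t;
#include <functional>
  using std::ref;
#include <iostream>
  using std::cout;
#include <iterator>
  using std::iterator_traits;
#include <thread>
  using std::thread;
#include <vector>
  using std::vector;
#include <random>
  using mersenne_twister = std::mt19937;

// Returns the array {0, 1, ..., N-1}; used here as simple seed values.
template<class T, T N>
array<T, N> series_of_numbers()
{
  array<T, N> arr;
  for(T i=0; i<N; ++i)
    arr[i] = i;
  return arr;
}

// Fills [begin, end) with random numbers drawn from the given engine.
template<class Iterator, class Engine>
void generate_rng(Iterator begin, Iterator end, Engine& engine)
{
  std::uniform_real_distribution<double> dist;
  for(auto it = begin; it != end; ++it)
    *it = dist(engine);
}

int main()
{
  const size_t amount_of_random_numbers = 1024;
  // Engines
  const size_t Nrng = 4;
  auto seed_values = series_of_numbers<size_t, Nrng>(); // choose other seeds if you wish
  array<mersenne_twister, Nrng> engines;
  for(size_t i=0; i<Nrng; ++i)
    engines[i].seed(seed_values[i]);

  vector<thread> threads;
  vector<double> rngs(amount_of_random_numbers);

  // relevant iterators with offsets
  vector<vector<double>::iterator> begins = { rngs.begin(),
                                              rngs.begin() + amount_of_random_numbers/Nrng,
                                              rngs.begin() + 2*amount_of_random_numbers/Nrng,
                                              rngs.begin() + 3*amount_of_random_numbers/Nrng };

  // thread n fills [begins[n], ends[n]); the four ranges must not overlap
  vector<vector<double>::iterator> ends = { rngs.end() - 3*amount_of_random_numbers/Nrng,
                                            rngs.end() - 2*amount_of_random_numbers/Nrng,
                                            rngs.end() - amount_of_random_numbers/Nrng,
                                            rngs.end() };
  // create threads -> each one starts working as soon as it is created.
  for(size_t n=0; n<Nrng; ++n)
    threads.emplace_back(generate_rng<vector<double>::iterator, mersenne_twister>, begins[n], ends[n], ref(engines[n]));

  // join threads -> wait until all of them have finished.
  for(size_t n=0; n<Nrng; ++n)
    threads[n].join();

  // rngs now holds the generated values
  for(auto number : rngs)
    cout << number << '\n';
}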



Live demo at Coliru. There is also another version where you can change the number of threads to any multiple of 4.
