Reputation: 123
I'm trying use OpenMP to add parallelism to my program.
std::random_device rd;
std::mt19937 generator(rd());
std::uniform_real_distribution<float> distribution(-0.5, 0.5);
#pragma omp parallel for
for(int i = 0; i < 100000000; i++)
{
float x = distribution(generator);
}
I tested the code on windows (Visual Studio 2010) and linux (Centos 6.5, gcc 4.9.1) on 12 core processors, and found the parallel version was slower than the serial code.
The results as below:
g++ test.cpp -o test -std=c++11 -Ofast
time ./test
real 0m1.234s
user 0m1.229s
sys 0m0.004s
g++ test.cpp -o test -fopenmp -std=c++11 -Ofast
time ./test
real 0m1.708s
user 0m24.218s
sys 0m0.010s
Why is the the OpenMP is slower than the serial code?
Upvotes: 0
Views: 1738
Reputation: 76529
You're using one random number generator across many threads. Each call for a new random number will block all the other parallel calls until it is finished.
If you would profile the code, chances are all (or most) of the execution time is spent in some form of mutex locking/unlocking. This problem is called contention and your scenario would be a schoolbook example of how to cause it.
If you use std::thread
s and give each thread a separate rng, you will achieve pretty much 100% parallelization for that part of the code.
Some code to get you started using std::thread
below. Note the use of std::ref
#include <array>
using std::array;
#include <cstddef>
using std::size_t;
#include <functional>
using std::ref;
#include <iostream>
using std::cout;
#include <iterator>
using std::iterator_traits;
#include <thread>
using std::thread;
#include <vector>
using std::vector;
#include <random>
using mersenne_twister = std::mt19937;
template<class T, T N>
array<T, N> series_of_numbers()
{
array<T, N> arr;
for(T i=0; i<N; ++i)
arr[i] = i;
return arr;
}
template<class Iterator, class Engine>
void generate_rng(Iterator begin, Iterator end, Engine& engine)
{
std::uniform_real_distribution<double> dist;
for(auto it = begin; it != end; ++it)
*it = dist(engine);
}
int main()
{
const size_t amount_of_random_numbers = 1024;
// Engines
const size_t Nrng = 4;
auto seed_values = series_of_numbers<size_t, Nrng>(); // choose other seeds if you wish
array<mersenne_twister, Nrng> engines;
for(size_t i=0; i<Nrng; ++i)
engines[i].seed(seed_values[i]);
vector<thread> threads;
vector<double> rngs(amount_of_random_numbers);
// relevant iterators with offsets
vector<vector<double>::iterator> begins = { rngs.begin(),
rngs.begin() + amount_of_random_numbers/Nrng,
rngs.begin() + 2*amount_of_random_numbers/Nrng,
rngs.begin() + 3*amount_of_random_numbers/Nrng };
vector<vector<double>::iterator> ends = { rngs.end(),
rngs.end() - 3*amount_of_random_numbers/Nrng,
rngs.end() - 2*amount_of_random_numbers/Nrng,
rngs.end() - amount_of_random_numbers/Nrng };
// create threads
for(size_t n=0; n<Nrng; ++n)
threads.emplace_back(thread(generate_rng<decltype(begins[n]), mersenne_twister>, ref(begins[n]), ref(ends[n]), ref(engines[n])));
// join threads -> this is where the work will be done.
for(size_t n=0; n<Nrng; ++n)
threads[n].join();
// rngs is filled with magical values!
for(auto number : rngs)
std::cout << number << '\n';
}
Live demo at Coliru. And another version where you can actually change the number of threads to any multiple of 4
Upvotes: 3