Reputation: 820
I'm trying to emplace_back into a vector in a loop with OpenMP. I took my inspiration from this post: C++ OpenMP Parallel For Loop - Alternatives to std::vector. So I wrote a test code:
// Example program
#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <chrono>
#include <cstdint>
#include <omp.h>

int main()
{
    std::cout << "Number of threads available: " << omp_get_max_threads() << std::endl;
    std::random_device dev;
    std::mt19937 gen(dev());
    std::uniform_int_distribution<unsigned> distrib(1, 5);
    {
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        std::vector<std::pair<uint32_t, uint32_t> > result;
        #pragma omp declare reduction (merge : std::vector<std::pair<uint32_t, uint32_t> > : omp_out.insert(omp_out.end(), std::make_move_iterator(omp_in.begin()), std::make_move_iterator(omp_in.end())))
        #pragma omp parallel for reduction(merge: result)
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
            {
                result.emplace_back(distrib(gen), distrib(gen));
            }
        }
        end = std::chrono::system_clock::now();
        auto elapsed_seconds = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
        std::cout << "With OpenMP: " << elapsed_seconds << "ms\n";
    }
    {
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        std::vector<std::pair<uint32_t, uint32_t> > result;
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
            {
                result.emplace_back(distrib(gen), distrib(gen));
            }
        }
        end = std::chrono::system_clock::now();
        auto elapsed_seconds = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
        std::cout << "Without OpenMP: " << elapsed_seconds << "ms\n";
    }
}
I compile this code with
g++ -o main -std=c++17 -fopenmp main.cpp
and the output is:
Number of threads available: 12
With OpenMP: 3982ms
Without OpenMP: 3887ms
Obviously, I don't get any speedup from my OpenMP implementation. Why?
Upvotes: 0
Views: 614
Reputation: 50806
The current code is invalid according to the OpenMP specification: the parallelized loop contains (mostly implicit) cross-iteration dependencies, because all threads read and mutate the same std::mt19937 generator, which is a data race. As a result, an OpenMP implementation is free to generate a fast but completely "wrong" program, or a slow "correct" one.
To get a correct implementation and a not-too-bad speedup using OpenMP, one solution is to replicate the generator/distribution in each worker (by moving the variable declarations into a #pragma omp parallel section) and to protect the (inherently sequential) emplace_back with a critical section (#pragma omp critical).
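A minimal sketch of that first approach, assuming each thread seeds its own generator from std::random_device (one possible choice among others):

#include <cstdint>
#include <random>
#include <utility>
#include <vector>
#include <omp.h>

int main()
{
    std::vector<std::pair<uint32_t, uint32_t> > result;
    #pragma omp parallel
    {
        // Each thread owns its generator/distribution, so there is no data race.
        std::random_device dev;
        std::mt19937 gen(dev());
        std::uniform_int_distribution<unsigned> distrib(1, 5);
        #pragma omp for
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
            {
                // emplace_back on a shared vector is not thread-safe: serialize it.
                #pragma omp critical
                result.emplace_back(distrib(gen), distrib(gen));
            }
        }
    }
}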
Due to possible false sharing and lock contention, the resulting parallel implementation may scale poorly. It is probably better to fill thread-private arrays and then merge the sub-arrays into one big shared vector at the end, rather than using a naive critical section in the hot loop (note, however, that this is still not ideal, since the computation will likely be limited by the bandwidth of the shared memory).
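A sketch of that second approach, with one thread-private vector per thread merged at the end (the nowait clause is optional; it only avoids an extra barrier before the merge):

#include <cstdint>
#include <iterator>
#include <random>
#include <utility>
#include <vector>
#include <omp.h>

int main()
{
    std::vector<std::pair<uint32_t, uint32_t> > result;
    #pragma omp parallel
    {
        std::random_device dev;
        std::mt19937 gen(dev());
        std::uniform_int_distribution<unsigned> distrib(1, 5);
        // Thread-private output vector: no locking in the hot loop.
        std::vector<std::pair<uint32_t, uint32_t> > local;
        #pragma omp for nowait
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
                local.emplace_back(distrib(gen), distrib(gen));
        }
        // Merge each sub-array into the shared vector, one thread at a time.
        #pragma omp critical
        result.insert(result.end(),
                      std::make_move_iterator(local.begin()),
                      std::make_move_iterator(local.end()));
    }
}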
Please note that the result can differ from the sequential implementation when a specific seed needs to be used (here it is not a problem, since the seed is extracted from std::random_device).
Upvotes: 0