Reputation: 2095
Consider this code:
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <execution>
#include <functional>
#include <random>
#include <vector>
using namespace std;
using namespace std::chrono;
constexpr size_t NUM_OF_ELEMENTS = 30000000;
// execute lambda and print the execution time
void measure(function<void()> lambda)
{
auto start = high_resolution_clock::now();
lambda();
auto end = high_resolution_clock::now();
printf("%ld\n", duration_cast<microseconds>(end - start).count());
}
int main()
{
random_device rd;
mt19937_64 gen(rd());
// range from INT_MIN to INT_MAX
uniform_int_distribution<> distr(-2147483648, 2147483647);
vector<int> original;
original.reserve(NUM_OF_ELEMENTS);
for(size_t i = 0; i < NUM_OF_ELEMENTS; i++)
original.push_back(distr(gen));
vector<int> the_copy(original.begin(), original.end());
// sort with single thread
measure([&]{ sort(original.begin(), original.end()); });
// sort with execution::par
measure([&]{ sort(execution::par, the_copy.begin(), the_copy.end()); });
return 0;
}
The code can be summed up in a few points:
std::execution::par
and measure the time of executionThe execution::par
version always takes longer. Doesn't matter what value NUM_OF_ELEMENTS
has. I tried values from 100 000 to 30 000 000 incrementing by 100 000. The code above produces similar results like this (values in microseconds):
9729406 // single thread
10834613 // execution::par
I compiled the code on Windows 10 with gcc using VS Code:
g++ -std=c++17 -g ${workspaceFolder}/main.cpp -o ${workspaceFolder}/main.exe
For C++ standard library I use mingw distribution which can be found here.
Program versions: GCC 10.1.0 + LLVM/Clang/LLD/LLDB 10.0.0 + MinGW-w64 7.0.0
My processor has 6 cores and at the time of execution I didn't run any major programs or background processes.
First, I thought it has something to do with the size of the vector but 30 000 000 elements is surely enough. It already runs for 10 seconds before finishing a single test.
execution::par
meant to be used like this?Upvotes: 3
Views: 1421
Reputation: 1707
running perf on your code, it looks like it spends a tiny bit longer trying to partition the data.
This is just one example, but i ran it several times and consistently the parallel version was taking longer to partition the data at multiple levels of the sort. Since its recursive, its hard to get an exact picture of how much extra overhead it end up adding overall.
sort1 is the non-parallel sort.
sort2 is the parallel sort.
That aside, the answer to the underlying question is that you need intel thread building blocks installed in order for gcc to use anything other than the serial algorithms.
This can be installed quick simply on linux with sudo apt install libtbb-dev
and then you link against it with -ltbb
Upvotes: 2