Why does sorting a vector with execution::par take longer than normal sort (gcc 10.1.0)?

Question

Consider this code:

#include 
#include 
#include 
#include 
#include 
#include 
#include 
using namespace std;
using namespace std::chrono;

constexpr size_t NUM_OF_ELEMENTS = 30000000;

// execute lambda and print the execution time
void measure(function lambda)
{
    auto start = high_resolution_clock::now();
    lambda();
    auto end = high_resolution_clock::now();
    
    printf("%ld
", duration_cast(end - start).count());
}

int main()
{
    random_device rd;
    mt19937_64 gen(rd());
    // range from INT_MIN to INT_MAX
    uniform_int_distribution<> distr(-2147483648, 2147483647);
    
    vector original;
    original.reserve(NUM_OF_ELEMENTS);
    
    for(size_t i = 0; i < NUM_OF_ELEMENTS; i++)
        original.push_back(distr(gen));
    
    vector the_copy(original.begin(), original.end());
    
    // sort with single thread
    measure([&]{ sort(original.begin(), original.end()); });
    
    // sort with execution::par
    measure([&]{ sort(execution::par, the_copy.begin(), the_copy.end()); });
    return 0;
}

The code can be summed up in a few points:

create random number generator
create vector of random integers
create a copy of that vector
sort original vector with one thread and measure the time of execution
sort the copy with std::execution::par and measure the time of execution
print the execution times

The execution::par version always takes longer. Doesn't matter what value NUM_OF_ELEMENTS has. I tried values from 100 000 to 30 000 000 incrementing by 100 000. The code above produces similar results like this (values in microseconds):

9729406    // single thread
10834613   // execution::par

I compiled the code on Windows 10 with gcc using VS Code: g++ -std=c++17 -g ${workspaceFolder}/main.cpp -o ${workspaceFolder}/main.exe

For C++ standard library I use mingw distribution which can be found here. Program versions: GCC 10.1.0 + LLVM/Clang/LLD/LLDB 10.0.0 + MinGW-w64 7.0.0

My processor has 6 cores and at the time of execution I didn't run any major programs or background processes.

First, I thought it has something to do with the size of the vector but 30 000 000 elements is surely enough. It already runs for 10 seconds before finishing a single test.

What is going on?
Is execution::par meant to be used like this?
Do I have to enable some compilation flags or some other trick to make it work as expected?

Taekahn · Accepted Answer

running perf on your code, it looks like it spends a tiny bit longer trying to partition the data.

This is just one example, but i ran it several times and consistently the parallel version was taking longer to partition the data at multiple levels of the sort. Since its recursive, its hard to get an exact picture of how much extra overhead it end up adding overall.

sort1 is the non-parallel sort.
sort2 is the parallel sort.

That aside, the answer to the underlying question is that you need intel thread building blocks installed in order for gcc to use anything other than the serial algorithms.
This can be installed quick simply on linux with sudo apt install libtbb-dev and then you link against it with -ltbb

Why does sorting a vector with execution::par take longer than normal sort (gcc 10.1.0)?

Answers (1)

Related Questions