Derek Jones

Reputation: 41

Why does creating a C++ vector in an OpenMP parallel for loop cause it to slow down significantly?

I am trying to initialize a std::vector inside a parallel for loop using OpenMP, but it takes significantly more time than even a serial loop would.

Here is an example of what I am trying:

#include <iostream>
#include <vector>
#include <omp.h>
#include <chrono>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();

    //processor supports 24 threads
//#pragma omp parallel for
    for (int i = 0; i < 24; ++i) {
        
        for (int j = 0; j < 100; ++j) {
            for (int k = 0; k < 100; ++k) {
                for (int l = 0; l < 100; ++l) {
                    int result = i + j + k + l;

                    std::vector<int> resultsOver200 = {};
                    if (result > 200) {
                        resultsOver200.push_back(result);
                    }
                    //do something else with result
                }
            }
        }
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;

    std::cout << "Execution time: " << duration.count() << " seconds";
}

Without utilizing OpenMP I get this result:

Execution time: 4.60552 seconds

If I uncomment the '#pragma omp parallel for' and run it, the result is:

Execution time: 38.7252 seconds

This is a very simplified version of what I am actually trying to accomplish, but it gets the point across. Does anyone know why this is happening, or have a way for me to use a dynamic array whose element count I don't know at compile time while still taking advantage of multithreading?

Also, as a note: I definitely want to create the new vector within the innermost for loop.

Upvotes: 3

Views: 178

Answers (3)

Ahmed AEK

Reputation: 17840

The global allocator is usually not designed for fast concurrent allocations. Allocators that are, called scalable allocators (like TBB's scalable allocator, jemalloc, and many others), typically trade memory usage for performance.

In this specific case you could allocate the vector's buffer on the stack, spilling over to the heap only once it fills up, using std::pmr::monotonic_buffer_resource and std::pmr::vector:

#include <memory_resource> // for std::pmr

char buff[4096]; // create a stack buffer

// use the stack buffer until it is full, then spill over to the heap
std::pmr::monotonic_buffer_resource resource{buff, std::size(buff)};

std::pmr::vector<int> resultsOver200{&resource};

This makes allocations faster even for a single thread, and you get perfect multithreaded scaling as long as the allocations fit in the stack buffer.

The downside is that to move the data out of this vector into a non-pmr one you need to do an element-wise copy or move, as this vector's memory lives on the stack.
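Putting this together with the question's loop nest, a minimal sketch could look like the following (the 4096-byte buffer is an arbitrary size; pick one that covers your expected worst case):

#include <memory_resource>
#include <omp.h>
#include <vector>

int main()
{
#pragma omp parallel for
    for (int i = 0; i < 24; ++i) {
        for (int j = 0; j < 100; ++j) {
            for (int k = 0; k < 100; ++k) {
                for (int l = 0; l < 100; ++l) {
                    int result = i + j + k + l;

                    // a fresh vector every iteration, but its memory comes
                    // from the stack buffer instead of the global allocator
                    char buff[4096];
                    std::pmr::monotonic_buffer_resource resource{buff, std::size(buff)};
                    std::pmr::vector<int> resultsOver200{&resource};

                    if (result > 200) {
                        resultsOver200.push_back(result);
                    }
                    //do something else with result
                }
            }
        }
    }
}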

Upvotes: 2

Victor Eijkhout

Reputation: 5810

While push_back is as efficient as possible (read up on "amortized complexity"), it is still very slow. You could reserve your vector to some upper bound, which speeds up the push_back considerably.
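For instance (the bound of 100 is a placeholder; use whatever upper limit your real data implies):

std::vector<int> resultsOver200;
resultsOver200.reserve(100); // capacity set once; push_back no longer reallocates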

This does not entirely explain why the parallel version would be slower than the sequential one, though. Could it be that the parallel one overflows your memory and you're swapping to disk?

But really, I would try to reformulate the algorithm. Do you actually need the vector, or is there a way to do the resulting computation without storing it?

Upvotes: 0

Jérôme Richard

Reputation: 50806

Allocations generally do not scale. In fact, contention on the default allocator can make the code significantly slower. Indeed, allocators typically use locks and atomics, so when many threads fight for the same shared resources, contention happens, and this contention is very expensive: it not only serializes the code but also makes it slower due to higher latency (e.g. cache-line bouncing, NUMA effects, etc.).

To reduce this overhead, you can use an allocator better suited to parallel code, one using thread-local pools (e.g. jemalloc). No allocator scales perfectly though, and in this case most will not scale, so this alone is certainly not enough.

One efficient solution is to recycle the vector in order to reduce the number of allocations. Allocations are expensive even in sequential code, so you should avoid them as much as possible in hot loops. You could use custom allocators for that (see the answer of @AhmedAEK), or simply move the initialization of resultsOver200 outside the whole loop nest and call resultsOver200.resize(0) before using it in the inner-most loop (it will not be reallocated unless the capacity is not large enough). Be careful though: the vector's capacity can stay huge, so it might be good to periodically force the capacity back down so as not to use too much RAM (e.g. by calling resultsOver200.shrink_to_fit() in the i-based or j-based loop), as sketched below.
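A minimal sketch of this recycling scheme applied to the question's loops (the capacity threshold of 100000 is an arbitrary illustration):

#pragma omp parallel
{
    // hoisted out of the loop nest: one vector per thread, recycled
    std::vector<int> resultsOver200;

#pragma omp for
    for (int i = 0; i < 24; ++i) {
        for (int j = 0; j < 100; ++j) {
            for (int k = 0; k < 100; ++k) {
                for (int l = 0; l < 100; ++l) {
                    int result = i + j + k + l;

                    resultsOver200.resize(0); // empty it, but keep its capacity
                    if (result > 200) {
                        resultsOver200.push_back(result);
                    }
                    //do something else with result
                }
            }
        }
        // periodically release memory if the capacity grew too large
        if (resultsOver200.capacity() > 100000) {
            resultsOver200.shrink_to_fit();
        }
    }
}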

By the way, you should use steady_clock instead of high_resolution_clock for measuring the wall-clock time.
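For instance:

auto start = std::chrono::steady_clock::now();
// ... code to measure ...
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> duration = end - start;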

Upvotes: 7
