User800222

Reputation: 361

C++ multithreading performance slower than single threaded code

I'm learning to use threads in C++.
I created a very long vector of integers and set another integer x, and I want to calculate the difference between x and each integer in the vector.

However, in my implementation, the function using two threads is slower than the single-threaded function. I wonder what the reason is, and how I can implement threading correctly so it actually runs faster.

Here's the code:

#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <math.h>

using namespace std;


vector<int> vector_generator(int size) {
    vector<int> temp;
    for (int i = 0; i < size; i++) {
        temp.push_back(i);
    }
    return temp;
}

vector<int> dist_calculation(int center, vector<int> &input, int start, int end) {
    vector<int> temp;
    for (int i = start; i < end; i++) {
        temp.push_back(abs(center - input[i]));
    }
    return temp;
}


void multi_dist_calculation(int center, vector<int> &input) {
    int mid = input.size() / 2;

    vector<int> temp1(input.begin(), input.begin() + mid);
    vector<int> temp2(input.begin()+mid, input.end());

    auto future1 = async(dist_calculation, center, temp1, 0, mid);
    auto future2 = async(dist_calculation, center, temp2, 0, mid);

    vector<int> result1 = future1.get();
    vector<int> result2 = future2.get();

    return;
}


int main() {

    vector<int> v1 = vector_generator(1000000000);
    vector<int> result;
    multi_dist_calculation(0, v1);
    //dist_calculation(50, v1, 0, v1.size());

    return 0;
}



Update #1

I added the suggested std::launch::async and reserve(), and the code is faster now. But the two-threaded function is still slower than the single-threaded one. Can I conclude that for this kind of calculation, single-threaded is faster?

#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <math.h>

using namespace std;


vector<int> vector_generator(int size) {
    vector<int> temp;
    temp.reserve(size);
    for (int i = 0; i < size; i++) {
        temp.push_back(i);
    }
    return temp;
}

vector<int> dist_calculation(int center, vector<int> &input, int start, int end) {
    vector<int> temp;
    temp.reserve(end - start);
    for (int i = start; i < end; i++) {
        temp.push_back(abs(center - input[i]));
    }
    return temp;
}


void multi_dist_calculation(int center, vector<int> &input) {
    int mid = input.size() / 2;

    auto future1 = async(std::launch::async, dist_calculation, center, input,   0, mid);
    auto future2 = async(std::launch::async, dist_calculation, center, input, mid, input.size());

    vector<int> result1 = future1.get();
    vector<int> result2 = future2.get();

    return;
}


int main() {

    vector<int> v1 = vector_generator(1000000000);
    vector<int> result;
    int center = 0;
    multi_dist_calculation(center, v1);
    //dist_calculation(center, v1, 0, v1.size());

    return 0;
}

Upvotes: 4

Views: 1844

Answers (2)

Michael Kenzel

Reputation: 15933

If I understand this correctly, then your singlethreaded version simply calls dist_calculation on the given vector, which will run through the vector once and produce a new vector which it returns.

Your multithreaded version, on the other hand, first copies each half of the input data to two separate vectors. After that, it launches dist_calculation for each half, potentially in two threads. std::async may not even run using a separate thread because you didn't specify a launch policy. If it happens to run using a separate thread, the vectors you pass will be copied once more due to the way the std::thread constructor works. You'd have to use, e.g., an std::reference_wrapper to properly pass a reference.

Think about what's going on there. Copying the data to these two vectors means that you're already running through the entire data once before you even get to the point where you'd be launching some threads to perform the actual computation, if threads are even going to be used to begin with. And if they are, you're copying the data a second time. Both times, the copy happens in a single thread. So just to get to the point where your actual multithreaded computation starts (if it does actually happen using multiple threads), your "multithreaded" version has basically had to do three times the work that your singlethreaded version will do in total. And it had to do all of that in a single thread…

Your dist_calculation does not modify the original data. Furthermore, the individual threads are going to access completely separate parts of the data. Thus, there is no need to ever copy the data at all. Simply give each thread a pointer/iterator to the part of the original data it should process. Also, you know beforehand how many elements are going to be produced. So rather than continuously growing two separate output vectors, you could just preallocate a vector to receive the output and give each thread yet another pointer to the subrange of the output vector it should write to.
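A minimal sketch of that approach (hypothetical helper names, not the questioner's code): each std::thread gets a raw pointer into the original input and a pointer into a preallocated output vector, so nothing is ever copied.

```cpp
#include <cstdlib>
#include <thread>
#include <vector>

// Each thread writes |center - x| for its part of the input directly into
// a preallocated slice of the output; no data is copied anywhere.
void dist_into(int center, const int* in, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::abs(center - in[i]);
}

std::vector<int> multi_dist(int center, const std::vector<int>& input) {
    std::vector<int> output(input.size());  // preallocate the whole result once
    const std::size_t mid = input.size() / 2;
    std::thread t1(dist_into, center, input.data(), output.data(), mid);
    std::thread t2(dist_into, center, input.data() + mid,
                   output.data() + mid, input.size() - mid);
    t1.join();
    t2.join();
    return output;
}
```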

And finally, starting with C++17, you could just use the parallel versions of std::transform with std::execution::par_unseq and get yourself an automagically parallelized (potentially even vectorized) solution. For example:

#include <algorithm>
#include <cstdlib>
#include <execution>
#include <memory>
#include <numeric>
#include <utility>

template <typename ExecutionPolicy, typename ForwardIterator, typename OutputIterator>
void multi_dist_calculation(OutputIterator dest, ForwardIterator begin, ForwardIterator end, int center, ExecutionPolicy&& execution_policy)
{
    std::transform(std::forward<ExecutionPolicy>(execution_policy), begin, end, dest, [center](int x)
    {
        return std::abs(center - x);
    });
}

int main()
{
    constexpr int data_size = 1000000000;
    auto data = std::unique_ptr<int[]> { new int[data_size] };
    std::iota(&data[0], &data[0] + data_size, 0);

    auto result = std::unique_ptr<int[]> { new int[data_size] };
    multi_dist_calculation(&result[0], &data[0], &data[0] + data_size, 0, std::execution::par_unseq);  // multithreaded
    // multi_dist_calculation(&result[0], &data[0], &data[0] + data_size, 0, std::execution::seq);  // singlethreaded

    return 0;
}

Upvotes: 2

Fire Lancer

Reputation: 30105

You did not pass any std::launch policy to std::async, so it leaves the implementation a lot of freedom.

Behaves as if (2) is called with policy being std::launch::async | std::launch::deferred. In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.

But also be aware that more generally, using more threads, especially on small tasks may not be faster.

  • When dist_calculation, or any task you want to thread, is only a small amount of work, be aware of the overheads. Creating a new thread has a relatively high cost, and there is also overhead for whatever internal pool std::async uses, promises, and futures.
  • Additionally, as written, you likely create more vectors, with more dynamic memory allocation, and you will need to merge the results, which also has some cost.
  • In more complex cases, if synchronization, e.g. with std::mutex, is involved, that may cost more performance than the additional threads gain.
  • In some cases, the bottleneck will not be the CPU. For example, it may be disk/storage speed (including for the page/swap file), network speed (including remote servers), or even memory bandwidth (NUMA-aware optimizations are an exception, but they are a lot more complex than just using std::async). Multithreading in these cases will just add overhead, but no benefit.
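A sketch of the launch-policy point applied to the questioner's setup (function and variable names are illustrative): pass std::launch::async explicitly to force real threads, and hand the input to std::async via std::cref so the vector is not decay-copied into each task.

```cpp
#include <cstdlib>
#include <functional>
#include <future>
#include <vector>

std::vector<int> dist_calc(int center, const std::vector<int>& input,
                           std::size_t start, std::size_t end) {
    std::vector<int> out;
    out.reserve(end - start);  // avoid reallocations while growing
    for (std::size_t i = start; i < end; ++i)
        out.push_back(std::abs(center - input[i]));
    return out;
}

std::vector<int> multi_dist_async(int center, const std::vector<int>& input) {
    const std::size_t mid = input.size() / 2;
    // std::launch::async forces a thread; std::cref avoids copying the vector.
    auto f1 = std::async(std::launch::async, dist_calc, center,
                         std::cref(input), std::size_t{0}, mid);
    auto f2 = std::async(std::launch::async, dist_calc, center,
                         std::cref(input), mid, input.size());
    std::vector<int> result = f1.get();
    std::vector<int> back = f2.get();
    result.insert(result.end(), back.begin(), back.end());
    return result;
}
```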

You should make use of other basic optimisations where possible first, such as reserving the size of the vectors to avoid unneeded allocations and copies, or maybe resizing and using vector[index] = a instead of push_back, etc.
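For example, a sketch of the resize-and-index variant (hypothetical function name): sizing the output once and writing by index removes both reallocation and the per-element push_back bookkeeping, which also makes the loop easier for the compiler to vectorise.

```cpp
#include <cstdlib>
#include <vector>

std::vector<int> dist_indexed(int center, const std::vector<int>& input) {
    std::vector<int> out(input.size());  // one allocation up front
    for (std::size_t i = 0; i < input.size(); ++i)
        out[i] = std::abs(center - input[i]);  // plain indexed store, no push_back
    return out;
}
```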

For something as simple as abs(center - input[i]) you might get a lot more improvement from SIMD (single instruction, multiple data) optimisations. E.g. make sure you are compiling with optimisations such as SSE2 enabled, and if the compiler doesn't vectorise the loop appropriately (I think the push_back may interfere, test!), alter it slightly so that it does, or maybe even use the vector instructions explicitly (for x86, check out _mm_add_epi32 etc.).
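A sketch of the explicit-SIMD idea, using only baseline SSE2 intrinsics (subtract, then a branchless sign-mask abs) with a scalar fallback for non-SSE2 builds and tail elements; this is an illustration, not a tuned implementation.

```cpp
#include <cstdlib>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

std::vector<int> dist_simd(int center, const std::vector<int>& input) {
    std::vector<int> out(input.size());
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i c = _mm_set1_epi32(center);
    for (; i + 4 <= input.size(); i += 4) {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&input[i]));
        __m128i d = _mm_sub_epi32(c, x);
        // Branchless abs: sign is 0 for non-negative lanes, ~0 for negative,
        // so (d ^ sign) - sign negates exactly the negative lanes.
        __m128i sign = _mm_srai_epi32(d, 31);
        __m128i absd = _mm_sub_epi32(_mm_xor_si128(d, sign), sign);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&out[i]), absd);
    }
#endif
    for (; i < input.size(); ++i)  // scalar tail (and non-SSE2 fallback)
        out[i] = std::abs(center - input[i]);
    return out;
}
```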

Upvotes: 6
