User800222

Reputation: 361

C++ multithreading performance slower than single threaded code

I'm learning to use threads in C++.
I created a very long vector of integers and set another integer x, and I want to calculate the difference between x and each integer in the vector.

However, in my implementation, the function using two threads is slower than the single-threaded function. I wonder what the reason is, and how I can implement threading correctly so it actually runs faster.

Here's the code:

#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <math.h>

using namespace std;


vector<int> vector_generator(int size) {
    vector<int> temp;
    for (int i = 0; i < size; i++) {
        temp.push_back(i);
    }
    return temp;
}

vector<int> dist_calculation(int center, vector<int> &input, int start, int end) {
    vector<int> temp;
    for (int i = start; i < end; i++) {
        temp.push_back(abs(center - input[i]));
    }
    return temp;
}


void multi_dist_calculation(int center, vector<int> &input) {
    int mid = input.size() / 2;

    vector<int> temp1(input.begin(), input.begin() + mid);
    vector<int> temp2(input.begin()+mid, input.end());

    auto future1 = async(dist_calculation, center, temp1, 0, mid);
    auto future2 = async(dist_calculation, center, temp2, 0, mid);

    vector<int> result1 = future1.get();
    vector<int> result2 = future2.get();

    return;
}


int main() {

    vector<int> v1 = vector_generator(1000000000);
    vector<int> result;
    multi_dist_calculation(0, v1);
    //dist_calculation(50, v1, 0, v1.size());

    return 0;
}



Update #1

I added the suggested std::launch::async and reserve(), and the code is faster now. But the two-threaded function is still slower than the single-threaded one. Can I conclude that for this kind of calculation, single-threaded is faster?

#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <math.h>

using namespace std;


vector<int> vector_generator(int size) {
    vector<int> temp;
    temp.reserve(size);
    for (int i = 0; i < size; i++) {
        temp.push_back(i);
    }
    return temp;
}

vector<int> dist_calculation(int center, vector<int> &input, int start, int end) {
    vector<int> temp;
    temp.reserve(end - start);
    for (int i = start; i < end; i++) {
        temp.push_back(abs(center - input[i]));
    }
    return temp;
}


void multi_dist_calculation(int center, vector<int> &input) {
    int mid = input.size() / 2;

    auto future1 = async(std::launch::async, dist_calculation, center, input,   0, mid);
    auto future2 = async(std::launch::async, dist_calculation, center, input, mid, input.size());

    vector<int> result1 = future1.get();
    vector<int> result2 = future2.get();

    return;
}


int main() {

    vector<int> v1 = vector_generator(1000000000);
    vector<int> result;
    int center = 0;
    multi_dist_calculation(center, v1);
    //dist_calculation(center, v1, 0, v1.size());

    return 0;
}

Upvotes: 4

Views: 1844

Answers (2)

Michael Kenzel

Reputation: 15933

If I understand this correctly, then your singlethreaded version simply calls dist_calculation on the given vector, which will run through the vector once and produce a new vector which it returns.

Your multithreaded version, on the other hand, first copies each half of the input data to two separate vectors. After that, it launches dist_calculation for each half, potentially in two threads. std::async may not even run using a separate thread because you didn't specify a launch policy. If it happens to run using a separate thread, the vectors you pass will be copied once more due to the way the std::thread constructor works. You'd have to use, e.g., an std::reference_wrapper to properly pass a reference.

Think about what's going on there. Copying the data to these two vectors means that you're already running through the entire data once before you even get to the point where you'd be launching some threads to perform the actual computation, if threads are even going to be used to begin with. And if they are, you're copying the data a second time. Both times, the copy happens in a single thread. So just to get to the point where your actual multithreaded computation starts (if it does actually happen using multiple threads), your "multithreaded" version has basically had to do three times the work that your singlethreaded version will do in total. And it had to do all of that in a single thread…

Your dist_calculation does not modify the original data. Furthermore, the individual threads are going to access completely separate parts of the data. Thus, there is no need to ever copy the data at all. Simply give each thread a pointer/iterator to the part of the original data it should process. Also, you know beforehand how many elements are going to be produced. So rather than continuously growing two separate output vectors, you could just preallocate a vector to receive the output and give each thread yet another pointer to the subrange of the output vector it should write to.
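A minimal sketch of that approach (hypothetical helper names, not the questioner's code): each std::thread gets a raw pointer into the original input and a pointer into a preallocated output vector, so nothing is ever copied.

```cpp
#include <cstdlib>
#include <thread>
#include <vector>

// Each thread writes |center - x| for its part of the input directly into
// a preallocated slice of the output; no data is copied anywhere.
void dist_into(int center, const int* in, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::abs(center - in[i]);
}

std::vector<int> multi_dist(int center, const std::vector<int>& input) {
    std::vector<int> output(input.size());  // preallocate the whole result once
    const std::size_t mid = input.size() / 2;
    std::thread t1(dist_into, center, input.data(), output.data(), mid);
    std::thread t2(dist_into, center, input.data() + mid,
                   output.data() + mid, input.size() - mid);
    t1.join();
    t2.join();
    return output;
}
```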

And finally, starting with C++17, you could just use the parallel versions of std::transform with std::execution::par_unseq and get yourself an automagically parallelized (potentially even vectorized) solution. For example:

#include <algorithm>
#include <cstdlib>
#include <execution>
#include <memory>
#include <numeric>
#include <utility>

template <typename ExecutionPolicy, typename ForwardIterator, typename OutputIterator>
void multi_dist_calculation(OutputIterator dest, ForwardIterator begin, ForwardIterator end, int center, ExecutionPolicy&& execution_policy)
{
    std::transform(std::forward<ExecutionPolicy>(execution_policy), begin, end, dest, [center](int x)
    {
        return std::abs(center - x);
    });
}

int main()
{
    constexpr int data_size = 1000000000;
    auto data = std::unique_ptr<int[]> { new int[data_size] };
    std::iota(&data[0], &data[0] + data_size, 0);

    auto result = std::unique_ptr<int[]> { new int[data_size] };
    multi_dist_calculation(&result[0], &data[0], &data[0] + data_size, 0, std::execution::par_unseq);  // multithreaded
    // multi_dist_calculation(&result[0], &data[0], &data[0] + data_size, 0, std::execution::seq);  // singlethreaded

    return 0;
}

Upvotes: 2

Fire Lancer

Reputation: 30105

You did not pass any std::launch policy to std::async, so it leaves the implementation a lot of freedom.

Behaves as if (2) is called with policy being std::launch::async | std::launch::deferred. In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.

But also be aware that more generally, using more threads, especially on small tasks may not be faster.

  • When dist_calculation, or any task you want to thread, is only a small amount of work, be aware of the overheads. Creating a new thread has a relatively high cost, and there is also overhead for whatever internal pool std::async uses, promises, and futures.
  • Additionally, as written, you likely create more vectors, with more dynamic memory allocation, and you will need to merge the results, which also has some cost.
  • In more complex cases, if synchronization, e.g. with std::mutex, is involved, that may cost more performance than the additional threads gain.
  • In some cases, the bottleneck will not be the CPU. For example, it may be disk/storage speed (including for the page/swap file), network speed (including remote servers), or even memory bandwidth (NUMA-aware optimizations are an exception, but they are a lot more complex than just using std::async). Multithreading in these cases will just add overhead, but no benefit.
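A sketch of the launch-policy point applied to the questioner's setup (function and variable names are illustrative): pass std::launch::async explicitly to force real threads, and hand the input to std::async via std::cref so the vector is not decay-copied into each task.

```cpp
#include <cstdlib>
#include <functional>
#include <future>
#include <vector>

std::vector<int> dist_calc(int center, const std::vector<int>& input,
                           std::size_t start, std::size_t end) {
    std::vector<int> out;
    out.reserve(end - start);  // avoid reallocations while growing
    for (std::size_t i = start; i < end; ++i)
        out.push_back(std::abs(center - input[i]));
    return out;
}

std::vector<int> multi_dist_async(int center, const std::vector<int>& input) {
    const std::size_t mid = input.size() / 2;
    // std::launch::async forces a thread; std::cref avoids copying the vector.
    auto f1 = std::async(std::launch::async, dist_calc, center,
                         std::cref(input), std::size_t{0}, mid);
    auto f2 = std::async(std::launch::async, dist_calc, center,
                         std::cref(input), mid, input.size());
    std::vector<int> result = f1.get();
    std::vector<int> back = f2.get();
    result.insert(result.end(), back.begin(), back.end());
    return result;
}
```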

You should make use of other basic optimisations where possible first, such as reserving the size of the vectors to avoid unneeded allocations and copies, or maybe resizing and using vector[index] = a instead of push_back, etc.
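For example, a sketch of the resize-and-index variant (hypothetical function name): sizing the output once and writing by index removes both reallocation and the per-element push_back bookkeeping, which also makes the loop easier for the compiler to vectorise.

```cpp
#include <cstdlib>
#include <vector>

std::vector<int> dist_indexed(int center, const std::vector<int>& input) {
    std::vector<int> out(input.size());  // one allocation up front
    for (std::size_t i = 0; i < input.size(); ++i)
        out[i] = std::abs(center - input[i]);  // plain indexed store, no push_back
    return out;
}
```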

For something as simple as abs(center - input[i]) you might get a lot more improvement from SIMD (single instruction, multiple data) optimisations. E.g. make sure you are compiling with optimisations such as SSE2 enabled, and if the compiler doesn't vectorise the loop appropriately (I think the push_back may interfere, test!), alter it slightly so that it does, or maybe even use the vector instructions explicitly (for x86, check out _mm_add_epi32 etc.).
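A sketch of the explicit-SIMD idea, using only baseline SSE2 intrinsics (subtract, then a branchless sign-mask abs) with a scalar fallback for non-SSE2 builds and tail elements; this is an illustration, not a tuned implementation.

```cpp
#include <cstdlib>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

std::vector<int> dist_simd(int center, const std::vector<int>& input) {
    std::vector<int> out(input.size());
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i c = _mm_set1_epi32(center);
    for (; i + 4 <= input.size(); i += 4) {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&input[i]));
        __m128i d = _mm_sub_epi32(c, x);
        // Branchless abs: sign is 0 for non-negative lanes, ~0 for negative,
        // so (d ^ sign) - sign negates exactly the negative lanes.
        __m128i sign = _mm_srai_epi32(d, 31);
        __m128i absd = _mm_sub_epi32(_mm_xor_si128(d, sign), sign);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&out[i]), absd);
    }
#endif
    for (; i < input.size(); ++i)  // scalar tail (and non-SSE2 fallback)
        out[i] = std::abs(center - input[i]);
    return out;
}
```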

Upvotes: 6
