Reputation: 361
I'm learning to use threads in C++.
I created a very long vector of integers and another integer x, and I want to calculate the difference between x and each integer in the vector.
However, in my implementation, the function using two threads is slower than the single-threaded function. I wonder what the reason is, and how I can implement threading correctly so it actually runs faster.
Here's the code:
#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <math.h>
using namespace std;

vector<int> vector_generator(int size) {
    vector<int> temp;
    for (int i = 0; i < size; i++) {
        temp.push_back(i);
    }
    return temp;
}

vector<int> dist_calculation(int center, vector<int> &input, int start, int end) {
    vector<int> temp;
    for (int i = start; i < end; i++) {
        temp.push_back(abs(center - input[i]));
    }
    return temp;
}

void multi_dist_calculation(int center, vector<int> &input) {
    int mid = input.size() / 2;
    vector<int> temp1(input.begin(), input.begin() + mid);
    vector<int> temp2(input.begin() + mid, input.end());
    auto future1 = async(dist_calculation, center, temp1, 0, mid);
    auto future2 = async(dist_calculation, center, temp2, 0, mid);
    vector<int> result1 = future1.get();
    vector<int> result2 = future2.get();
    return;
}

int main() {
    vector<int> v1 = vector_generator(1000000000);
    vector<int> result;
    multi_dist_calculation(0, v1);
    //dist_calculation(50, v1, 0, v1.size());
    return 0;
}
Update #1
I added the suggested std::launch::async and reserve(), and it does make the code faster. But the two-threaded function is still slower than the single-threaded one. Can I conclude that for this kind of calculation, single-threaded is faster?
#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <math.h>
using namespace std;

vector<int> vector_generator(int size) {
    vector<int> temp;
    temp.reserve(size);
    for (int i = 0; i < size; i++) {
        temp.push_back(i);
    }
    return temp;
}

vector<int> dist_calculation(int center, vector<int> &input, int start, int end) {
    vector<int> temp;
    temp.reserve(end - start);
    for (int i = start; i < end; i++) {
        temp.push_back(abs(center - input[i]));
    }
    return temp;
}

void multi_dist_calculation(int center, vector<int> &input) {
    int mid = input.size() / 2;
    auto future1 = async(std::launch::async, dist_calculation, center, input, 0, mid);
    auto future2 = async(std::launch::async, dist_calculation, center, input, mid, input.size());
    vector<int> result1 = future1.get();
    vector<int> result2 = future2.get();
    return;
}

int main() {
    vector<int> v1 = vector_generator(1000000000);
    vector<int> result;
    int center = 0;
    multi_dist_calculation(center, v1);
    //dist_calculation(center, v1, 0, v1.size());
    return 0;
}
Upvotes: 4
Views: 1844
Reputation: 15933
If I understand this correctly, your single-threaded version simply calls dist_calculation on the given vector, which runs through the vector once and produces a new vector that it returns.

Your multithreaded version, on the other hand, first copies each half of the input data into two separate vectors. After that, it launches dist_calculation for each half, potentially in two threads. std::async may not even run using a separate thread, because you didn't specify a launch policy. And if it does happen to run using a separate thread, the vectors you pass will be copied once more, due to the way the std::thread constructor works; you'd have to use, e.g., an std::reference_wrapper to properly pass a reference.

Think about what's going on there. Copying the data into those two vectors means you're already running through the entire data once before you even get to the point of launching threads to perform the actual computation, if threads are even going to be used to begin with. And if they are, you're copying the data a second time. Both times, the copy happens in a single thread. So just to get to the point where your actual multithreaded computation starts, if it does actually happen using multiple threads, your "multithreaded" version has basically had to do three times the work that your single-threaded version will do in total. And it had to do all of that in a single thread…
Your dist_calculation does not modify the original data. Furthermore, the individual threads are going to access completely separate parts of the data. Thus, there is no need to ever copy the data at all. Simply give each thread a pointer/iterator to the part of the original data it should process. Also, you know beforehand how many elements are going to be produced. So rather than continuously growing two separate output vectors, you could just preallocate a vector to receive the output and give each thread yet another pointer to the subrange of the output vector it should write to.
And finally, starting with C++17, you can just use the parallel version of std::transform with std::execution::par_unseq and get yourself an automagically parallelized (potentially even vectorized) solution. For example:
#include <algorithm>
#include <cstdlib>
#include <execution>
#include <memory>
#include <numeric>
#include <utility>

template <typename ExecutionPolicy, typename ForwardIterator, typename OutputIterator>
void multi_dist_calculation(OutputIterator dest, ForwardIterator begin, ForwardIterator end,
                            int center, ExecutionPolicy&& execution_policy)
{
    std::transform(std::forward<ExecutionPolicy>(execution_policy), begin, end, dest,
                   [center](int x)
                   {
                       return std::abs(center - x);
                   });
}

int main()
{
    constexpr int data_size = 1000000000;
    auto data = std::unique_ptr<int[]> { new int[data_size] };
    std::iota(&data[0], &data[0] + data_size, 0);
    auto result = std::unique_ptr<int[]> { new int[data_size] };
    multi_dist_calculation(&result[0], &data[0], &data[0] + data_size, 0, std::execution::par_unseq);  // multithreaded
    // multi_dist_calculation(&result[0], &data[0], &data[0] + data_size, 0, std::execution::seq);     // singlethreaded
    return 0;
}
Upvotes: 2
Reputation: 30105
You did not pass any std::launch policy to std::async, so it leaves the implementation a lot of freedom:

"Behaves as if (2) is called with policy being std::launch::async | std::launch::deferred. In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value."
But also be aware that, more generally, using more threads, especially for small tasks, may not be faster. If dist_calculation or any task you want to thread is a small amount of work, be aware of the overheads:

- Creating a new thread has a relatively high cost, and there is also overhead for whatever internal pool std::async uses, promises, and futures.
- If synchronisation such as std::mutex is involved, it may cost more performance than the additional threads gain.
- Some resources are shared and limited (e.g. memory bandwidth, or whatever is internal to std::async). Multithreading in these cases will just add overheads, but no benefit.

You should make use of other basic optimisations where possible first, such as reserve-ing the size of the vectors to avoid unneeded allocations and copies, or maybe resize-ing and using vector[index] = a instead of push_back, etc.
For something as simple as abs(center - input[i]) you might get a lot more improvement from SIMD (single instruction, multiple data) optimisations. E.g. make sure you are compiling with optimisations and an instruction set such as SSE2 enabled, and if the compiler doesn't vectorise the loop appropriately (I think the push_back may interfere; test!), alter it slightly so it does, or maybe even use the vector instructions explicitly (for x86, check out _mm_add_epi32 etc.).
Upvotes: 6