Reputation: 91
I am trying to implement a parallel foreach loop for std::vector that runs the computation on an optimal number of threads (number of cores minus 1 for the main thread). However, my implementation does not seem to be fast enough: it actually runs 6 times slower than the single-threaded one!
Thread instantiation is often blamed as a bottleneck, so I tried a larger vector, but that did not seem to help.
I am currently stuck watching the parallel algorithm execute in 13000-20000 microseconds in a separate thread, while the single-threaded one executes in 120-200 microseconds in the main thread, and I cannot figure out what I am doing wrong. Of those 13-20 ms, 8 or 9 are usually spent creating threads. Still, I can see no reason for std::for_each running through 1/3 of the vector in a separate thread to take several times longer than another std::for_each needs to iterate through the whole vector.
#include <iostream>
#include <vector>
#include <thread>
#include <algorithm>
#include <chrono>
#include <cstdint>     // uint32_t
#include <cstdio>      // getchar
#include <functional>  // std::function
const unsigned int numCores = std::thread::hardware_concurrency();
const size_t numUse = numCores - 1;
struct foreach
{
    inline static void go(std::function<void(uint32_t&)>&& func, std::vector<uint32_t>& cont)
    {
        std::vector<std::thread> vec;
        vec.reserve(numUse);
        std::vector<std::vector<uint32_t>::iterator> arr(numUse + 1);
        size_t distance = cont.size() / numUse;
        for (size_t i = 0; i < numUse; i++)
            arr[i] = cont.begin() + i * distance;
        arr[numUse] = cont.end();
        for (size_t i = 0; i < numUse - 1; i++)
        {
            vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
        }
        vec.emplace_back([&] { std::for_each(cont.begin() + (numUse - 1) * distance, cont.end(), func); });
        for (auto &d : vec)
        {
            d.join();
        }
    }
};
int main()
{
    std::chrono::steady_clock clock;
    std::vector<uint32_t> numbers;
    for (size_t i = 0; i < 50000000; i++)
        numbers.push_back(i);
    std::chrono::steady_clock::time_point t0m = clock.now();
    std::for_each(numbers.begin(), numbers.end(), [](uint32_t& value) { ++value; });
    std::chrono::steady_clock::time_point t1m = clock.now();
    std::cout << "Single-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1m - t0m).count() << "mcs\n";
    std::chrono::steady_clock::time_point t0s = clock.now();
    foreach::go([](uint32_t& i) { ++i; }, numbers);
    std::chrono::steady_clock::time_point t1s = clock.now();
    std::cout << "Multi-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1s - t0s).count() << "mcs\n";
    getchar();
}
Is there a way I can optimize this and increase the performance?
The compiler I am using is the one that ships with Visual Studio 2017. The configuration is Release x86. I have also been advised to use a profiler and am currently figuring out how to use one.
I actually managed to get the parallel code to run faster than the regular one; however, this required a vector of tens of thousands of five-element vectors. If anyone has advice on how to improve performance, or knows where I can find a better implementation to study its structure, that would be appreciated.
Upvotes: 1
Views: 702
Reputation: 1692
Thank you for providing some example code.
Getting good metrics (especially on parallel code) can be pretty tricky. Your metrics are tainted.
Use high_resolution_clock instead of steady_clock for profiling.
Another issue, related to having good metrics, is that your "work" is just incrementing an integer. Is that really representative of the work your program is going to be doing? Increment is really fast. If you look at the assembly generated by your sequential version, everything gets inlined into a really short loop.
Lambdas have a very good chance of being inlined. But in your go function, you're casting the lambda to std::function. std::function has a very poor chance of being inlined.
So if you want to keep the chance of getting the lambda inlined, you have to do some template tricks:
template <typename FUNC>
inline static void go(FUNC&& func, std::vector<uint32_t>& cont)
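To make the difference concrete, here is a minimal sketch contrasting the two signatures. The names apply_erased and apply_generic are made up for this illustration and the threading is left out on purpose, so only the way the callable is passed differs. Whether the compiler actually inlines the lambda depends on the compiler and its optimization settings.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Type-erased version (hypothetical name): every element is processed
// through std::function's indirect call, which is hard to inline.
inline void apply_erased(const std::function<void(std::uint32_t&)>& func,
                         std::vector<std::uint32_t>& cont)
{
    std::for_each(cont.begin(), cont.end(), func);
}

// Templated version (hypothetical name): the lambda keeps its own
// closure type, so the compiler can see its body inside the loop.
template <typename Func>
inline void apply_generic(Func&& func, std::vector<std::uint32_t>& cont)
{
    std::for_each(cont.begin(), cont.end(), func);
}
```

Both functions produce the same result; only the generated code is likely to differ.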
By manually inlining your code (I moved the contents of the go function to main) and doing step 2 above, I was able to get the parallel version (4 threads on a hyperthreaded dual-core) to run in about 75% of the time. That's not particularly good scaling, but it's not bad considering that the original was already pretty fast. For a further optimization, I would use SIMD aka "vector" (different from std::vector except in the sense that they both relate to arrays) operations, which will apply the increment to multiple array elements in one iteration.
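As a rough idea of what the SIMD version could look like, here is a sketch using SSE2 intrinsics, which process four 32-bit integers per iteration. The function name increment_simd and the HAVE_SSE2 macro are made up for this example, and the preprocessor check is only a best-effort guess at detecting SSE2; on other targets the code falls back to a plain scalar loop. An optimizing compiler may already auto-vectorize the simple loop, so measure before assuming this helps.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Best-effort SSE2 detection (an assumption, adjust for your toolchain).
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// Hypothetical helper: adds 1 to every element of v.
void increment_simd(std::vector<std::uint32_t>& v)
{
    std::size_t i = 0;
#ifdef HAVE_SSE2
    const __m128i one = _mm_set1_epi32(1);
    // Process four elements at a time with unaligned loads and stores.
    for (; i + 4 <= v.size(); i += 4)
    {
        __m128i* p = reinterpret_cast<__m128i*>(&v[i]);
        const __m128i x = _mm_loadu_si128(p);
        _mm_storeu_si128(p, _mm_add_epi32(x, one));
    }
#endif
    // Scalar tail (or the whole array when SSE2 is unavailable).
    for (; i < v.size(); ++i)
        ++v[i];
}
```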
You have a race condition here:
for (size_t i = 0; i < numUse - 1; i++)
{
    vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
}
Because you set the default lambda capture to capture-by-reference, i is captured by reference, so a thread may read it after the loop has already changed it. That could cause some threads to process the wrong range or too long a range. You could do this: [&, i], but why risk shooting yourself in the foot again? Scott Meyers recommends against using default capture modes. Just do [&cont, &distance, &func, i].
UPDATE:
I think it's a fine idea to move your foreach to its own space. I think what you should do is separate the thread creation from task dispatch. That means you need some kind of signaling system (generally condition variables). You could look into thread pools.
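To give an idea of what separating thread creation from task dispatch can look like, here is a deliberately small thread pool sketch built on a mutex and two condition variables. The class name ThreadPool and its submit/wait interface are made up for this illustration; it is not production quality (no exception handling, and with zero worker threads wait() would block forever).

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: threads are created once and then reused.
class ThreadPool
{
public:
    explicit ThreadPool(std::size_t numThreads)
    {
        for (std::size_t i = 0; i < numThreads; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_)
            w.join();
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Queue a task; an idle worker picks it up.
    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_.notify_one();
    }

    // Block until every task submitted so far has finished.
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return;  // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;  // signals workers: new task or shutdown
    std::condition_variable idle_;  // signals wait(): a task has finished
    std::queue<std::function<void()>> tasks_;
    std::size_t pending_ = 0;       // queued plus currently running tasks
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

With something like this, the pool would be constructed once (outside the timed region), and each parallel foreach call would only submit one task per chunk and then call wait().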
An easy way to add thread pools is to use OpenMP, which Visual Studio 2017 supports (OpenMP 2.0). A caveat is that there's no guarantee that the threads won't be created/destroyed on entry to/exit from the parallel section (it's implementation dependent). So it trades some performance for ease of use.
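For reference, the increment loop written with OpenMP could look roughly like this. The function name increment_all_omp is made up for the example. OpenMP 2.0 requires a signed integer loop index, so the sketch assumes the vector size fits in an int. OpenMP has to be enabled in the build (/openmp for MSVC, -fopenmp for GCC and Clang); without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: increments every element using an OpenMP
// parallel loop. Iterations are independent, so no synchronization
// is needed inside the loop body.
void increment_all_omp(std::vector<std::uint32_t>& cont)
{
    // Assumption: cont.size() fits in an int (true for 50,000,000).
    const int n = static_cast<int>(cont.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        ++cont[static_cast<std::size_t>(i)];
}
```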
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the standard algorithms do. https://en.cppreference.com/w/cpp/algorithm/for_each
As for std::function, you can use it; you just don't want your basic operation (the one that will be called 50,000,000 times) to be a std::function.
Bad:
void go(std::function<...>& func)
{
    std::thread t(std::for_each(v.begin(), v.end(), func));
    ...
}
...
go([](int& i) { ++i; });
Good:
void go(std::function<...>& func)
{
    std::thread t(func);
    ...
}
...
go([&v]() { std::for_each(v.begin(), v.end(), [](int& i) { ++i; }); });
In the good version, the short inner lambda (i.e. ++i) gets inlined in the call to for_each. That's important because it gets called 50 million times. The call to the bigger lambda is not inlined (because it's converted to std::function), but that's OK because it only gets called once per thread.
Upvotes: 2