Reputation: 4793
I use std::for_each
with std::execution::par
to perform a complex computation on a huge input represented as a vector of structures. The computation doesn't involve any hardware-related waits (network or disk I/O, for example); it is "just CPU" computation. To me it seems logical that there is no sense in creating more OS threads than we have hardware threads; however, Visual C++ 2019 creates on average 50 threads, and sometimes up to 500, even though there are only 12 hardware threads.
Is there a way to limit the parallel thread count to hardware_concurrency
with std::for_each
and std::execution::par
, or is the only way to get a reasonable thread count to write custom code with std::thread
?
Upvotes: 18
Views: 5221
Reputation: 11
Regarding "To me it seems logical that there is no sense in creating more OS threads than we have hardware threads" - this is not always true: due to the intensive use of pipelining at the CPU level there are a lot of dead cycles, and this idle time can be used by another thread to do some processing. So, no - most algorithms that are not very cache-intensive can use significantly more threads than the number of hardware threads and still gain performance. Nevertheless, the optimal multiplier depends on the algorithm and the particular CPU; you need to benchmark it individually.
For example, in one of my implementations of a sieving algorithm, the sieving phase was very cache-intensive, so there it made a lot of sense to limit the thread count to the native cores to stay within the L1 cache. On the other hand, the counting phase was mostly arithmetic, and by experimenting I determined that the optimal number of threads in that phase was 300 on a 12-core 7900X CPU.
Upvotes: 1
Reputation: 7591
Is it possible to limit the thread count for C++17 parallel
for_each
?
No, at least not in C++17.
However, there is a proposal for executors
in an upcoming standard, which basically gives you the ability to influence the execution context (in terms of location and time) for the high-level STL algorithm interface:
thread_pool pool{ std::thread::hardware_concurrency() };
auto exec = pool.executor();
std::for_each(std::execution::par.on(exec), begin(data), end(data), some_operation);
Until then, you have to either trust your compiler vendor to know what is best for overall performance, as e.g. the developers of Visual Studio state:
Scheduling in our implementation is handled by the Windows system thread pool. The thread pool takes advantage of information not available to the standard library, such as what other threads on the system are doing, what kernel resources threads are waiting for, and similar. It chooses when to create more threads, and when to terminate them. It’s also shared with other system components, including those not using C++.
The other option would be to give up on relying solely on the standard library and use an STL implementation that already features the new proposal.
Upvotes: 14