I am trying to multithread the p2 loop, the outermost of the three inner loops in the nested C++ loops below. I don't want to multithread the outer loops because they work on shared data.
// OUTER LOOP
for ( int l = 0; l < NT; l++ ) {
  for ( int k = 0; k < NZ; k++ ) {
    for ( int j = 0; j < NY; j++ ) {
      for ( int i = 0; i < NX; i++ ) {
        // define p0Start/p0End, p1Start/p1End, p2Start/p2End
        // ...
        // INNER LOOP (a 16x16x16 loop and it takes 20us on a single thread)
        for ( int p2 = p2Start; p2 < p2End; p2++ ) {
          for ( int p1 = p1Start; p1 < p1End; p1++ ) {
            for ( int p0 = p0Start; p0 < p0End; p0++ ) {
              // performing inner loop computation...
              std::this_thread::sleep_for(std::chrono::microseconds(20));
            }
          }
        }
      }
    }
  }
}
I have tried bshoshany/thread-pool:
BS::thread_pool pool;
BS::multi_future<void> loop_future = pool.submit_loop(p2Start, p2End, [&](int p2) {
  for ( int p1 = p1Start; p1 < p1End; p1++ ) {
    for ( int p0 = p0Start; p0 < p0End; p0++ ) {
      // performing inner loop computation...
      std::this_thread::sleep_for(std::chrono::microseconds(20));
    }
  }
});
// wait for all blocks before the by-reference captures go out of scope
loop_future.wait();
I have also tried Intel TBB's parallel_for (code below). Both approaches add a large overhead: the inner loop ends up taking significantly longer than the 20us it takes on a single thread, so nothing is gained.
tbb::parallel_for(p2Start, p2End, [&](int p2) {
  for ( int p1 = p1Start; p1 < p1End; p1++ ) {
    for ( int p0 = p0Start; p0 < p0End; p0++ ) {
      // performing inner loop computation...
      std::this_thread::sleep_for(std::chrono::microseconds(20));
    }
  }
});
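To put a number on the overhead described above, here is a minimal timing sketch that uses only the standard library. It is not the thread-pool or TBB code: it starts plain std::thread workers on every call, so thread start-up and join cost is deliberately included in the measurement. The names work, inner_range, inner_serial, inner_threaded and time_us are hypothetical, and work is only a cheap arithmetic stand-in for the real inner-loop computation, with all loop bounds assumed to be [0, n).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real inner-loop computation:
// maps (p0, p1, p2) to a unique index so the total is easy to check.
inline std::uint64_t work(int p0, int p1, int p2) {
    return static_cast<std::uint64_t>(p0)
         + 16u * static_cast<std::uint64_t>(p1)
         + 256u * static_cast<std::uint64_t>(p2);
}

// Runs the p2 loop over [p2_begin, p2_end) with p1 and p0 in [0, n).
std::uint64_t inner_range(int p2_begin, int p2_end, int n) {
    std::uint64_t sum = 0;
    for (int p2 = p2_begin; p2 < p2_end; ++p2)
        for (int p1 = 0; p1 < n; ++p1)
            for (int p0 = 0; p0 < n; ++p0)
                sum += work(p0, p1, p2);
    return sum;
}

// Single-threaded baseline.
std::uint64_t inner_serial(int n) { return inner_range(0, n, n); }

// Splits the p2 range into num_threads contiguous blocks, one thread each.
// Threads are created and joined on every call (no pool).
std::uint64_t inner_threaded(int n, int num_threads) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> partial(static_cast<std::size_t>(num_threads), 0);
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(num_threads));
    for (int t = 0; t < num_threads; ++t) {
        const int begin = n * t / num_threads;
        const int end = n * (t + 1) / num_threads;
        threads.emplace_back([&partial, t, begin, end, n] {
            partial[static_cast<std::size_t>(t)] = inner_range(begin, end, n);
        });
    }
    for (auto& th : threads) th.join();
    std::uint64_t sum = 0;
    for (std::uint64_t v : partial) sum += v;
    return sum;
}

// Keeps results observable so the compiler cannot drop the timed calls.
volatile std::uint64_t sink = 0;

// Average wall-clock time of f() in microseconds over `repeats` calls.
template <class F>
double time_us(F&& f, int repeats) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) sink = f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / repeats;
}
```

Calling time_us([] { return inner_serial(16); }, 1000) and time_us([] { return inner_threaded(16, 4); }, 1000) from main gives the per-call cost of each variant. The gap between the two is roughly the fixed price of distributing a 16x16x16 loop across threads on that machine, which is the figure to compare against the 20us budget.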
Is there a way to use multiple threads efficiently for this task, with minimal overhead, so that I actually get a speedup?