I'm using concurrency::parallel_for() in a Windows app (Visual Studio 2017) to divide up some work in a loop. Everything worked just fine, but I was concerned about locking performance, so I tried various methods: std::mutex, the Windows CRITICAL_SECTION, etc.

Then I tried concurrency::critical_section. The documentation makes it sound like it should be faster, since it is aware of the Concurrency Runtime.

Nope. Not only is it not faster, in some cases it's downright dangerous. At first it was just blowing up my application; in the debugger I could see concurrency just creating infinite threads. When I changed the partitioner from the default to the static one, things started working again, but everything was much, much slower than with the Windows CRITICAL_SECTION or even std::mutex.
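The question doesn't show how the lock comparisons were timed, so the harness below is only a hedged sketch of one plausible setup (the function name and iteration count are assumptions). It uses std::mutex so it builds anywhere; on Windows the same template could be instantiated with concurrency::critical_section (which also exposes lock()/unlock()) or with a small RAII wrapper around CRITICAL_SECTION:

```cpp
// Hypothetical micro-benchmark sketch for comparing lock types.
#include <chrono>
#include <mutex>

// Times `iterations` lock/unlock round trips around a trivial increment,
// mirroring the lock()/unlock() shape used in the lambda below. Returns
// elapsed microseconds. Works with any type exposing lock()/unlock().
template <typename Lock>
long long timeLockedIncrements(Lock& lock, long long iterations, long long& counter)
{
    const auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < iterations; ++i) {
        lock.lock();
        ++counter;
        lock.unlock();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

On Windows you would run the same template once per lock type and compare the returned timings; a single-threaded run like this measures only the uncontended cost, so a real comparison would also run it from several threads at once.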
I'm wondering if someone can explain to me either of the following:

1. Why does using concurrency::critical_section in my lambda (below) with the default partitioner make concurrency create infinite threads?
2. Why (with the static_partitioner) is the parallel_for() so much slower with concurrency::critical_section than with other locking mechanisms?
3. Should I just avoid concurrency::critical_section?

Here is my code:
#include <ppl.h>

void nonlinearReconstruction(const std::vector<Image>& window,
                             const Rect& rect,
                             Image& normals)
{
    concurrency::critical_section mtx;

    // This lambda uses the critical section "mtx" to control access to the
    // shared image data in "normals". It reads pixels, does math on them,
    // and then sets other pixels. (cols, totalix, and ntemp come from setup
    // and math code omitted here.)
    const auto op =
        [&normals, cols, rect, &window, &mtx](int linearix)
        {
            // Determine which indices to use.
            const auto r = static_cast<int>(linearix / cols);
            const auto c = static_cast<int>(linearix % cols);
            const auto r0 = r + rect.top();
            const auto c0 = c + rect.left();
            const auto c1 = std::max(c + rect.left() - 1, 0);
            const auto r1 = r0;
            const auto r2 = std::max(r + rect.top() - 1, 0);
            const auto c2 = c + rect.left();

            // Lock the critical section to read shared pixels in "normals".
            mtx.lock();
            const auto ninit = normals.getpel(r0, c0).asArray();
            const auto npx = normals.getpel(r1, c1).asArray();
            const auto npy = normals.getpel(r2, c2).asArray();
            mtx.unlock();

            // Do heavy-duty math on these pixels. I've left out the code, but
            // no locking of any kind is done -- just math on local data.
            // ... blah blah blah

            // Lock again to set the corrected pixel in shared memory.
            mtx.lock();
            normals.setpel(
                r + rect.top(),
                c + rect.left(),
                NormalVector(ntemp[0], ntemp[1], ntemp[2]));
            mtx.unlock();
        };

    // Now call the parallel_for loop with the lambda above.
    // This version causes infinite thread creation:
    concurrency::parallel_for(0, (int)totalix, op);

    // This version works but performs much more slowly with
    // concurrency::critical_section than with std::mutex or the
    // Windows CRITICAL_SECTION:
    // concurrency::parallel_for(0, (int)totalix, op, concurrency::static_partitioner());
}
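For comparison, here is a portable, simplified sketch of the same lock-read / compute / lock-write pattern using std::mutex and a fixed pool of std::thread (roughly what the static partitioner provides). The flat-vector "image", the averaging math, and all names here are stand-ins, not the Image/Rect types from the code above:

```cpp
// Hypothetical portable repro of the locking pattern from the question,
// built on std::mutex/std::thread instead of the Concurrency Runtime.
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// Toy "image": a flat vector of doubles indexed as row * cols + col.
void reconstructWithStdMutex(std::vector<double>& normals, int rows, int cols)
{
    std::mutex mtx;
    const int total = rows * cols;

    // Same shape as the lambda above: lock to read, compute unlocked,
    // lock again to write.
    auto op = [&](int linearix) {
        const int r = linearix / cols;
        const int c = linearix % cols;
        const int c1 = std::max(c - 1, 0);  // left neighbor, clamped
        const int r2 = std::max(r - 1, 0);  // upper neighbor, clamped

        double ninit, npx, npy;
        {
            std::lock_guard<std::mutex> lock(mtx);
            ninit = normals[r * cols + c];
            npx = normals[r * cols + c1];
            npy = normals[r2 * cols + c];
        }

        // Stand-in for the heavy math: just average the three samples.
        const double result = (ninit + npx + npy) / 3.0;

        {
            std::lock_guard<std::mutex> lock(mtx);
            normals[r * cols + c] = result;
        }
    };

    // Hand out strided slices of the index range to a fixed set of threads,
    // roughly what concurrency::static_partitioner does.
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            for (int i = static_cast<int>(t); i < total;
                 i += static_cast<int>(nthreads))
                op(i);
        });
    }
    for (auto& th : pool)
        th.join();
}
```

Because the thread count is fixed up front, this version cannot oversubscribe the machine the way the default-partitioned parallel_for did; it serves as a baseline for what std::mutex costs under the same access pattern.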
A few things I've checked: concurrency::critical_section is not a recursive lock (but neither is std::mutex, and that works just fine).