Reputation: 199
I'm using Intel VTune Amplifier to see how my parallel application scales.
Notice I don't use any explicit lock mechanism
It scales pretty well on my 4-cores laptop (considering that there are portions of the algorithm that can't be parallelized):
However, when I test it on the Knights Landing (KNL), it scales horribly:
Notice that I'm using only 64 cores on purpose (speaking of which, if you're interested on thread affinity I've opened another question on the topic).
Why there is so much idle time? And what is _kmp_fork_barrier
? Reading about "Imbalance or Serial Spinning (OpenMP)" it seems that this is about load imbalance, but I'm already using schedule(dynamic,1)
in all omp
regions.
How can I see if this is actually load imbalance? Otherwise, what could be a possible cause?
Notice I have 3 parallel omp parallel regions:
#pragma omp parallel for collapse(2) schedule(dynamic,1)
#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector<FindAffineShapeArgs> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)
#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector<FindAffineShapeArgs> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)
This is the bottom-up section:
Is it possible that this is because of the reduction
? I knew that it was pretty efficient (using a divide-et-impere merge approach).
See here how the most expensive functions are well parallelized (most of them):
Zooming in the spinning section (as requested by commend):
OpenMP histograms as requested in the comments:
The reduction region:
The unkwown region abbout initInterTab2d
:
UPDATE:
Building OpenCV with TBB and OpenMP disabled deleted this strange parallel region iniInterTab2D
. So this is for sure OpenCV related, but I don't udnerstand how.
Upvotes: 4
Views: 2747
Reputation: 2859
You need to learn to use VTune better. It has specific OpenMP analyses which avoid you having to ask about the internals of the OpenMP runtime. Look at https://software.intel.com/en-us/node/544172 and https://software.intel.com/en-us/openmp-analysis-lin for an introduction.
p.s. Using schedule(dynamic,1)
everywhere is probably a bad idea.
p.p.s. Before you plot scaling results read my blog about how to to that.
Full disclosure: I work for Intel, sometimes on the OpenMP runtime.
Upvotes: 3