What is _kmp_fork_barrier and how to see if there is load imbalance?

Question

I'm using Intel VTune Amplifier to see how my parallel application scales.

Notice I don't use any explicit lock mechanism

It scales pretty well on my 4-cores laptop (considering that there are portions of the algorithm that can't be parallelized):

However, when I test it on the Knights Landing (KNL), it scales horribly:

Notice that I'm using only 64 cores on purpose (speaking of which, if you're interested on thread affinity I've opened another question on the topic).

Why there is so much idle time? And what is _kmp_fork_barrier? Reading about "Imbalance or Serial Spinning (OpenMP)" it seems that this is about load imbalance, but I'm already using schedule(dynamic,1) in all omp regions.

How can I see if this is actually load imbalance? Otherwise, what could be a possible cause?

Notice I have 3 parallel omp parallel regions:

#pragma omp parallel for collapse(2) schedule(dynamic,1)

#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)

#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)

This is the bottom-up section:

Is it possible that this is because of the reduction? I knew that it was pretty efficient (using a divide-et-impere merge approach).

See here how the most expensive functions are well parallelized (most of them):

Zooming in the spinning section (as requested by commend):

OpenMP histograms as requested in the comments:

The reduction region:

The unkwown region abbout initInterTab2d:

UPDATE:

Building OpenCV with TBB and OpenMP disabled deleted this strange parallel region iniInterTab2D. So this is for sure OpenCV related, but I don't udnerstand how.

What is _kmp_fork_barrier and how to see if there is load imbalance?

Answers (1)

Related Questions