cplusplusuberalles

Reputation: 199

What is _kmp_fork_barrier and how to see if there is load imbalance?

I'm using Intel VTune Amplifier to see how my parallel application scales.

Note that I don't use any explicit locking mechanism.

It scales pretty well on my 4-core laptop (considering that there are portions of the algorithm that can't be parallelized):

enter image description here

However, when I test it on the Knights Landing (KNL), it scales horribly:

enter image description here

Note that I'm using only 64 cores on purpose (speaking of which, if you're interested in thread affinity, I've opened another question on the topic).

Why is there so much idle time? And what is _kmp_fork_barrier? Reading about "Imbalance or Serial Spinning (OpenMP)", it seems that this is about load imbalance, but I'm already using schedule(dynamic,1) in all omp regions.

How can I see if this is actually load imbalance? Otherwise, what could be a possible cause?
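One way to check for load imbalance independently of the profiler is to time the loop body per thread and compare the busiest thread against the average. The sketch below assumes a hypothetical `work(i, j)` standing in for the real loop body; `measureImbalance` and `ImbalanceReport` are illustrative names, not part of the original code. The per-iteration timer calls add overhead of their own, so treat the numbers as a rough indication only.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for one iteration of the real loop body;
// the cost deliberately varies with (i, j).
static double work(int i, int j) {
    double acc = 0.0;
    const int steps = 1000 * (1 + (i + j) % 4);
    for (int k = 0; k < steps; ++k) acc += k * 1e-9;
    return acc;
}

struct ImbalanceReport {
    double maxBusy;   // seconds of loop-body time on the busiest thread
    double meanBusy;  // average loop-body time over the threads in the team
    double checksum;  // keeps the compiler from discarding work()
    double ratio() const { return meanBusy > 0.0 ? maxBusy / meanBusy : 1.0; }
};

ImbalanceReport measureImbalance(int rows, int cols) {
    using Clock = std::chrono::steady_clock;
    assert(rows > 0 && cols > 0);

    int maxThreads = 1;
#ifdef _OPENMP
    maxThreads = omp_get_max_threads();
#endif
    std::vector<double> busy(maxThreads, 0.0);
    int teamSize = 1;
    double checksum = 0.0;

    #pragma omp parallel reduction(+ : checksum)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        if (tid == 0) teamSize = omp_get_num_threads();
#endif
        double local = 0.0;  // time this thread spent inside the loop body
        #pragma omp for collapse(2) schedule(dynamic, 1) nowait
        for (int i = 0; i < rows; ++i) {
            for (int j = 0; j < cols; ++j) {
                const auto t0 = Clock::now();
                checksum += work(i, j);
                local += std::chrono::duration<double>(Clock::now() - t0).count();
            }
        }
        busy[tid] = local;  // one write per thread, read after the region ends
    }

    teamSize = std::min(teamSize, maxThreads);
    double maxBusy = 0.0, sum = 0.0;
    for (int t = 0; t < teamSize; ++t) {
        maxBusy = std::max(maxBusy, busy[t]);
        sum += busy[t];
    }
    return {maxBusy, sum / teamSize, checksum};
}
```

A ratio close to 1 suggests the loop body itself is evenly spread across threads. In that case, idle time reported at a barrier is more likely to come from serial portions, fork/join overhead, or scheduling overhead than from uneven work.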

Note that I have 3 omp parallel regions:

#pragma omp parallel for collapse(2) schedule(dynamic,1)

#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector<FindAffineShapeArgs> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)

#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector<FindAffineShapeArgs> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)
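For reference, the reduction pattern used in the last two regions can be reduced to a self-contained sketch. The `FindAffineShapeArgs` struct here is a hypothetical two-field stand-in for the real type, and `collectArgs` is an illustrative name; only the `declare reduction` line and the loop pragma mirror the code above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the real FindAffineShapeArgs payload.
struct FindAffineShapeArgs {
    int row;
    int col;
};

// Same merge expression as above: append one partial vector to another.
// With no initializer clause, each private copy starts as an empty vector.
#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector<FindAffineShapeArgs> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

std::vector<FindAffineShapeArgs> collectArgs(int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    std::vector<FindAffineShapeArgs> findAffineShapeArgs;

    // Each thread appends to its own private vector; the partial vectors
    // are merged into the shared one when the loop finishes.
    #pragma omp parallel for collapse(2) schedule(dynamic, 1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            findAffineShapeArgs.push_back({i, j});
        }
    }
    return findAffineShapeArgs;
}
```

The merged vector contains every element exactly once, but the order of elements is not guaranteed, and how the partial results are combined is left to the OpenMP implementation.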

This is the bottom-up section:

enter image description here

Is it possible that this is because of the reduction? I thought it was pretty efficient (using a divide-and-conquer merge approach).

See here how the most expensive functions are (mostly) well parallelized:

enter image description here

Zooming in on the spinning section (as requested in a comment):

enter image description here

OpenMP histograms as requested in the comments:

The reduction region:

enter image description here

The unknown region around initInterTab2D:

enter image description here

UPDATE:

Building OpenCV with TBB and OpenMP disabled made this strange initInterTab2D parallel region disappear. So this is definitely OpenCV related, but I don't understand how.

Upvotes: 4

Views: 2747

Answers (1)

Jim Cownie

Reputation: 2859

You need to learn to use VTune better. It has specific OpenMP analyses which avoid you having to ask about the internals of the OpenMP runtime. Look at https://software.intel.com/en-us/node/544172 and https://software.intel.com/en-us/openmp-analysis-lin for an introduction.

p.s. Using schedule(dynamic,1) everywhere is probably a bad idea.
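As a rough illustration of that point (a sketch, not a tuned recommendation): with a chunk size of 1, every single iteration is handed out separately, so the per-iteration scheduling cost can become significant when iterations are short and many threads are competing. Making the chunk size a parameter lets you compare settings. The function `sumWithChunk` and the `cost` helper below are hypothetical and only stand in for a real loop.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-element work; stands in for the real loop body.
static std::uint64_t cost(std::uint64_t x) { return (x * x) % 7; }

// Sums cost(data[i]) with dynamic scheduling, handing out `chunk`
// consecutive iterations at a time instead of one.
std::uint64_t sumWithChunk(const std::vector<std::uint64_t>& data, int chunk) {
    assert(chunk > 0);
    std::uint64_t total = 0;
    const long n = static_cast<long>(data.size());

    #pragma omp parallel for schedule(dynamic, chunk) reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += cost(data[i]);
    }
    return total;
}
```

The result is the same for any chunk size; only the scheduling granularity changes. Another option is schedule(runtime), which takes the schedule from the OMP_SCHEDULE environment variable so that different settings can be tried without recompiling.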

p.p.s. Before you plot scaling results, read my blog about how to do that.

Full disclosure: I work for Intel, sometimes on the OpenMP runtime.

Upvotes: 3
