uiqbal

Reputation: 111

TBB parallel_for with a small number of threads

I have written multi-view face detection code using the OpenCV face detector. I run five detectors (each trained for a different pose angle) over an image and use their weights to detect faces in it. I parallelized the code with TBB parallel_for, but that improved performance by a factor of only 1.7. Is there a better way to run the five detectors in parallel?

I am running my code on a cluster with 16 cores. I think the number of threads (5 in my case) is too small to use the machine's full power.

Any suggestions?

Thanks,

Upvotes: 0

Views: 822

Answers (1)

Arch D. Robison

Reputation: 4049

Some possible problems to look into:

  • One of the detectors takes longer to run than the others. For example, if one detector takes 4 units of time and the other four each take 1 unit, the total work is 8 units but the slow detector alone needs 4, so the maximum possible speedup is 2x. Parallelizing the slow detector itself might help in this kind of situation.
  • The detectors run so fast that the parallel_for does not have time to spread the work. If the detectors each take at least 0.1 sec, this should not be a problem.
  • Memory bandwidth can be a limiting resource, particularly if working sets do not fit in outer-level cache.
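
The load-imbalance arithmetic in the first bullet can be checked with a small helper. This is only a sketch: `speedup_bound` is a hypothetical name, the task times are made-up estimates, and it assumes each detector runs serially on a single thread with no scheduling overhead or memory-bandwidth limits, so real speedups will be lower.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Upper bound on the speedup of a set of independent tasks, assuming each
// task runs serially on one thread (hypothetical helper, not part of TBB).
// task_times: estimated run time of each task, in any consistent unit.
// threads:    number of worker threads available.
double speedup_bound(const std::vector<double>& task_times, int threads) {
    assert(!task_times.empty() && threads > 0);
    const double total =
        std::accumulate(task_times.begin(), task_times.end(), 0.0);
    const double longest =
        *std::max_element(task_times.begin(), task_times.end());
    // The parallel run can be no shorter than the longest single task,
    // nor shorter than the total work spread evenly over all threads.
    const double best_parallel = std::max(longest, total / threads);
    return total / best_parallel;
}
```

With the times from the example, `speedup_bound({4, 1, 1, 1, 1}, 16)` gives 2, and with five equally long detectors it gives 5. Either way, 16 cores cannot help beyond 5x unless the work inside each detector is also split up.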

A profiler such as Intel(R) VTune(TM) Amplifier can sometimes help to track down these problems. Both commercial and non-commercial licenses exist for Amplifier. [Disclaimer: I work for Intel.]
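
Before reaching for a profiler, a rough first check is to time each detector once, serially, and compare the numbers against the first two points above. The sketch below uses only the standard library; `time_tasks` is a hypothetical helper, and each `std::function` stands in for one detector call, which is an assumption about how the code is organized.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Runs each task once, one after another, and returns the wall-clock time
// of each in seconds (hypothetical helper for a quick imbalance check).
std::vector<double> time_tasks(
        const std::vector<std::function<void()>>& tasks) {
    std::vector<double> seconds;
    seconds.reserve(tasks.size());
    for (const auto& task : tasks) {
        const auto start = std::chrono::steady_clock::now();
        task();  // e.g. a lambda that runs one detector over the image
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(
            std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

If one entry dominates the rest, load imbalance is the likely cause. If every entry is far below 0.1 sec, scheduling overhead is worth a look. Timings this coarse say nothing about memory bandwidth, which is where a real profiler helps.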

Upvotes: 1
