Reputation: 165
I'm working on visual object detection and I use the cascade classifier from OpenCV. It works well but it's too slow for me. I profiled with VTune to find the hotspots: out of 140s of CPU time (real time was about 60s), 123s is overhead. cv::CascadeClassifier uses TBB to run faster, but it seems that all the TBB threads wait more than they should. Here is the code:
void operator()(const Range& range) const
{
    Ptr<FeatureEvaluator> evaluator = classifier->featureEvaluator->clone();
    Size winSize(cvRound(classifier->data.origWinSize.width * scalingFactor),
                 cvRound(classifier->data.origWinSize.height * scalingFactor));
    int y1 = range.start * stripSize;
    int y2 = min(range.end * stripSize, processingRectSize.height);
    for( int y = y1; y < y2; y += yStep )
    {
        for( int x = 0; x < processingRectSize.width; x += yStep )
        {
            if( !mask.empty() && mask.at<uchar>(Point(x, y)) == 0 )
                continue;

            double gypWeight;
            int result = classifier->runAt(evaluator, Point(x, y), gypWeight);
#if defined (LOG_CASCADE_STATISTIC)
            logger.setPoint(Point(x, y), result);
#endif
            if( rejectLevels )
            {
                if( result == 1 )
                    result = -(int)classifier->data.stages.size();
                if( classifier->data.stages.size() + result < 4 )
                {
                    mtx->lock();
                    rectangles->push_back(Rect(cvRound(x*scalingFactor), cvRound(y*scalingFactor),
                                               winSize.width, winSize.height));
                    rejectLevels->push_back(-result);
                    levelWeights->push_back(gypWeight);
                    mtx->unlock();
                }
            }
            else if( result > 0 )
            {
                mtx->lock();
                rectangles->push_back(Rect(cvRound(x*scalingFactor), cvRound(y*scalingFactor),
                                           winSize.width, winSize.height));
                mtx->unlock();
            }
            if( result == 0 )
                x += yStep;
        }
    }
}
I think the problem comes from merging the results: there is too much mutex locking, and threads have to wait too often. This part of the code is called many times, and there are very few threads (3 in my case). I tried creating a local vector for each thread (I didn't try a list, because the Rect type is very small) and merging them all at the end. That reduced the overhead (to less than 10s of the 140s of CPU time), but I would like to reduce it further.
Here is my question: is there a way to merge results from different TBB threads efficiently (i.e. reduce the overhead time)?
EDIT: In my case, the real culprit turned out to be a linking error. Creating local vectors and merging them at the end under a mutex works well: I now have 0.1s of overhead on 140s of CPU time. Mine is a particular case with few, very small elements; Anton's answer seems more generic.
Upvotes: 2
Views: 499
Reputation: 6587
There is another and perhaps more efficient way of combining the results: use the combinable or ETS (enumerable_thread_specific) classes to collect results per thread/task via .local() (don't work with threads directly), and then merge everything at the end using .combine().
Upvotes: 3
Reputation: 353
You can try TBB's concurrent_vector. Its grow_by interface can help you reduce insertion overhead: collect elements in a small (e.g. 16-element) array on the stack, then append them all to the concurrent_vector in a single call.
You can also replace push_back with emplace_back when using concurrent_vector with C++11.
Upvotes: 2