Reputation: 165
I'm working on visual object detection and I use the cascade classifier from OpenCV. It works well but it's too slow for me. I profiled with VTune to find the hotspots: out of 140s of CPU time (real time was about 60s), 123s is overhead. cv::CascadeClassifier uses TBB to run faster, but it seems that all the TBB threads wait more than they should. Here is the code:
void operator()(const Range& range) const
{
    Ptr<FeatureEvaluator> evaluator = classifier->featureEvaluator->clone();
    Size winSize(cvRound(classifier->data.origWinSize.width * scalingFactor),
                 cvRound(classifier->data.origWinSize.height * scalingFactor));
    int y1 = range.start * stripSize;
    int y2 = min(range.end * stripSize, processingRectSize.height);
    for( int y = y1; y < y2; y += yStep )
    {
        for( int x = 0; x < processingRectSize.width; x += yStep )
        {
            if( !mask.empty() && mask.at<uchar>(Point(x, y)) == 0 )
                continue;

            double gypWeight;
            int result = classifier->runAt(evaluator, Point(x, y), gypWeight);
#if defined (LOG_CASCADE_STATISTIC)
            logger.setPoint(Point(x, y), result);
#endif
            if( rejectLevels )
            {
                if( result == 1 )
                    result = -(int)classifier->data.stages.size();
                if( classifier->data.stages.size() + result < 4 )
                {
                    mtx->lock();
                    rectangles->push_back(Rect(cvRound(x*scalingFactor), cvRound(y*scalingFactor),
                                               winSize.width, winSize.height));
                    rejectLevels->push_back(-result);
                    levelWeights->push_back(gypWeight);
                    mtx->unlock();
                }
            }
            else if( result > 0 )
            {
                mtx->lock();
                rectangles->push_back(Rect(cvRound(x*scalingFactor), cvRound(y*scalingFactor),
                                           winSize.width, winSize.height));
                mtx->unlock();
            }
            if( result == 0 )
                x += yStep;
        }
    }
}
I think the problem comes from merging the results: there is too much mutex locking, and threads have to wait too often. This part of the code is called many times, and there are very few threads (3 in my case). I tried creating a local vector for each thread (I didn't try a list, because the Rect type is very small) and merging them all at the end. That reduced the overhead (to less than 10s of the 140s of CPU time), but I would like to reduce it further.
Here is my question: is there a way to merge results from different TBB threads efficiently (i.e. reduce the overhead time)?
EDIT: In my case, the real culprit turned out to be a linking error. Creating local vectors and merging them at the end under a mutex works well: I now have 0.1s of overhead on 140s of CPU time. Mine is a particular case with few, very small elements; Anton's answer seems more generic.
Upvotes: 2
Views: 499
Reputation: 6587
There is another and perhaps more efficient way of combining the results: use the combinable or ETS (enumerable_thread_specific) classes to collect results per thread/task via .local() (don't work with threads directly), and then merge everything at the end using .combine().
Upvotes: 3
Reputation: 353
You can try TBB's concurrent_vector. Its grow_by interface can help you reduce insertion overhead: collect elements in a small (e.g. 16-element) array on the stack, then append them all to the concurrent_vector in a single call.
You can also replace push_back with emplace_back when using concurrent_vector with C++11.
Upvotes: 2