c++ thread creation big overhead

Question

I have the following code, which confuses me a lot:

float OverlapRate(cv::Mat& model, cv::Mat& img) {
    if ((model.rows!=img.rows)||(model.cols!=img.cols)) {
        return 0;
    }

    cv::Mat bgr[3];
    cv::split(img, bgr);

    int counter = 0;
    float b_average = 0, g_average = 0, r_average = 0;
    for (int i = 0; i < model.rows; i++) {
        for (int j = 0; j < model.cols; j++) {
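            // Element type assumed to be 8-bit (uchar): the mask is compared with 255
            if (model.at<uchar>(i, j) == 255) {
                counter++;
                b_average += bgr[0].at<uchar>(i, j);
                g_average += bgr[1].at<uchar>(i, j);
                r_average += bgr[2].at<uchar>(i, j);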
            }
        }
    }

    b_average = b_average / counter;
    g_average = g_average / counter;
    r_average = r_average / counter;

    counter = 0;
    float b_stde = 0, g_stde = 0, r_stde = 0;
    for (int i = 0; i < model.rows; i++) {
        for (int j = 0; j < model.cols; j++) {
            if (model.at<uchar>(i, j) == 255) {
                counter++;
                b_stde += std::pow((bgr[0].at<uchar>(i, j) - b_average), 2);
                g_stde += std::pow((bgr[1].at<uchar>(i, j) - g_average), 2);
                r_stde += std::pow((bgr[2].at<uchar>(i, j) - r_average), 2);
            }
        }
    }

    b_stde = std::sqrt(b_stde / counter);
    g_stde = std::sqrt(g_stde / counter);
    r_stde = std::sqrt(r_stde / counter);

    return (b_stde + g_stde + r_stde) / 3;
}


void work(cv::Mat& model, cv::Mat& img, int index, std::map<int, float>& results){
    results[index] = OverlapRate(model, img);
}

int OCR(cv::Mat& a, std::map<int, cv::Mat>& b, const std::vector<int>& possible_values)
{
        int recog_value = -1;
        clock_t start = clock(); 

        std::thread threads[10];
        std::map<int, float> results;
        for(int i=0; i<10; i++)
        {
            threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
        }

        for(int i=0; i<10; i++)
            threads[i].join();


        float min_score = 1000;
        int min_index = -1;
        for(auto& it:results)
        {
            if (it.second < min_score) {
                min_score = it.second;
                min_index = it.first;
            }
        }

        clock_t end = clock();
        clock_t t = end - start;
        printf("It took me %ld clicks (%f seconds).\n", (long)t, ((float)t)/CLOCKS_PER_SEC);

        recog_value = min_index;
        return recog_value;
}

The code above performs simple optical character recognition: it takes one character image as input, compares it against ten standard models for the digits 0 to 9, and outputs the digit whose model is most similar.

When I run the ten comparisons one after another, without threads, recognition takes 7 ms. BUT when I run them in ten threads at the same time, a single character recognition takes 1 or 2 seconds.

What is the reason? The debug information shows that thread creation consumes a lot of the time, i.e. this line:

threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));

Why? Thanks.

Mark Lakata · Accepted Answer

Running multiple threads is useful in only 2 contexts: you have multiple hardware cores (so the threads can run simultaneously) OR each thread is waiting for IO (so one thread can run while another thread is waiting for IO, like a disk load or network transfer).

Your code is not IO bound, so I hope you have 10 cores to run your code. If you don't have 10 cores, then each thread will be competing for scarce resources, and the scarcest resource of all is L1 cache space. If all 10 threads are fighting for 1 or 2 cores and their cache space, then the caches will be "thrashing" and give you 10-100x slower performance.
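To check how many threads the hardware can actually run at once, the standard library offers `std::thread::hardware_concurrency()`. The following is a minimal sketch, not part of the original code: `pick_thread_count` is a hypothetical helper name, and the fallback to 1 is an assumption for the case where the library cannot determine the value and returns 0.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: choose how many worker threads to start
// for `jobs` independent CPU-bound tasks.
unsigned pick_thread_count(unsigned jobs) {
    assert(jobs > 0);
    // Number of hardware threads (cores x hyper-threads); only a hint,
    // and it may be 0 when the value cannot be determined.
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;  // assumption: fall back to a single thread
    // Never start more threads than there are tasks or hardware threads.
    return std::min(jobs, hw);
}
```

For the ten digit models in the question, `pick_thread_count(10)` would give at most 10, and fewer on a machine with fewer hardware threads.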

Try benchmarking your code 10 different times, with N = 1 to 10 threads, and see how it performs.
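Such a benchmark could be sketched roughly as below. Everything here is illustrative: `dummy_work` is a made-up CPU-bound stand-in for `OverlapRate`, and `time_with_threads` is a hypothetical helper. It times with `std::chrono::steady_clock` because `clock()` reports processor time, which on some platforms is summed over all threads and so may not reflect elapsed time. Each thread writes only to its own slots of a pre-sized vector, which avoids concurrent insertion into a shared `std::map`.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <thread>
#include <vector>

// Made-up CPU-bound task standing in for OverlapRate (assumption).
float dummy_work(int seed) {
    float acc = 0.0f;
    for (int i = 1; i <= 200000; ++i)
        acc += std::sqrt(static_cast<float>(i + seed));
    return acc;
}

// Runs `jobs` tasks spread over `num_threads` threads and returns the
// elapsed wall-clock time in milliseconds.
double time_with_threads(int num_threads, int jobs, std::vector<float>& results) {
    assert(num_threads > 0 && jobs > 0);
    results.assign(jobs, 0.0f);  // pre-size so threads never resize the vector

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t) {
        pool.emplace_back([&results, t, num_threads, jobs] {
            // Thread t handles jobs t, t+N, t+2N, ...; the slots are
            // disjoint between threads, so no locking is needed.
            for (int j = t; j < jobs; j += num_threads)
                results[j] = dummy_work(j);
        });
    }
    for (auto& th : pool) th.join();
    auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Calling `time_with_threads(n, 10, results)` for n = 1 to 10 and printing the returned times should show where adding threads stops helping on a given machine.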

(There is one more reason to have multiple threads: cores that support hyper-threading. The OS will "pretend" that 1 core has 2 virtual processors, but this doesn't give you 2x performance; you get something between 1x and 2x. To get this partial boost, you have to run 2 threads per core.)
