Johnnylin

Reputation: 535

c++ thread creation big overhead

I have the following code, which confuses me a lot:

float OverlapRate(cv::Mat& model, cv::Mat& img) {
    if ((model.rows!=img.rows)||(model.cols!=img.cols)) {
        return 0;
    }

    cv::Mat bgr[3];
    cv::split(img, bgr);

    int counter = 0;
    float b_average = 0, g_average = 0, r_average = 0;
    for (int i = 0; i < model.rows; i++) {
        for (int j = 0; j < model.cols; j++) {
            if((model.at<uchar>(i,j)==255)){
                counter++;
                b_average += bgr[0].at<uchar>(i, j);
                g_average += bgr[1].at<uchar>(i, j);
                r_average += bgr[2].at<uchar>(i, j);
            }
        }
    }

    b_average = b_average / counter;
    g_average = g_average / counter;
    r_average = r_average / counter;

    counter = 0;
    float b_stde = 0, g_stde = 0, r_stde = 0;
    for (int i = 0; i < model.rows; i++) {
        for (int j = 0; j < model.cols; j++) {
            if((model.at<uchar>(i,j)==255)){
                counter++;
                b_stde += std::pow((bgr[0].at<uchar>(i, j) - b_average), 2); 
                g_stde += std::pow((bgr[1].at<uchar>(i, j) - g_average), 2); 
                r_stde += std::pow((bgr[2].at<uchar>(i, j) - r_average), 2);                 
            }
        }
    }

    b_stde = std::sqrt(b_stde / counter);
    g_stde = std::sqrt(g_stde / counter);
    r_stde = std::sqrt(r_stde / counter);

    return (b_stde + g_stde + r_stde) / 3;
}


void work(cv::Mat& model, cv::Mat& img, int index, std::map<int, float>& results){
    results[index] = OverlapRate(model, img);
}

int OCR(cv::Mat& a, std::map<int,cv::Mat>& b, const std::vector<int>& possible_values)
{
        int recog_value = -1;
        clock_t start = clock(); 

        std::thread threads[10];
        std::map<int, float> results;
        for(int i=0; i<10; i++)
        {
            threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
        }

        for(int i=0; i<10; i++)
            threads[i].join();


        float min_score = 1000;
        int min_index = -1;
        for(auto& it:results)
        {
            if (it.second < min_score) {
                min_score = it.second;
                min_index = it.first;
            }
        }

        clock_t end = clock();
        clock_t t = end - start;
        printf ("It took me %ld clicks (%f seconds) .\n",(long)t,((float)t)/CLOCKS_PER_SEC);

        recog_value = min_index;
        return recog_value;
}

The code above does simple optical character recognition. I take one optical character as input, compare it against ten standard character models (0 - 9) to find the most similar one, and output the recognized value.

When I execute the above code sequentially, without the ten threads, it takes 7 ms. BUT when I use ten threads, the time rises to 1 or 2 seconds for a single optical character recognition.

What is the reason? The debug information indicates that thread creation consumes a lot of time, which is this code:

threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));

Why? Thanks.

Upvotes: 0

Views: 1206

Answers (2)

Mark Lakata

Reputation: 20838

Running multiple threads is useful in only 2 contexts: you have multiple hardware cores (so the threads can run simultaneously) OR each thread is waiting for IO (so one thread can run while another thread is waiting for IO, like a disk load or network transfer).

Your code is not IO bound, so I hope you have 10 cores to run your code. If you don't have 10 cores, then each thread will be competing for scarce resources, and the scarcest resource of all is L1 cache space. If all 10 threads are fighting for 1 or 2 cores and their cache space, then the caches will be "thrashing" and give you 10-100x slower performance.
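To check how many threads the hardware can actually run at once, the standard library offers std::thread::hardware_concurrency(). The following is a minimal sketch; pick_thread_count is a hypothetical helper name, not something from the question's code, and the value returned by hardware_concurrency() is only a hint (it may be 0 when it cannot be determined).

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: choose a worker count that does not exceed the
// number of hardware threads reported by the standard library.
unsigned pick_thread_count(unsigned tasks) {
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) {
        hw = 1;  // value unknown: fall back to a single worker
    }
    // Never more workers than tasks, never fewer than one.
    return std::max(1u, std::min(tasks, hw));
}
```

For the ten character models in the question, pick_thread_count(10) would give at most 10 workers, and fewer on a machine with fewer hardware threads.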

Try benchmarking your code 10 times, with N = 1 to 10 threads, and see how it performs.

(There is one more reason to have multiple threads: when the cores support hyper-threading. The OS will "pretend" that 1 core has 2 virtual processors, but you don't get 2x performance from this; you get something between 1x and 2x. To get this partial boost, you have to run 2 threads per core.)
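A benchmark along those lines could look like the sketch below. It is not the question's OverlapRate; it uses a stand-in workload (a sum of squares) and a hypothetical function name, timed_sum_of_squares. It measures wall-clock time with std::chrono::steady_clock, because clock() reports processor time, which on many platforms is added up over all threads and so can overstate the elapsed time of a multithreaded run. Each worker writes only to its own preallocated slot, so no locking is needed.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct RunResult {
    std::uint64_t sum;  // result of the stand-in workload
    double millis;      // wall-clock time, including thread creation and join
};

// Hypothetical benchmark helper: sum data[i] * data[i] using n_threads workers.
RunResult timed_sum_of_squares(const std::vector<std::uint32_t>& data,
                               unsigned n_threads) {
    if (n_threads == 0) n_threads = 1;
    std::vector<std::uint64_t> partial(n_threads, 0);  // one slot per worker
    std::vector<std::thread> workers;
    workers.reserve(n_threads);

    const auto start = std::chrono::steady_clock::now();
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            std::uint64_t local = 0;
            for (std::size_t i = begin; i < end; ++i) {
                local += static_cast<std::uint64_t>(data[i]) * data[i];
            }
            partial[t] = local;  // each worker touches only its own slot
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();

    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return RunResult{total, elapsed.count()};
}
```

Calling it in a loop for n_threads = 1 to 10 on the same data and printing millis gives the comparison suggested above; run a release build outside the debugger, since both affect thread start-up time.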

Upvotes: 2

OrdinaryNick

Reputation: 248

Using threads is not always efficient. If you use threads on a small problem, managing the threads costs more time and resources than solving the problem itself. You need enough work for each thread, and the work must be distributed well across the threads.

If you want to know how many threads you can use on a problem, or how big the problem must be, look up the Isoeffective functions (psi1, psi2, psi3) from the theory of parallel computers.
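One way to see whether a problem is "small" in this sense is to measure what a thread costs on your machine and compare it with the time the work itself takes. The sketch below is an assumption-laden illustration: average_spawn_join_micros is a hypothetical helper, and the number it returns varies a lot with the OS, the build type, and whether a debugger is attached.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: average wall-clock cost, in microseconds, of creating
// and joining one thread that does no work at all.
double average_spawn_join_micros(int repetitions) {
    if (repetitions <= 0) return 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i) {
        std::thread t([] {});  // empty task: only the management cost remains
        t.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repetitions;
}
```

If the per-thread cost measured this way is not small compared with the time one unit of work takes on its own, splitting that work into one thread per unit is unlikely to pay off.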

Upvotes: 0
