Inefficient fibonacci series using intel TBB much slower than non-threaded implementation

Question

It came out as a surprise when my parallelized version of fibonacci implementation(inefficient, just to compare the performance of the library) was much much slower than normal inefficient implementation, even after using all the 8 logical cores of my i7-6700HQ processor. Processor fans started going haywire the processing time was very slow compared to non-parallel implementation.

The example is directly from the TBB tutorial on intel - https://www.threadingbuildingblocks.org/tutorial-intel-tbb-task-based-programming

Here is my code

#include 
#include 
#include 

#define FIB_NUM 40

long fib1(int n)
{
    if(n < 2) return n;

    else
    {
        int x, y;
        tbb::task_group g;
        g.run([&]{x=fib1(n - 1);});
        g.run([&]{y=fib1(n - 2);});
        g.wait();
        return x + y;
    }
}

long fib2(int n)
{
    return n < 2? n : fib2(n - 1) + fib2(n - 2);
}

int main()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << fib2(FIB_NUM) << std::endl;
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << (t2 - t1).count() << std::endl;
    t1 = std::chrono::high_resolution_clock::now();
    std::cout << fib1(FIB_NUM) << std::endl;
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << (t2 - t1).count() << std::endl;
}

I dont know what I am doing wrong. It would be helpful if someone could point it out.

Thanks

Alex · Accepted Answer

The main problem of the example is small tasks. The leaf tasks (n<2) just calculate return n. Undoubtedly, it is inefficient for parallelism. The example can be improved with "cutoff" condition when the sub-problem is considered to be too small for parallelization. Let's suppose that there is no sense to calculate the first 12 fibonacci numbers in parallel and we will use a serial implementation instead:

long fib1(int n)
{
    // Use a serial implementation for "small" numbers.
    if(n < 12) return fib2(n);
    else
    {
        int x, y;
        tbb::task_group g;
        g.run([&]{x=fib1(n - 1);});
        g.run([&]{y=fib1(n - 2);});
        g.wait();
        return x + y;
    }
}

Perhaps, you want to read articles about Divide and Conquer and The Task Scheduler.

P.S. Intel TBB uses task based approach for parallelism. The method tbb::task_group::run creates a task (not a thread) that is executed when one of threads from the thread pool is available. So it does not matter how many tasks in the system - the number of threads is always limited.

Inefficient fibonacci series using intel TBB much slower than non-threaded implementation

Answers (1)

Related Questions