C++11 parallel for implementation

Question

First of all, I made a thread pool and tried to do some heavy arithmetic operations on an array of 40960 float elements.

A single-threaded approach took 0.0009 seconds, while a parallel approach with 4 threads running concurrently took 0.0003 seconds. In this implementation, I manually split the task into 4 parts and queued them into the thread pool.

Now I want to provide a general method parfor for my thread pool. I tried this:

    void parfor(int begin, int end, std::function<void(int)> func)
    {
        int delta = (end - begin) / M_count;
        for (int i = 0; i < M_count; ++i)
            queue([=]{
                int localbegin = begin + i*delta;
                int localend = (i == M_count - 1) ? end : localbegin + delta;
                for (int it = localbegin; it < localend; ++it)
                    func(it);
            });
        wait();
    }

Here M_count is the number of threads. The execution time becomes 0.003 seconds (about 10 times that of the version with manually distributed work). I guess std::function has a large runtime overhead, but I don't know of an alternative approach. Could you give me some advice? Many thanks.

Edit: According to Rapptz's advice, I tried this:

template <typename Function>
void parfor(int begin, int end, Function)

And used it like this:

pool.parfor(0, 40960, [&](int i){
    buff[i] = pow5(buff[i]);
});

It shows some errors:

error C2371: 'it' : redefinition; different basic types 
error C2512: 'wmain::' : no appropriate default constructor available

I think it treats the lambda as a type but don't know how to solve it...

Tony Delroy · Accepted Answer

(Too much for a comment...) This is just explaining how to implement Rapptz's suggestion of using a template parameter to specify the function (so it can be inlined).

Consider the following code:

#include <iostream>

void f(int n) { std::cout << "f(" << n << ");\n"; }
void g(int n) { std::cout << "g(" << n << ");\n"; }

template <typename Function>
void t(Function function, int n)
{
    static int x;
    std::cout << "&x " << &x << '\n';
    function(n);
}

struct FuncF { static void f(int n) { std::cout << "Ff(" << n << ");\n"; } };
struct FuncG { static void f(int n) { std::cout << "Gf(" << n << ");\n"; } };

template <typename Function>
void ft(int n)
{
    static int x;
    std::cout << "&x " << &x << '\n';
    Function::f(n);
}

int main()
{
    t(f, 42);
    t(g, 42);

    ft<FuncF>(42);
    ft<FuncG>(42);
}

This prints something like:

&x 00421760
f(42);
&x 00421760
g(42);
&x 00421764
Ff(42);
&x 00421768
Gf(42);

Note that the first two print the same address for x... that's because only one instantiation of the template is needed as the function type is the same for both calls. The ft template uses the template parameter to access the function without having a run-time function argument involved, so there are two instantiations yielding different addresses for the local static x.

To get your function calls inlined, you should adopt an approach similar to FuncF/FuncG and ft.
