Stored lambda function calls are very slow - fix or workaround?

Question

In an attempt to make a more usable version of the code I wrote for an answer to another question, I used a lambda function to process an individual unit. This is a work in progress. I've got the "client" syntax looking pretty nice:

// for loop split into 4 threads, calling doThing for each index
parloop(4, 0, 100000000, [](int i) { doThing(i); });

However, I have an issue. Whenever I call the saved lambda, it takes up a ton of CPU time. doThing itself is an empty stub. If I just comment out the internal call to the lambda, then the speed returns to normal (4 times speedup for 4 threads). I'm using std::function to save the reference to the lambda.

My question is: is there a better way that the standard library internally manages lambdas for large sets of data that I haven't come across?

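#include <algorithm>
#include <chrono>
#include <cmath>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

using std::cin;
using std::cout;

struct parloop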
{
public:
    std::vector<std::thread> myThreads;
    int numThreads, rangeStart, rangeEnd;
    std::function<void(int)> lambda;

    parloop(int _numThreads, int _rangeStart, int _rangeEnd, std::function<void(int)> _lambda) //
        : numThreads(_numThreads), rangeStart(_rangeStart), rangeEnd(_rangeEnd), lambda(_lambda) //
    {
        init();
        exit();
    }

    void init()
    {
        myThreads.resize(numThreads);

        for (int i = 0; i < numThreads; ++i)
        {
            myThreads[i] = std::thread(myThreadFunction, this, chunkStart(i), chunkEnd(i));
        }
    }

    void exit()
    {
        for (int i = 0; i < numThreads; ++i)
        {
            myThreads[i].join();
        }
    }

    int rangeJump()
    {
        return ceil(float(rangeEnd - rangeStart) / float(numThreads));
    }

    int chunkStart(int i)
    {
        return rangeJump() * i;
    }

    int chunkEnd(int i)
    {
        return std::min(rangeJump() * (i + 1) - 1, rangeEnd);
    }

    static void myThreadFunction(parloop *self, int start, int end) //
    {
        std::function<void(int)> lambda = self->lambda;
        // loop over this thread's chunk and invoke the stored callable for each index
        for (int i = start; i <= end; ++i)
        {
            lambda(i); // commenting this out speeds things up back to normal
        }
    }

};

void doThing(int i) // "payload" of the lambda function
{
}

int main()
{
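    std::chrono::high_resolution_clock timer; // assumed clock type; its declaration is not shown in the question
    auto start = timer.now();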
    auto stop = timer.now();


    // run 4 trials of each number of threads
    for (int x = 1; x <= 4; ++x)
    {
        // test between 1-8 threads
        for (int numThreads = 1; numThreads <= 8; ++numThreads)
        {
            start = timer.now();

            // this is the line of code which calls doThing in the loop

            parloop(numThreads, 0, 100000000, [](int i) { doThing(i); });

            stop = timer.now();

            cout << numThreads << " Time = " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() / 1000000.0f << " ms\n";
            //cout << "\t\tsimple list, time was " << deltaTime2 / 1000000.0f << " ms\n";
        }
    }

    cin.ignore();
    cin.get();
    return 0;
}

Vittorio Romeo · Accepted Answer

"I'm using std::function to save the reference to the lambda."

That's one possible problem: std::function is not a zero-runtime-cost abstraction. It is a type-erased wrapper, so invoking its operator() has a virtual-call-like cost, and it can also heap-allocate the stored callable (which could mean a cache miss per call).
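
To make "type-erased" concrete, here is a minimal sketch (the names add_one, TimesTwo and apply_all are hypothetical, chosen only for illustration). One std::function object holds three unrelated callables in turn, so its static type says nothing about what will actually run, and the call has to go through an indirection stored inside the wrapper:

```cpp
#include <cassert>
#include <functional>

// Hypothetical callables used only for this illustration.
inline int add_one(int x) { return x + 1; }

struct TimesTwo
{
    int operator()(int x) const { return x * 2; }
};

// Applies three unrelated callables through the same std::function object.
// The wrapper's static type is always std::function<int(int)>, so each call
// is dispatched through an indirection held inside the wrapper.
inline int apply_all(int x)
{
    std::function<int(int)> f = add_one;   // plain function
    int result = f(x);

    f = TimesTwo{};                        // function object
    result = f(result);

    f = [](int v) { return v - 3; };       // lambda
    result = f(result);

    assert(static_cast<bool>(f));          // the wrapper still holds a target
    return result;
}
```

Whether the stored callable lives on the heap or in a small internal buffer depends on the standard library implementation and on the size of the callable.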

If you want to store your lambda in such a way that does not introduce additional overhead and that allows the compiler to inline it, you should use a template parameter. This is not always possible, but might fit your use case. Example:

template <typename TFunction>
struct parloop
{
public:
    std::thread **myThreads;
    int numThreads, rangeStart, rangeEnd;
    TFunction lambda;

    parloop(TFunction&& _lambda,
            int _numThreads, int _rangeStart, int _rangeEnd)
        : lambda(std::move(_lambda)),
          numThreads(_numThreads), rangeStart(_rangeStart),
          rangeEnd(_rangeEnd)
    {
        init();
        exit();
    }

// ...

To deduce the type of the lambda, you can use a helper function:

template <typename TF, typename... TArgs>
auto make_parloop(TF&& lambda, TArgs&&... xs)
{
    return parloop<std::decay_t<TF>>(
        std::forward<TF>(lambda), std::forward<TArgs>(xs)...);
}

Usage:

auto p = make_parloop([](int i) { doThing(i); }, 
                      numThreads, 0, 100000000);
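
Since the struct above is elided, here is one way the pieces could fit together as a complete sketch. Several details are assumptions rather than part of the original code: the threads are kept in a std::vector<std::thread>, the range is treated as half-open [rangeStart, rangeEnd), the chunk size uses integer ceiling division, and parallel_index_sum is a hypothetical helper added only to check that every index is visited once. It needs C++14 or later.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <type_traits>
#include <utility>
#include <vector>

// Sketch only: a templated variant of the question's parloop.
template <typename TFunction>
struct parloop
{
    TFunction lambda;
    int numThreads, rangeStart, rangeEnd;
    std::vector<std::thread> myThreads;

    parloop(TFunction&& _lambda, int _numThreads, int _rangeStart, int _rangeEnd)
        : lambda(std::move(_lambda)), numThreads(_numThreads),
          rangeStart(_rangeStart), rangeEnd(_rangeEnd)
    {
        assert(numThreads > 0 && rangeStart <= rangeEnd);
        init();
        exit();
    }

    void init()
    {
        // Integer ceiling division gives the per-thread chunk size.
        const int jump = (rangeEnd - rangeStart + numThreads - 1) / numThreads;
        for (int t = 0; t < numThreads; ++t)
        {
            const int start = std::min(rangeStart + jump * t, rangeEnd);
            const int end = std::min(start + jump, rangeEnd);
            // The concrete type of `lambda` is known here, so the compiler
            // is free to inline the call inside the loop.
            myThreads.emplace_back([this, start, end] {
                for (int i = start; i < end; ++i) lambda(i);
            });
        }
    }

    void exit()
    {
        for (auto& t : myThreads) t.join();
    }
};

template <typename TF, typename... TArgs>
auto make_parloop(TF&& lambda, TArgs&&... xs)
{
    return parloop<std::decay_t<TF>>(
        std::forward<TF>(lambda), std::forward<TArgs>(xs)...);
}

// Hypothetical check: sums the indices 0..n-1 across the worker threads.
inline long long parallel_index_sum(int threads, int n)
{
    std::atomic<long long> total{0};
    make_parloop([&total](int i) { total += i; }, threads, 0, n);
    return total.load();
}
```

Because the constructor takes TFunction&&, this sketch expects the lambda to be passed as a temporary, as in the usage line above.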

I wrote an article that's related to the subject:
"Passing functions to functions"

It contains some benchmarks that show how much assembly is generated for std::function compared to a template parameter and other solutions.
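
If you would rather measure the difference on your own machine, a rough harness could look like the sketch below. The names sum_with_function, sum_with_template and elapsed_ms are hypothetical, and the numbers you get will depend heavily on the compiler, the optimization flags and the work done per call, so treat it as a starting point rather than a benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Calls a type-erased callable n times; the target is only known at run time.
inline long long sum_with_function(const std::function<long long(int)>& f, int n)
{
    long long total = 0;
    for (int i = 0; i < n; ++i) total += f(i);
    return total;
}

// Calls a callable whose concrete type is a template parameter.
template <typename F>
long long sum_with_template(F&& f, int n)
{
    long long total = 0;
    for (int i = 0; i < n; ++i) total += f(i);
    return total;
}

// Hypothetical helper: runs `work` once and returns the elapsed milliseconds.
template <typename Work>
double elapsed_ms(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrap each of the two sum functions in elapsed_ms with the same lambda and the same n, build with optimizations enabled, and use the returned sums so the compiler cannot discard the loops.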
