Herbert
Herbert

Reputation: 5645

Fast and generic use of lambda functions

Motivation

I created a header file which wraps Matlab's mex functionality in c++11 classes; especially for MxNxC images. Two functions I created are forEach, which iterates over each pixel in the image, and also a forKernel, which given a kernel and pixel in the image, iterates over the kernel around that pixel, handling all kinds of nifty, boiler-plate indexing mathematics.

The idea is that one could program sliding-windows like this:

image.forEach([](Image &image, size_t row, size_t col) {
  //kr and lc specify which pixel is the center of the kernel
  image.forKernel<double>(row, col, kernel, kr, kc, [](Image &image, double w, size_t row, size_t col) {
    // w is the weight/coefficient of the kernel, row/col are the corresponding coordinates in the image.
    // process ...
  });
});

Problem

This provides a nice way to

The latter point is the problem, of course. I was hoping g++ would be able to optimize the lambda-functions out and inline all the code. This does not happen. Hence I created a minimal working example on 1D data:

#include <iostream>
#include <functional>

struct Data {
  size_t d_size;
  double *d_data;
  Data(size_t size) : d_size(size), d_data(new double[size]) {}
  ~Data() { delete[] d_data; }
  double &operator[](size_t i) { return d_data[i]; }


  inline void forEach(std::function<void(Data &, size_t)> f) {
    for (size_t index = 0; index != d_size; ++index)
      f(*this, index);
  }
};



int main() {
  Data im(50000000);
  im.forEach([](Data &im, size_t i) {
    im[i] = static_cast<double>(i);
  });

  double sum = 0;
  im.forEach([&sum](Data &im, size_t i) {
    sum += im[i];
  });

  std::cout << sum << '\n';
}

source: http://ideone.com/hviTwx

I'm guessing the compiler is not able to compile the code for forEach per lambda-function, as the lambda function is not a template variable. The good thing is that one can compile once and link to it more often with different lambda functions, but the bad thing is that it is slow.

Moreover, the situation discussed in the motivation already contains templates for the data type (double, int, ...), hence the 'good thing' is overruled anyway.

A fast way to implement the previous would be like this:

#include <iostream>
#include <functional>

struct Data {
  size_t d_size;
  double *d_data;
  Data(size_t size) : d_size(size), d_data(new double[size]) {}
  ~Data() { delete[] d_data; }
  double &operator[](size_t i) { return d_data[i]; }
};



int main() {
  size_t len = 50000000;
  Data im(len);
  for (size_t index = 0; index != len; ++index)
    im[index] = static_cast<double>(index);

  double sum = 0;
  for (size_t index = 0; index != len; ++index)
    sum += im[index];

  std::cout << sum << '\n';
}

source: http://ideone.com/UajMMz

It is about 8x faster, but also less readable, especially when we consider more complicated structures like images with kernels.

Question

Is there a way to provide the lambda function as a template argument, such that forEach is compiled for each call, and optimized for each specific instance of the lambda function? Can the lambda function be inlined somehow, since lambda functions are typically not recursive this should be trivial, but what is the syntax?

I found some related posts:

But they do not give a solution in the form of a minimal working example, and they do not discuss the possibility of inlining a lambda function. The answer to my question should do that: change the Data.forEach member function and it's call such that is as fast as possible / allows for as many running time optimizations (not optimizations at run time, but at compile time that decrease runtime) as possible.

Regarding the suggestion of forEveR

Thank you for creating that fix, it's a huge improvement yet still approximately 2x as slow:

Results:

herbert@machine ~ $ g++ -std=c++11 -Wall test0.cc -o test0
herbert@machine ~ $ g++ -std=c++11 -Wall test1.cc -o test1
herbert@machine ~ $ g++ -std=c++11 -Wall test2.cc -o test2
herbert@machine ~ $ time ./test0
1.25e+15

real    0m2.563s
user    0m2.541s
sys     0m0.024s
herbert@machine ~ $ time ./test1
1.25e+15

real    0m0.346s
user    0m0.320s
sys     0m0.026s
herbert@machine ~ $ time ./test2
1.25e+15

real    0m0.601s
user    0m0.575s
sys     0m0.026s
herbert@machine ~ $ 

I re-ran the code with -O2, which fixes the problem. runtimes of test1 and test2 ar now very similar. Thank you @stijn and @forEveR.

herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test0.cc -o test0
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test1.cc -o test1
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test2.cc -o test2

herbert@machine ~ $ time ./test0
1.25e+15

real    0m0.256s
user    0m0.229s
sys 0m0.028s
herbert@machine ~ $ time ./test1
1.25e+15

real    0m0.111s
user    0m0.078s
sys 0m0.033s
herbert@machine ~ $ time ./test2
1.25e+15

real    0m0.108s
user    0m0.076s
sys 0m0.032s
herbert@machine ~ $ 

Upvotes: 4

Views: 427

Answers (1)

ForEveR
ForEveR

Reputation: 55897

Problem is, that you use std::function, that actually use type-erasure and virtual calls. You can simply use template parameter, instead of std::function. Call of lambda function will be inlined, due n3376 5.1.2/5

The closure type for a lambda-expression has a public inline function call operator (13.5.4) whose param- eters and return type are described by the lambda-expression’s parameter-declaration-clause and trailing- return-type respectively

So, just simply write

  template<typename Function>
  inline void forEach(Function f) {
    for (size_t index = 0; index != d_size; ++index)
      f(*this, index);
  }

Live example

Upvotes: 6

Related Questions