Reputation: 5645
I created a header file which wraps Matlab's mex functionality in c++11 classes; especially for MxNxC images. Two functions I created are forEach, which iterates over each pixel in the image, and also a forKernel, which given a kernel and pixel in the image, iterates over the kernel around that pixel, handling all kinds of nifty, boiler-plate indexing mathematics.
The idea is that one could program sliding-windows like this:
image.forEach([](Image &image, size_t row, size_t col) {
//kr and lc specify which pixel is the center of the kernel
image.forKernel<double>(row, col, kernel, kr, kc, [](Image &image, double w, size_t row, size_t col) {
// w is the weight/coefficient of the kernel, row/col are the corresponding coordinates in the image.
// process ...
});
});
This provides a nice way to
The latter point is the problem, of course. I was hoping g++ would be able to optimize the lambda-functions out and inline all the code. This does not happen. Hence I created a minimal working example on 1D data:
#include <iostream>
#include <functional>
struct Data {
size_t d_size;
double *d_data;
Data(size_t size) : d_size(size), d_data(new double[size]) {}
~Data() { delete[] d_data; }
double &operator[](size_t i) { return d_data[i]; }
inline void forEach(std::function<void(Data &, size_t)> f) {
for (size_t index = 0; index != d_size; ++index)
f(*this, index);
}
};
int main() {
Data im(50000000);
im.forEach([](Data &im, size_t i) {
im[i] = static_cast<double>(i);
});
double sum = 0;
im.forEach([&sum](Data &im, size_t i) {
sum += im[i];
});
std::cout << sum << '\n';
}
source: http://ideone.com/hviTwx
I'm guessing the compiler is not able to compile the code for forEach per lambda-function, as the lambda function is not a template variable. The good thing is that one can compile once and link to it more often with different lambda functions, but the bad thing is that it is slow.
Moreover, the situation discussed in the motivation already contains templates for the data type (double, int, ...), hence the 'good thing' is overruled anyway.
A fast way to implement the previous would be like this:
#include <iostream>
#include <functional>
struct Data {
size_t d_size;
double *d_data;
Data(size_t size) : d_size(size), d_data(new double[size]) {}
~Data() { delete[] d_data; }
double &operator[](size_t i) { return d_data[i]; }
};
int main() {
size_t len = 50000000;
Data im(len);
for (size_t index = 0; index != len; ++index)
im[index] = static_cast<double>(index);
double sum = 0;
for (size_t index = 0; index != len; ++index)
sum += im[index];
std::cout << sum << '\n';
}
source: http://ideone.com/UajMMz
It is about 8x faster, but also less readable, especially when we consider more complicated structures like images with kernels.
Is there a way to provide the lambda function as a template argument, such that forEach is compiled for each call, and optimized for each specific instance of the lambda function? Can the lambda function be inlined somehow, since lambda functions are typically not recursive this should be trivial, but what is the syntax?
I found some related posts:
But they do not give a solution in the form of a minimal working example, and they do not discuss the possibility of inlining a lambda function. The answer to my question should do that: change the Data.forEach member function and it's call such that is as fast as possible / allows for as many running time optimizations (not optimizations at run time, but at compile time that decrease runtime) as possible.
Thank you for creating that fix, it's a huge improvement yet still approximately 2x as slow:
Results:
herbert@machine ~ $ g++ -std=c++11 -Wall test0.cc -o test0
herbert@machine ~ $ g++ -std=c++11 -Wall test1.cc -o test1
herbert@machine ~ $ g++ -std=c++11 -Wall test2.cc -o test2
herbert@machine ~ $ time ./test0
1.25e+15
real 0m2.563s
user 0m2.541s
sys 0m0.024s
herbert@machine ~ $ time ./test1
1.25e+15
real 0m0.346s
user 0m0.320s
sys 0m0.026s
herbert@machine ~ $ time ./test2
1.25e+15
real 0m0.601s
user 0m0.575s
sys 0m0.026s
herbert@machine ~ $
I re-ran the code with -O2, which fixes the problem. runtimes of test1 and test2 ar now very similar. Thank you @stijn and @forEveR.
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test0.cc -o test0
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test1.cc -o test1
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test2.cc -o test2
herbert@machine ~ $ time ./test0
1.25e+15
real 0m0.256s
user 0m0.229s
sys 0m0.028s
herbert@machine ~ $ time ./test1
1.25e+15
real 0m0.111s
user 0m0.078s
sys 0m0.033s
herbert@machine ~ $ time ./test2
1.25e+15
real 0m0.108s
user 0m0.076s
sys 0m0.032s
herbert@machine ~ $
Upvotes: 4
Views: 427
Reputation: 55897
Problem is, that you use std::function
, that actually use type-erasure
and virtual calls.
You can simply use template parameter, instead of std::function
. Call of lambda function will be inlined, due n3376 5.1.2/5
The closure type for a lambda-expression has a public inline function call operator (13.5.4) whose param- eters and return type are described by the lambda-expression’s parameter-declaration-clause and trailing- return-type respectively
So, just simply write
template<typename Function>
inline void forEach(Function f) {
for (size_t index = 0; index != d_size; ++index)
f(*this, index);
}
Upvotes: 6