Reputation: 333
I wonder why i get unexpected performances with these two pairs of obvious examples of recursion.
The same recursive function is faster inside a struct (rec2 VS rec1) and the same recursive template function is faster with a dummy parameter (rec4 VS rec3) !
Are the C++ functions faster with more arguments ?!
Here is the code tried :
#include <QDebug>
#include <QElapsedTimer>
constexpr std::size_t N = 28;
std::size_t counter = 0;
// non template function which take 1 argument
void rec1(std::size_t depth)
{
++counter;
if ( depth < N )
{
rec1(depth + 1);
rec1(depth + 1);
}
}
// non template member which take 2 arguments (implicit this)
struct A
{
void rec2(std::size_t depth)
{
++counter;
if ( depth < N )
{
rec2(depth + 1);
rec2(depth + 1);
}
}
};
// template function which take 0 argument
template <std::size_t D>
void rec3()
{
++counter;
rec3<D - 1>();
rec3<D - 1>();
}
template <>
void rec3<0>()
{
++counter;
}
// template function which take 1 (dummy) argument
struct Foo
{
int x;
};
template <std::size_t D>
void rec4(Foo x)
{
++counter;
rec4<D - 1>(x);
rec4<D - 1>(x);
}
template <>
void rec4<0>(Foo x)
{
++counter;
}
int main()
{
QElapsedTimer t;
t.start();
rec1(0);
qDebug() << "time1" << t.elapsed();
qDebug() << "counter" << counter;
counter = 0;
A a;
t.start();
a.rec2(0);
qDebug()<< "time2" << t.elapsed();
qDebug()<< "counter" << counter;
counter = 0;
t.start();
rec3<N>();
qDebug()<< "time3" << t.elapsed();
qDebug()<< "counter" << counter;
counter = 0;
t.start();
rec4<N>(Foo());
qDebug()<< "time4" << t.elapsed();
qDebug()<< "counter" << counter;
qDebug() << "fin";
return 0;
}
I get this output :
time1 976
counter 536870911
time2 341
counter 536870911
time3 241
counter 536870911
time4 201
counter 536870911
fin
I have : Windows 8.1/i7 3630QM/latest Qt chaintool/c++14 enabled
Upvotes: 2
Views: 108
Reputation: 32732
I was finally able to look at this on Visual Studio 2015 Community. Examining the disassembly of the compiled code, rec1 and rec2 are recursive. They are very similar in the generated code, and although rec2 has more instructions it runs slightly faster. rec3 and rec4 both generate a series of functions for all the different values of D in the template parameter, and in this case the compiler has eliminated many of the function calls, eliminated others, and added a larger value to count. (For example, rec4<10> just adds 2047 to count and returns.)
So the performance difference you see is mostly due to how the compiler optimizes each version, with slight differences in how the code flows thru the CPU also a factor.
My results (time measured in seconds), compiled with /Ox /O2:
time1 1.03411
counter 536870911
time2 0.970455
counter 536870911
time3 0.000866
counter 536870911
time4 0.000804
counter 536870911
Upvotes: 1