Reputation: 3823
I am checking the difference between two implementations of gradient descent; my guess was that after compiler optimization both versions of the algorithm would be equivalent.
To my surprise, the recursive version was significantly faster. I haven't ruled out an actual defect in either version, or even in the way I am measuring time. Can you give me some insights, please?
This is my code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <stdint.h>
double f(double x)
{
return 2*x;
}
double descgrad(double xo, double xnew, double eps, double precision)
{
// printf("step ... x:%f Xp:%f, delta:%f\n",xo,xnew,fabs(xnew - xo));
if (fabs(xnew - xo) < precision)
{
return xnew;
}
else
{
descgrad(xnew, xnew - eps*f(xnew), eps, precision);
}
}
double descgraditer(double xo, double xnew, double eps, double precision)
{
double Xo = xo;
double Xn = xnew;
while(fabs(Xn-Xo) > precision)
{
//printf("step ... x:%f Xp:%f, delta:%f\n",Xo,Xn,fabs(Xn - Xo));
Xo = Xn;
Xn = Xo - eps * f(Xo);
}
return Xn;
}
int64_t timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
return ((timeA_p->tv_sec * 1000000000) + timeA_p->tv_nsec) -
((timeB_p->tv_sec * 1000000000) + timeB_p->tv_nsec);
}
int main()
{
struct timespec s1, e1, s2, e2;
clock_gettime(CLOCK_MONOTONIC, &s1);
printf("Minimum : %f\n",descgraditer(100,99,0.01,0.00001));
clock_gettime(CLOCK_MONOTONIC, &e1);
clock_gettime(CLOCK_MONOTONIC, &s2);
printf("Minimum : %f\n",descgrad(100,99,0.01,0.00001));
clock_gettime(CLOCK_MONOTONIC, &e2);
uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;
printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));
printf("End. \n");
}
I am compiling with gcc 4.5.2 on Ubuntu 11.04 with the following options: gcc grad.c -O3 -lrt -o dg
The output of my code is:
Minimum : 0.000487
Minimum : 0.000487
time_iter:127 ms, time_rec:19 ms, ratio (dif1/dif2) :6.68421
End.
I read a thread that also asked about a recursive version of an algorithm being faster than the iterative one. The explanation there was that the recursive version used the stack while the other version used some vectors, so heap access was slowing down the iterative version. But in this case (to the best of my understanding) I am only using the stack in both cases.
Am I missing something? Anything obvious that I am not seeing? Is my way of measuring time wrong? Any insights?
EDIT: Mystery solved in a comment. As @TonyK said, the initialization of printf was slowing down the first execution. Sorry that I missed that obvious thing.
BTW, the code compiles without warnings. I don't think the "return descgrad(.." is necessary, since the stop condition is reached first.
Upvotes: 7
Views: 1404
Reputation: 8272
The accepted answer is incorrect.
There IS a difference in the runtimes of the iterative function and the recursive function, and the reason is the compiler optimization -foptimize-sibling-calls added by -O3.
First, the code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <stdint.h>
double descgrad(double xo, double xnew, double eps, double precision){
if (fabs(xnew - xo) <= precision) {
return xnew;
} else {
return descgrad(xnew, xnew - eps*2*xnew, eps, precision);
}
}
double descgraditer(double xo, double xnew, double eps, double precision){
double Xo = xo;
double Xn = xnew;
while(fabs(Xn-Xo) > precision){
Xo = Xn;
Xn = Xo - eps * 2*Xo;
}
return Xn;
}
int main() {
time_t s1, e1, d1, s2, e2, d2;
int i, iter = 10000000;
double a1, a2;
s1 = time(NULL);
for( i = 0; i < iter; i++ ){
a1 = descgraditer(100,99,0.01,0.00001);
}
e1 = time(NULL);
d1 = difftime( e1, s1 );
s2 = time(NULL);
for( i = 0; i < iter; i++ ){
a2 = descgrad(100,99,0.01,0.00001);
}
e2 = time(NULL);
d2 = difftime( e2, s2 );
printf( "time_iter: %d s, time_rec: %d s, ratio (iter/rec): %f\n", (int)d1, (int)d2, (double)d1 / d2 );
printf( "return values: %f, %f\n", a1, a2 );
}
Previous posts were correct in pointing out that you need to iterate many times in order to average away environment interference. Given that, I discarded your differencing function in favor of time.h's difftime function on time_t data, since over many iterations anything finer than a second is meaningless. In addition, I removed the printfs from the benchmark.
I also fixed a bug in the recursive implementation. Your original code's if-statement checked for fabs(xnew-xo) < precision, which is incorrect (or at least different from the iterative implementation). The iterative version loops while fabs() > precision, therefore the recursive function should not recurse when fabs <= precision. Adding 'iteration' counters to both functions confirms that this fix makes the functions logically equivalent.
Compiling and running with -O3:
$ gcc test.c -O3 -lrt -o dg
$ ./dg
time_iter: 34 s, time_rec: 0 s, ratio (iter/rec): inf
return values: 0.000487, 0.000487
Compiling and running without -O3:
$ gcc test.c -lrt -o dg
$ ./dg
time_iter: 54 s, time_rec: 90 s, ratio (iter/rec): 0.600000
return values: 0.000487, 0.000487
Under no optimization, iteration performs BETTER than recursion.
Under -O3 optimization, however, recursion runs ten million iterations in under a second. The reason is that -O3 adds -foptimize-sibling-calls, which optimizes sibling and tail-recursive calls, and that is exactly what your recursive function takes advantage of.
To be sure, I ran it with all -O3 optimizations EXCEPT -foptimize-sibling-calls:
$ gcc test.c -lrt -o dg -fcprop-registers -fdefer-pop -fdelayed-branch -fguess-branch-probability -fif-conversion2 -fif-conversion -fipa-pure-const -fipa-reference -fmerge-constants -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time -fthread-jumps -falign-functions -falign-jumps -falign-loops -falign-labels -fcaller-saves -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse -fgcse-lm -fpeephole2 -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2 -fstrict-aliasing -ftree-pre -ftree-vrp -finline-functions -funswitch-loops -fgcse-after-reload -ftree-vectorize
$ ./dg
time_iter: 55 s, time_rec: 89 s, ratio (iter/rec): 0.617978
return values: 0.000487, 0.000487
Recursion, without the tail-call optimization, performs worse than iteration, just as when compiled with no optimization. You can read more about these options in GCC's optimization documentation.
EDIT:
As a verification of correctness, I updated my code to include the return values. I also set two global counters to 0 and incremented one on each recursion and the other on each iteration to verify correct output:
int a = 0;
int b = 0;
double descgrad(double xo, double xnew, double eps, double precision){
if (fabs(xnew - xo) <= precision) {
return xnew;
} else {
a++;
return descgrad(xnew, xnew - eps*2*xnew, eps, precision);
}
}
double descgraditer(double xo, double xnew, double eps, double precision){
double Xo = xo;
double Xn = xnew;
while(fabs(Xn-Xo) > precision){
b++;
Xo = Xn;
Xn = Xo - eps * 2*Xo;
}
return Xn;
}
int main() {
time_t s1, e1, d1, s2, e2, d2;
int i, iter = 10000000;
double a1, a2;
s1 = time(NULL);
for( i = 0; i < iter; i++ ){
a1 = descgraditer(100,99,0.01,0.00001);
}
e1 = time(NULL);
d1 = difftime( e1, s1 );
s2 = time(NULL);
for( i = 0; i < iter; i++ ){
a2 = descgrad(100,99,0.01,0.00001);
}
e2 = time(NULL);
d2 = difftime( e2, s2 );
printf( "time_iter: %d s, time_rec: %d s, ratio (iter/rec): %f\n", (int)d1, (int)d2, (double)d1 / d2 );
printf( "return values: %f, %f\n", a1, a2 );
printf( "number of recurs/iters: %d, %d\n", a, b );
}
The output:
$ gcc optimization.c -O3 -lrt -o dg
$ ./dg
time_iter: 41 s, time_rec: 24 s, ratio (iter/rec): 1.708333
return values: 0.000487, 0.000487
number of recurs/iters: 1755032704, 1755032704
The answers are the same, and the repetition counts are the same.
It is also interesting to note that fetching/incrementing the counters has a considerable impact on the tail-call optimization; however, recursion still outperforms iteration.
Upvotes: 2
Reputation: 153909
First, clock_gettime seems to be measuring wall-clock time, not execution time. Second, the actual time you're measuring is the execution time of printf, not the execution time of your function. And third, the first time you call printf, it isn't in memory, so it has to be paged in, involving significant disk IO. Invert the order in which you run the tests, and the results will invert as well.
If you want to get meaningful measurements, you have to make sure that only the code under test is inside the timed region, and that it runs long enough to dwarf the timer's resolution and scheduling noise.
Upvotes: 1
Reputation: 9422
For starters, you were including a printf in the time you were trying to measure. It's always a giant no-no because it can, and most likely will, suspend your process while doing the console output. Actually doing ANY system call can completely throw off time measurements like these.
And secondly, as someone else mentioned, on this short of a sampling period, scheduler interrupts can have a huge impact.
This is not perfect, but try this for your main and you'll see there's actually very little difference. As you increase the loop count, the ratio approaches 1.0.
#define LOOPCOUNT 100000
int main()
{
struct timespec s1, e1, s2, e2;
int i;
clock_gettime(CLOCK_MONOTONIC, &s1);
for(i=0; i<LOOPCOUNT; i++)
{
descgraditer(100,99,0.01,0.00001);
}
clock_gettime(CLOCK_MONOTONIC, &e1);
clock_gettime(CLOCK_MONOTONIC, &s2);
for(i=0; i<LOOPCOUNT; i++)
{
descgrad(100,99,0.01,0.00001);
}
clock_gettime(CLOCK_MONOTONIC, &e2);
uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;
printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));
printf("End. \n");
}
EDIT: After looking at the disassembled output using objdump -dS, I noticed a few things:
With -O3 optimization, the above code optimizes the function call away completely. However, it does still produce code for the two functions and neither is actually recursive.
Secondly, with -O0, so that the resulting code is actually recursive, the recursive version is dramatically slower. My guess is that the call stack forces variables into memory, whereas the iterative version works entirely out of registers and/or cache.
Upvotes: 2
Reputation: 22089
I've compiled and run your code locally. Moving the printf outside of the timed block makes both versions execute in ~5ms every time.
So the central mistake in your timing is that you measure the complex beast of printf, whose runtime dwarfs the code you are actually trying to measure.
My main() function now looks like this:
int main() {
struct timespec s1, e1, s2, e2;
double d = 0.0;
clock_gettime(CLOCK_MONOTONIC, &s1);
d = descgraditer(100,99,0.01,0.00001);
clock_gettime(CLOCK_MONOTONIC, &e1);
printf("Minimum : %f\n", d);
clock_gettime(CLOCK_MONOTONIC, &s2);
d = descgrad(100,99,0.01,0.00001);
clock_gettime(CLOCK_MONOTONIC, &e2);
printf("Minimum : %f\n",d);
uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;
printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));
printf("End. \n");
}
Upvotes: 10
Reputation: 36049
Is my way of measuring time wrong?
Yes. In the short timespans you are measuring, the scheduler can have a massive impact on your program. You need to either make your test much longer to average such differences out, or use CLOCK_PROCESS_CPUTIME_ID instead to measure the CPU time used by your process.
Upvotes: 5
Reputation: 65264
In many cases on modern hardware cache misses are the limiting factor of performance for small loop constructs. A recursive implementation is less likely to create cache misses on the instruction path.
Upvotes: 0
Reputation: 22089
For one thing, your recursive step misses a return:
double descgrad(double xo, double xnew, double eps, double precision)
{
if (fabs(xnew - xo) < precision)
return xnew;
else
descgrad(xnew, xnew - eps*f(xnew), eps, precision);
}
Should be:
double descgrad(double xo, double xnew, double eps, double precision)
{
if (fabs(xnew - xo) < precision)
return xnew;
else
return descgrad(xnew, xnew - eps*f(xnew), eps, precision);
}
This oversight causes the return value of descgrad to be undefined, so the compiler barely has to generate code for it at all ;)
Upvotes: 3