Pedrom

Reputation: 3823

Why would a recursive version of a function be faster than an iterative one in C?

I am checking the difference between two implementations of gradient descent. My guess was that, after compiler optimization, both versions of the algorithm would be equivalent.

To my surprise, the recursive version was significantly faster. I haven't ruled out an actual defect in either of the versions, or even in the way I am measuring time. Can you give me some insight, please?

This is my code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <stdint.h>

double f(double x)
{
        return 2*x;
}

double descgrad(double xo, double xnew, double eps, double precision)
{
//      printf("step ... x:%f Xp:%f, delta:%f\n",xo,xnew,fabs(xnew - xo));

        if (fabs(xnew - xo) < precision)
        {
                return xnew;
        }
        else
        {
                descgrad(xnew, xnew - eps*f(xnew), eps, precision);
        }
}

double descgraditer(double xo, double xnew, double eps, double precision)
{
        double Xo = xo;
        double Xn = xnew;

        while(fabs(Xn-Xo) > precision)
        {
                //printf("step ... x:%f Xp:%f, delta:%f\n",Xo,Xn,fabs(Xn - Xo));
                Xo = Xn;
                Xn = Xo - eps * f(Xo);
        }

        return Xn;
}

int64_t timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
  return ((timeA_p->tv_sec * 1000000000) + timeA_p->tv_nsec) -
           ((timeB_p->tv_sec * 1000000000) + timeB_p->tv_nsec);
}

int main()
{
        struct timespec s1, e1, s2, e2;

        clock_gettime(CLOCK_MONOTONIC, &s1);
        printf("Minimum : %f\n",descgraditer(100,99,0.01,0.00001));
        clock_gettime(CLOCK_MONOTONIC, &e1);

        clock_gettime(CLOCK_MONOTONIC, &s2);
        printf("Minimum : %f\n",descgrad(100,99,0.01,0.00001));
        clock_gettime(CLOCK_MONOTONIC, &e2);

        uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
        uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;

        printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));

        printf("End. \n");
}

I am compiling with gcc 4.5.2 on Ubuntu 11.04 with the following options: gcc grad.c -O3 -lrt -o dg

The output of my code is:

Minimum : 0.000487
Minimum : 0.000487
time_iter:127 ms, time_rec:19 ms, ratio (dif1/dif2) :6.68421
End.

I read a thread that also asked about a recursive version of an algorithm being faster than the iterative one. The explanation there was that the recursive version used the stack while the other version used some vectors, so heap access was slowing down the iterative version. But in this case (to the best of my understanding) I am just using the stack in both cases.

Am I missing something? Anything obvious that I am not seeing? Is my way of measuring time wrong? Any insights?

EDIT: Mystery solved in a comment. As @TonyK said, the first call to printf was slowing down the first measurement. Sorry that I missed such an obvious thing.

BTW, the code compiles just fine, without warnings. I don't think the "return descgrad(.." is necessary, since the stop condition is reached before it.

Upvotes: 7

Views: 1404

Answers (7)

Christopher Neylan

Reputation: 8272

The accepted answer is incorrect.

There IS a difference in the runtimes of the iterative function and the recursive function, and the reason is the compiler optimization -foptimize-sibling-calls, added by -O3.

First, the code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <stdint.h>

double descgrad(double xo, double xnew, double eps, double precision){
        if (fabs(xnew - xo) <= precision) {
                return xnew;
        } else {
                return descgrad(xnew, xnew - eps*2*xnew, eps, precision);
        }
}

double descgraditer(double xo, double xnew, double eps, double precision){
        double Xo = xo;
        double Xn = xnew;

        while(fabs(Xn-Xo) > precision){
                Xo = Xn;
                Xn = Xo - eps * 2*Xo;
        }
        return Xn;
}

int main() {
        time_t s1, e1, d1, s2, e2, d2;
        int i, iter = 10000000;
        double a1, a2;

        s1 = time(NULL);
        for( i = 0; i < iter; i++ ){
            a1 = descgraditer(100,99,0.01,0.00001);
        }
        e1 = time(NULL);
        d1 = difftime( e1, s1 );

        s2 = time(NULL);
        for( i = 0; i < iter; i++ ){
            a2 = descgrad(100,99,0.01,0.00001);
        }
        e2 = time(NULL);
        d2 = difftime( e2, s2 );

    printf( "time_iter: %ld s, time_rec: %ld s, ratio (iter/rec): %f\n", (long)d1, (long)d2, (double)d1 / (double)d2 );
    printf( "return values: %f, %f\n", a1, a2 );
}

Previous posts were correct in pointing out that you need to iterate many times in order to average away environmental interference. Given that, I discarded your differencing function in favor of time.h's difftime function on time_t data, since over many iterations anything finer than a second is meaningless. In addition, I removed the printfs from the benchmark.

I also fixed a bug in the recursive implementation. Your original code's if-statement checked for fabs(xnew - xo) < precision, which is incorrect (or at least different from the iterative implementation). The iterative version loops while fabs() > precision, so the recursive function should stop recursing when fabs() <= precision. Adding 'iteration' counters to both functions confirms that this fix makes the two functions logically equivalent.

Compiling and running with -O3:

$ gcc test.c -O3 -lrt -o dg
$ ./dg
time_iter: 34 s, time_rec: 0 s, ratio (iter/rec): inf
return values: 0.000487, 0.000487

Compiling and running without -O3

$ gcc test.c -lrt -o dg
$ ./dg
time_iter: 54 s, time_rec: 90 s, ratio (iter/rec): 0.600000
return values: 0.000487, 0.000487

Under no optimization, iteration performs BETTER than recursion.

Under -O3 optimization, however, recursion runs ten million iterations in under a second. The reason is that -O3 adds -foptimize-sibling-calls, which optimizes sibling and tail-recursive calls, and a tail call is exactly what your recursive function makes.

To be sure, I ran it with all -O3 optimizations EXCEPT -foptimize-sibling-calls:

$ gcc test.c -lrt -o dg  -fcprop-registers  -fdefer-pop -fdelayed-branch  -fguess-branch-probability -fif-conversion2 -fif-conversion -fipa-pure-const -fipa-reference -fmerge-constants   -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time -fthread-jumps -falign-functions  -falign-jumps -falign-loops  -falign-labels -fcaller-saves -fcrossjumping -fcse-follow-jumps  -fcse-skip-blocks -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse  -fgcse-lm  -fpeephole2 -fregmove -freorder-blocks  -freorder-functions -frerun-cse-after-loop  -fsched-interblock  -fsched-spec -fschedule-insns  -fschedule-insns2 -fstrict-aliasing  -ftree-pre -ftree-vrp -finline-functions -funswitch-loops  -fgcse-after-reload -ftree-vectorize
$ ./dg
time_iter: 55 s, time_rec: 89 s, ratio (iter/rec): 0.617978
return values: 0.000487, 0.000487

Recursion, without the tail-call optimization, performs worse than iteration, just as when compiled with no optimization at all. See the GCC documentation on optimization options for details.
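
To see why the tail call becomes so cheap: the compiler can rewrite a tail-recursive call as a jump back to the top of the function, reusing the same stack frame. Here is a hand-written rough equivalent of what the optimizer produces (a sketch, not the actual generated code; `descgrad_tco` is my name for it):

```c
#include <math.h>

/* Hand-rolled equivalent of the tail-call-optimized descgrad:
 * the recursive call becomes parameter reassignment plus a loop,
 * so no new stack frame is ever pushed. */
double descgrad_tco(double xo, double xnew, double eps, double precision)
{
    for (;;) {
        if (fabs(xnew - xo) <= precision)
            return xnew;
        /* the "recursive call" just rebinds the parameters and loops */
        double next = xnew - eps * 2 * xnew;
        xo = xnew;
        xnew = next;
    }
}
```

This is why the optimized recursive version behaves, and performs, like the iterative one rather than like true recursion.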

EDIT:

As a verification of correctness, I updated my code to include the return values. I also added two file-scope counters, initialized to 0 and incremented on each recursion and iteration respectively, to verify that both functions do the same work:

int a = 0;
int b = 0;

double descgrad(double xo, double xnew, double eps, double precision){
        if (fabs(xnew - xo) <= precision) {
                return xnew;
        } else {
                a++;
                return descgrad(xnew, xnew - eps*2*xnew, eps, precision);
        }
}

double descgraditer(double xo, double xnew, double eps, double precision){
        double Xo = xo;
        double Xn = xnew;

        while(fabs(Xn-Xo) > precision){
                b++;
                Xo = Xn;
                Xn = Xo - eps * 2*Xo;
        }
        return Xn;
}

int main() {
    time_t s1, e1, d1, s2, e2, d2;
    int i, iter = 10000000;
    double a1, a2;

    s1 = time(NULL);
    for( i = 0; i < iter; i++ ){
        a1 = descgraditer(100,99,0.01,0.00001);
    }
    e1 = time(NULL);
    d1 = difftime( e1, s1 );

    s2 = time(NULL);
    for( i = 0; i < iter; i++ ){
        a2 = descgrad(100,99,0.01,0.00001);
    }
    e2 = time(NULL);
    d2 = difftime( e2, s2 );

    printf( "time_iter: %ld s, time_rec: %ld s, ratio (iter/rec): %f\n", (long)d1, (long)d2, (double)d1 / (double)d2 );
    printf( "return values: %f, %f\n", a1, a2 );
    printf( "number of recurs/iters: %d, %d\n", a, b );
}

The output:

$ gcc optimization.c -O3 -lrt -o dg
$ ./dg
time_iter: 41 s, time_rec: 24 s, ratio (iter/rec): 1.708333
return values: 0.000487, 0.000487
number of recurs/iters: 1755032704, 1755032704

The answers are the same, and the repetition is the same.

Also interesting to note: fetching and incrementing the counters has a considerable impact on the tail-call-optimized version, yet recursion still outperforms iteration.

Upvotes: 2

James Kanze

Reputation: 153909

First, clock_gettime seems to be measuring wall-clock time, not execution time. Second, the actual time you're measuring is largely the execution time of printf, not the execution time of your function. And third, the first time you call printf, it isn't in memory, so it has to be paged in, involving significant disk I/O. Reverse the order in which you run the tests, and the results will reverse as well.

If you want to get some significant measurements, you have to make sure that

  1. only the code you want to measure is inside the measurement loops, or at least the additional code is minimal compared to what you're measuring,
  2. you do something with the results, so that the compiler can't optimize all of the code out (not a problem in your tests),
  3. you execute the code to be measured a large number of times, taking the average,
  4. you measure CPU time, and not wall clock time, and
  5. you make sure that everything is paged in before starting the measurements.

Upvotes: 1

Brian McFarland

Reputation: 9422

For starters, you were including a printf in the time you were trying to measure. That's always a giant no-no, because it can, and most likely will, suspend your process while doing console output. In fact, making ANY system call can completely throw off time measurements like these.

And secondly, as someone else mentioned, on this short of a sampling period, scheduler interrupts can have a huge impact.

This is not perfect, but try this for your main and you'll see there's actually very little difference. As you increase the loop count, the ratio approaches 1.0.

#define LOOPCOUNT 100000
int main() 
{
    struct timespec s1, e1, s2, e2;
    int i;
    clock_gettime(CLOCK_MONOTONIC, &s1);
    for(i=0; i<LOOPCOUNT; i++)
    {
      descgraditer(100,99,0.01,0.00001);
    }
    clock_gettime(CLOCK_MONOTONIC, &e1);

    clock_gettime(CLOCK_MONOTONIC, &s2);
    for(i=0; i<LOOPCOUNT; i++)
    {
      descgrad(100,99,0.01,0.00001);
    }
    clock_gettime(CLOCK_MONOTONIC, &e2);

    uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
    uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;

    printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));

    printf("End. \n");

}

EDIT: After looking at the disassembled output using objdump -dS, I noticed a few things. With -O3 optimization, the above code optimizes the function calls away completely; it still emits code for the two functions, but neither is actually recursive.

Secondly, with -O0, where the resulting code is actually recursive, the recursive version is orders of magnitude slower. My guess is that the call stack forces variables into memory, whereas the iterative version works entirely out of registers and/or cache.

Upvotes: 2

Magnus Hoff

Reputation: 22089

I've compiled and run your code locally. Moving the printf outside of the timed block makes both versions execute in ~5 ms every time.

So the central mistake in your timing is that you are measuring the complex beast that is printf, whose runtime dwarfs the code you are actually trying to measure.

My main()-function now looks like this:

int main() {
    struct timespec s1, e1, s2, e2;

    double d = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &s1);
    d = descgraditer(100,99,0.01,0.00001);
    clock_gettime(CLOCK_MONOTONIC, &e1);
    printf("Minimum : %f\n", d);

    clock_gettime(CLOCK_MONOTONIC, &s2);
    d = descgrad(100,99,0.01,0.00001);
    clock_gettime(CLOCK_MONOTONIC, &e2);
    printf("Minimum : %f\n",d);

    uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
    uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;

    printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));

    printf("End. \n");
}

Upvotes: 10

thiton

Reputation: 36049

Is my way of measuring time wrong?

Yes. Over the short timespans you are measuring, the scheduler can have a massive impact on your program. You need to either make your test much longer to average such differences out, or use CLOCK_PROCESS_CPUTIME_ID instead, to measure the CPU time used by your process.

Upvotes: 5

Eugen Rieck

Reputation: 65264

In many cases on modern hardware cache misses are the limiting factor of performance for small loop constructs. A recursive implementation is less likely to create cache misses on the instruction path.

Upvotes: 0

Magnus Hoff

Reputation: 22089

For one thing, your recursive step misses a return:

double descgrad(double xo, double xnew, double eps, double precision)
{
    if (fabs(xnew - xo) < precision)
        return xnew;
    else
        descgrad(xnew, xnew - eps*f(xnew), eps, precision);
}

Should be:

double descgrad(double xo, double xnew, double eps, double precision)
{
    if (fabs(xnew - xo) < precision)
        return xnew;
    else
        return descgrad(xnew, xnew - eps*f(xnew), eps, precision);
}

This oversight causes the return value of descgrad to be undefined, so the compiler barely has to generate code for it at all ;)

Upvotes: 3
