Reputation: 323
(Minor edit: My intention in asking this question is not about the algorithms themselves. I'm fully aware of the fast iterative solution using 3 local variables or an array of 3 elements. In fact, I deliberately designed both tests to have the smallest complexity difference I could come up with. What I'd like to know is whether implementing the same algorithm as the recursive one, attacking the same problem with a custom stack and iteration, can improve performance!)
When we learned about programming at school, we were usually told that iteration is generally more efficient than recursion, unless recursion offers a specific and elegant way to solve the problem.
So recently I decided to put this to a simple test. Since function calls are handled by the call stack, it should be possible to implement a custom stack to hold the required local variables and turn a recursive implementation into an iterative one. Below is an example of a Fibonacci number calculator based on recursion, and a supposedly equivalent algorithm using iteration, both written in C.
Recursive impl. (fibon_recu):
uint64_t calls = 0;

/* Calculate Fibonacci number using recursive algorithm. */
uint64_t fibonacci(uint8_t idx)
{
    calls++;
    return (idx <= 1) ? idx : (fibonacci(idx - 1) + fibonacci(idx - 2));
}
Iterative impl. (fibon_iter):
uint64_t loop_count;

/* Calculate Fibonacci number using stack-based method derived from recursive algorithm. */
uint64_t fibonacci(uint8_t idx)
{
    uint64_t ret = 0;
    uint8_t stack_val[ARR_STACK_SIZE], cache;
    uint16_t stack_size;

    loop_count = 0;

    // Push index into simulated stack
    stack_size = 1;
    *stack_val = idx;

    while(stack_size)
    {
        // Pop simulated stack top
        stack_size -= 1;
        cache = *(stack_val + stack_size);

        if(cache > 1)
        {
            // Push <index - 1> and <index - 2> into simulated stack
            *(stack_val + stack_size) = cache - 1;
            *(stack_val + stack_size + 1) = cache - 2;
            stack_size += 2;
        }
        else
        {
            ret += cache;
        }

        loop_count++;
    }

    return ret;
}
It's worth mentioning that the answer by templatetypedef says:
Depending on the implementation of the stack, that stack might end up needing to dynamically allocate memory in the heap, which is typically more expensive than setting up and tearing down stack frames. As always, if you have two algorithms and want to compare their empirical performance, it's probably best to just run the two against one another and see which ones win.
which is indeed what I observed on my testing device. I therefore used a static array to simulate the stack, as the array itself is allocated in the stack frame instead of on the heap. Accessing variables on the heap in this case makes the performance about 20-30 times worse on my device (figure not shown here).
Nonetheless, this answer by Nathaniel Ford suggested that iterative implementation with custom stacks can sometimes improve performance, which I consider being quite interesting.
Also, to make the comparison as fair as I could, I compiled both pieces of code with the -O0 flag to disable compiler optimizations.
Here fibon_recu stands for the recursive tests and fibon_iter for the iterative tests. For reference, in case this information is relevant, I put both the gcc version and basic device info on the figure.
<Sep. 7, 2021; 11:00 UTC>
Thanks to Ian Abbott for pointing out that using pointer access instead of indexing can improve performance when the -O0 flag is present. This brings the execution time of the iterative tests down to only slightly longer than that of the recursive tests.
<Sep. 7, 2021; 22:45 UTC>
Thanks to all the generous devs here for the insights and even detailed tests! I made some updates after reading the answers and marked the question as solved. However, please read all the answers below, as each of them offers a different but important perspective!
First, Eugene pointed out that the function calls themselves may still be optimized even with the -O0 flag, while the home-brew stack implementation is not. Thus -O0 is actually still not a fair test after all, as Lundin suggests.
Second, Jacon pointed out that tail recursion may not be replaced by the compiler up to -O1, while -O1 gives a significant performance improvement for the stack-based iterative test conducted by John.
The assembly output and the results suggested by these mentioned answers are listed below:
-O1
:
fibon_iter_O1.asm
fibon_recu_O1.asm
-O2
:
fibon_iter_O2.asm
fibon_recu_O2.asm
It turns out the compiler behavior mentioned by Eugene does hold in my test environment: the -O1 flag preserves all the recursive bl calls, while the -O2 flag optimizes away the tail bl call. The improvement observed by John can also be reproduced on my device, as shown in the results above: the -O1 flag makes iteration perform much better than recursion.
Thus, I'd guess the test results show that instruction complexity and recursive-call overhead are competing factors in performance. The -O0 flag gives no optimization for the home-brew stack at all, and the resulting extra complexity cancels out the advantages of iteration. The -O1 flag retains the recursive calls while optimizing the iterative implementation, granting the latter a performance boost. The -O2 flag removes the tail recursion and makes the recursive implementation run better than under -O1, making the competing factors apparent again.
Why does the seemingly equivalent iterative implementation still perform no better than the recursive one? Did I get anything obviously wrong, or is something actually hidden in plain sight?
Thanks!
Upvotes: 3
Views: 712
Reputation: 123448
What I'd like to know about is that if implementing an identical algorithm to the recursive one attacking the same problem in custom stacks and iteration can improve the performance!
This assumes the algorithms are actually identical. Your stack-based algorithm isn't identical to the recursive version - for one thing, you're performing a lot more assignments and tests in the stack-based version than the recursive version, which kills any gains you make by not calling the function recursively. Secondly, the recursive version is (most likely) passing arguments via registers, not the stack.
On my system, turning on level 1 optimization significantly improves the runtime performance of the stack-based version relative to the recursive version (to where the stack-based version takes much less time than the recursive version). On my system, going from -O0
to -O1
speeds up the recursive version by ~26% and the stack-based version by ~67%.
And for what it's worth, this is a naive iterative implementation of fibonacci:
unsigned long fib_it( int n )
{
    /**
     * Circular buffer to store the
     * intermediate values
     */
    unsigned long f[3] = { 0ul, 1ul, 1ul };

    if ( n <= 2 )
        return f[n];

    int i;
    for ( i = 3; i <= n; i++ )
        f[i % 3] = f[(i-1) % 3] + f[(i-2) % 3];

    return f[(i+2) % 3];
}
which absolutely spanks the recursive and stack-based versions in terms of both run time and memory usage as n
gets large (you basically just have to worry about overflow). As others have shown there are faster algorithms that minimize the number of intermediate calculations, and there is a closed-form version that's faster yet.
If a recursive algorithm is executing too slowly for you, you don't speed it up by re-implementing the recursiveness of it with your own stacks - you speed it up by using a non-recursive approach. As a definition of the Fibonacci function, F(n) = F(n-1) + F(n-2) is compact and easy to understand, but as an implementation of the function it is horribly inefficient because it computes the same values multiple times. The iterative version above computes each value exactly once.
What makes recursive functions like quicksort fast is that they dramatically reduce the size of the problem space with each call. Recursive fibonacci doesn't do that.
Edit
Some numbers. My test harness takes two arguments - the N'th number to compute, and the algorithm to use - s for your stack-based version, r for recursive, and i for my iterative version above. For small N the differences aren't too dramatic (compiled with -O1):
$ ./fib -n 5 -r
i F(i) clocks seconds
- ---- ------ -------
5 5 1 0.000001
$ ./fib -n 5 -s
i F(i) clocks seconds
- ---- ------ -------
5 5 2 0.000002
$ ./fib -n 5 -i
i F(i) clocks seconds
- ---- ------ -------
5 5 2 0.000002
As N gets larger, though, differences show up:
$ ./fib -n 40 -r
i F(i) clocks seconds
- ---- ------ -------
40 102334155 543262 0.543262
$ ./fib -n 40 -s
i F(i) clocks seconds
- ---- ------ -------
40 102334155 330824 0.330824
$ ./fib -n 40 -i
i F(i) clocks seconds
- ---- ------ -------
40 102334155 2 0.000002
Once you get past N=50 the run times for the recursive and stack-based versions get obnoxiously long, whereas the runtime for the iterative version doesn't change:
$ ./fib -n 50 -r
i F(i) clocks seconds
- ---- ------ -------
50 12586269025 66602759 66.602759
$ ./fib -n 50 -s
i F(i) clocks seconds
- ---- ------ -------
50 12586269025 39850376 39.850376
$ ./fib -n 50 -i
i F(i) clocks seconds
- ---- ------ -------
50 12586269025 2 0.000002
Now, this is how things behave on my system, and this is a 0th level analysis using very crude instrumentation (a couple of calls to clock).
Upvotes: 3
Reputation:
I compared two factorial functions with -O3 and -fno-inline, because with inlining both versions amount to the same thing! Even so, the recursive one does not call itself, but it somehow keeps an extra instruction:
recursive: sub and cmp:
11e0: | /-> 89 fa mov edx,edi
11e2: | | 83 ef 01 sub edi,0x1
11e5: | | 0f af c2 imul eax,edx
11e8: | | 83 ff 01 cmp edi,0x1
11eb: | \-- 75 f3 jne 11e0 <frec+0x10>
non-rec: tighter inner loop:
11c0: | /-> 44 0f af c0 imul r8d,eax
11c4: | | 83 e8 01 sub eax,0x1
11c7: | \-- 75 f7 jne 11c0 <fnonrec+0x10>
program:
#include <stdio.h>
#include <stdlib.h>

int fnonrec(int n) {
    int p = 1;
    while (n-- > 1)
        p *= n;
    return p;
}

int frec(int n) {
    if (n == 1)
        return 1;
    return (n * frec(n-1));
}

int main(void) {
    for (int i = 0; i < 10000; i++)
        //printf("%d\n", frec(rand()%1000 + 1));
        printf("%d\n", fnonrec(rand()%1000 + 1));
}
Both take 8 ms with output redirected to /dev/null; the recursive one executes 50% more instructions in the above (sloppy, overflowing) form.
Up to -O1, frec() calls itself recursively; without a "keep-recursion" compilation flag, a fair comparison is difficult. Only inlining can be controlled (-fno-inline), and inlining is linked to recursion, even though here the compiler inlines the function into itself. Removing recursion is not the same as inlining.

Upvotes: 1
Reputation: 2301
It's just more assembly and more stack access, as opposed to register-based parameter passing. You can easily see the differences between the first and second algorithms. More assembly code does not necessarily translate to slower speed, but in this case it does.
Contemporary mainstream ISAs and ABIs have built-in support for implementing function calls, which is in effect even at -O0. There is no such support for your homebrew stack structure.
Upvotes: 2
Reputation: 213523
Why does the seemingly equivalent iterative implementation still perform no better than the recursive one? Did I get anything obviously wrong, or is something actually hidden in plain sight?
Because benchmarking code with optimization disabled is utter nonsense. gcc -O0 disables all optimization (but supposedly improves compilation speed). You have effectively disabled loop unrolling, tail-call optimization and everything else that is extremely relevant when comparing the efficiency of these two snippets.
Also, to make the comparison as fair as what I could achieve, I compiled both pieces of code with -O0 flag to disable compiler optimization(s).
No, you made it as unfair as you could achieve.
You essentially took two sports cars, removed the engines, then manually pushed each of them down two different slopes to see which car is fastest...
Upvotes: 1