MalphasWats

Reputation: 3295

Why is this nested for loop so much slower than unrolling the same code?

I'm building a little toy console using the ATtiny85 and a 128x64px OLED. In my initial build I used the built-in shiftOut() and digitalWrite() functions to shift display data out to the screen controller.

This netted me ~5fps, which was a little disappointing.

I wrote my own function that uses direct port manipulation to send the data and got a dramatic increase in speed ~23fps, which is not bad. Here's that function:

void shift_out_block(const byte *block)
{
    byte b;
    for (byte i = 0; i < 8; i++)  
    {
        b = pgm_read_byte(block+i);

        for (byte j=0 ; j < 8 ; j++)
        {
            if ( !!( b & (1 << j)) )
            {
                PORTB |= 1 << SDA;
            }
            else
            {
                PORTB &= ~(1 << SDA);
            }

            PORTB |= 1 << SCL; // HIGH
            PORTB &= ~(1 << SCL); // LOW
        }
    }
}

23fps is ok, but it's no 30 or even 60fps (I'd actually have left it here if it was 24fps, but odd numbers...).

I understand why removing the library calls and manipulating ports directly improved things so much - the libraries are written to work on all sorts of different MCUs.

I vaguely remembered loop unrolling being a thing, so I unrolled the inner for loop:

void shift_out_block(const byte *block)
{
    byte b;
    for (byte i = 0; i < 8; i++)  
    {
        b = pgm_read_byte(block+i);

        if ( !!( b & (1 << 0)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 1)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 2)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 3)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 4)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 5)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 6)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 7)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW
    }
}

No effort at all - copy, paste, seven times. It gives me nearly 75fps: the original function executes in ~42ms, the new ugly one takes only ~13ms.

Out of interest, I broke the send-bit part out into a separate function and called that 8 times:

void shift_out_bit(bool bit)
{
    if ( bit )
    {
        PORTB |= 1 << SDA;
    }
    else
    {
        PORTB &= ~(1 << SDA);
    }

    PORTB |= 1 << SCL; // HIGH
    PORTB &= ~(1 << SCL); // LOW
}

void shift_out_block(const byte *block)
{
    byte b;
    for (byte i = 0; i < 8; i++)  
    {
        b = pgm_read_byte(block+i);

        shift_out_bit( !!( b & (1 << 0)) );
        shift_out_bit( !!( b & (1 << 1)) );
        shift_out_bit( !!( b & (1 << 2)) );
        shift_out_bit( !!( b & (1 << 3)) );
        shift_out_bit( !!( b & (1 << 4)) );
        shift_out_bit( !!( b & (1 << 5)) );
        shift_out_bit( !!( b & (1 << 6)) );
        shift_out_bit( !!( b & (1 << 7)) );
    }
}

~22ms to execute, or 45.4545454545 fps, which isn't even nearly a nice number.

I'm not a C programmer by any stretch of the imagination - Python is my usual haunt (I did initially start this project in Python/RPi, but very quickly gave up on that!).

Why is such a core language feature so much slower in this situation? As my project becomes more complex, what other optimisations should I be thinking about?

Upvotes: 1

Views: 128

Answers (1)

Sergey Kalinichenko

Reputation: 726987

Consider the "payload" operations done inside of your innermost loop:

  • Check of a specific bit in b
  • Conditional jump to deal with PORTB |= 1 << SDA vs. PORTB &= ~(1 << SDA)
  • Three operations on PORTB

That is all the unrolled version of the loop does; nothing else needs to be done, not even shifting 1 to the left j times, because the compiler evaluates constant expressions, and 1 << 4 becomes simply 16.

On the other hand, the loop without unrolling must do additional things to keep looping:

  • Incrementing j
  • Comparing j to 8
  • Shifting 1 left by j positions
  • Unconditional jump to the beginning of the loop

When the loop is unrolled, the CPU is no longer burdened with these "non-payload" instructions, so the execution speed goes up.

Why is such a core language feature so much slower in this situation?

Many modern compilers would unroll loops automatically for you, depending on the optimization settings.

Upvotes: 3
