MalphasWats

Reputation: 3295

Why is this nested for loop so much slower than unrolling the same code?

I'm building a little toy console using the ATtiny85 and a 128x64px OLED. In my initial build I used the built-in shiftOut() and digitalWrite() functions to shift display data out to the screen controller.

This netted me ~5fps, which was a little disappointing.

I wrote my own function that uses direct port manipulation to send the data and got a dramatic increase in speed ~23fps, which is not bad. Here's that function:

void shift_out_block(const byte *block)
{
    byte b;
    for (byte i = 0; i < 8; i++)  
    {
        b = pgm_read_byte(block+i);

        for (byte j=0 ; j < 8 ; j++)
        {
            if ( !!( b & (1 << j)) )
            {
                PORTB |= 1 << SDA;
            }
            else
            {
                PORTB &= ~(1 << SDA);
            }

            PORTB |= 1 << SCL; // HIGH
            PORTB &= ~(1 << SCL); // LOW
        }
    }
}

23fps is ok, but it's no 30 or even 60fps (I'd actually have left it here if it was 24fps, but odd numbers...).

I understand why removing the library calls and manipulating ports directly improved things so much - the libraries are written to work on all sorts of different MCUs.

I vaguely remembered loop unrolling being a thing, so I unrolled the inner for loop:

void shift_out_block(const byte *block)
{
    byte b;
    for (byte i = 0; i < 8; i++)  
    {
        b = pgm_read_byte(block+i);

        if ( !!( b & (1 << 0)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 1)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 2)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 3)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 4)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 5)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 6)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW

        if ( !!( b & (1 << 7)) )
        {
            PORTB |= 1 << SDA;
        }
        else
        {
            PORTB &= ~(1 << SDA);
        }

        PORTB |= 1 << SCL; // HIGH
        PORTB &= ~(1 << SCL); // LOW
    }
}

No effort at all - copy, paste, seven times. It gives me nearly 75fps: the original function executes in ~42ms, the new ugly one takes only ~13ms.

Out of interest, I broke the send-bit part out into a separate function and called that 8 times:

void shift_out_bit(bool bit)
{
    if ( bit )
    {
        PORTB |= 1 << SDA;
    }
    else
    {
        PORTB &= ~(1 << SDA);
    }

    PORTB |= 1 << SCL; // HIGH
    PORTB &= ~(1 << SCL); // LOW
}

void shift_out_block(const byte *block)
{
    byte b;
    for (byte i = 0; i < 8; i++)  
    {
        b = pgm_read_byte(block+i);

        shift_out_bit( !!( b & (1 << 0)) );
        shift_out_bit( !!( b & (1 << 1)) );
        shift_out_bit( !!( b & (1 << 2)) );
        shift_out_bit( !!( b & (1 << 3)) );
        shift_out_bit( !!( b & (1 << 4)) );
        shift_out_bit( !!( b & (1 << 5)) );
        shift_out_bit( !!( b & (1 << 6)) );
        shift_out_bit( !!( b & (1 << 7)) );
    }
}

~22ms to execute, or 45.4545454545 fps, which isn't even nearly a nice number.

I'm not a C programmer by any stretch of the imagination - Python is my usual haunt (I did initially start this project in Python/RPi, but very quickly gave up on that!).

Why is such a core language feature so much slower in this situation? As my project becomes more complex, what other optimisations should I be thinking about?

Upvotes: 1

Views: 128

Answers (1)

Sergey Kalinichenko

Reputation: 726987

Consider the "payload" operations done inside of your innermost loop:

  • Check of a specific bit in b
  • Conditional jump to deal with PORTB |= 1 << SDA vs. PORTB &= ~(1 << SDA)
  • Three operations on PORTB

That is all the unrolled version of the loop does; nothing else needs to be done, not even shifting 1 to the left j times, because the compiler evaluates constant expressions, and 1 << 4 becomes simply 16.

On the other hand, the loop without unrolling must do additional things to keep looping:

  • Incrementing j
  • Comparing j to 8
  • Shifting 1 left by j positions
  • Unconditional jump to the beginning of the loop

When the loop is unrolled, the CPU is no longer burdened with these "non-payload" instructions, so the execution speed goes up.

Why is such a core language feature so much slower in this situation?

Many modern compilers would unroll loops automatically for you, depending on the optimization settings.

Upvotes: 3
