Reputation: 3295
I'm building a little toy console using the ATtiny85 and a 128x64px OLED. In my initial build I used the built-in shiftOut()
and digitalWrite()
functions to shift display data out to the screen controller.
This netted me ~5fps, which was a little disappointing.
I wrote my own function that uses direct port manipulation to send the data and got a dramatic increase in speed ~23fps, which is not bad. Here's that function:
void shift_out_block(block)
{
byte b;
for (byte i = 0; i < 8; i++)
{
b = pgm_read_byte(block+i);
for (byte j=0 ; j < 8 ; j++)
{
if ( !!( b & (1 << j)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
}
}
}
23fps is ok, but it's no 30 or even 60fps (I'd actually have left it here if it was 24fps, but odd numbers...).
I understand why removing the library calls and manipulating ports directly improved things so much - the libraries are written to work on all sorts of different MCUs.
I vaguely remembered loop unravelling being a thing, so I unravelled the inner for
loop:
void shift_out_block()
{
byte b;
for (byte i = 0; i < 8; i++)
{
b = pgm_read_byte(block+i);
if ( !!( b & (1 << 0)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
if ( !!( b & (1 << 1)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
if ( !!( b & (1 << 2)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
if ( !!( b & (1 << 3)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
if ( !!( b & (1 << 4)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
if ( !!( b & (1 << 5)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
if ( !!( b & (1 << 6)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
if ( !!( b & (1 << 7)) )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
}
}
No effort at all, copy-paste-7-times. Gives me nearly 75fps - the original function executes in ~42ms, the new ugly one takes only ~13ms.
Out of interest, I broke the send bit part out as a separate function and called that 8 times:
void shift_out_bit(bool bit)
{
if ( bit )
{
PORTB |= 1 << SDA;
}
else
{
PORTB &= ~(1 << SDA);
}
PORTB |= 1 << SCL; // HIGH
PORTB &= ~(1 << SCL); // LOW
}
void shift_out_block()
{
byte b;
for (byte i = 0; i < 8; i++)
{
b = pgm_read_byte(block+i);
shift_out_bit( !!( b & (1 << 0)) );
shift_out_bit( !!( b & (1 << 1)) );
shift_out_bit( !!( b & (1 << 2)) );
shift_out_bit( !!( b & (1 << 3)) );
shift_out_bit( !!( b & (1 << 4)) );
shift_out_bit( !!( b & (1 << 5)) );
shift_out_bit( !!( b & (1 << 6)) );
shift_out_bit( !!( b & (1 << 7)) );
}
}
~22ms to execute, or 45.4545454545 fps, which isn't even nearly a nice number.
I'm not a C programmer by any stretch of the imagination - Python is my usual haunt (I did initially start this project in Python/RPi, but very quickly gave up on that!).
Why is such a core language feature so much slower in this situation? As my project becomes more complex, what other optimisations should I be thinking about?
Upvotes: 1
Views: 128
Reputation: 726987
Consider the "payload" operations done inside of your innermost loop:
b
PORTB |= 1 << SDA
vs. PORTB &= ~(1 << SDA)
PORTB
That is all the unrolled version of the loop does; nothing else needs to be done, not even shifting 1
to the left j
times, because the compiler evaluates constant expressions, and 1 << 4
becomes simply 16
.
On the other hand, the loop without unrolling must do additional things to keep looping:
j
j
to 81
left by j
positionsWhen the loop is unrolled, the CPU is no longer burdened with these "non-payload" instructions, so the execution speed goes up.
Why is such a core language feature so much slower in this situation?
Many modern compilers would unroll loops automatically for you, depending on the optimization settings.
Upvotes: 3