Reputation: 8308
I’m trying to understand STM8 pipelining to be able to predict how much cycles my code will need.
I have this example, where I toggle a GPIO pin for 4 cycles each.
Iff loop
is aligned at 4byte-boundary + 3, the pin stays active for 5 cycles (i.e. one more than it should). I wonder why?
// Switches port D2, 5 cycles high, 4 cycles low
void main(void)
{
__asm
bset 0x5011, #2 ; output mode
bset 0x5012, #2 ; push-pull
bset 0x5013, #2 ; fast switching
jra _loop
.bndry 4
nop
nop
nop
_loop:
nop
bset 0x500f, #2
nop
nop
nop
bres 0x500f, #2
jra _loop
__endasm;
}
A bit more context:
bset
/bres
are 4 byte instructions, nop
1 byte.nop
/bset
/bres
instructions take 1 cycle each.jra
instruction takes two cycles. I think in the first cycle, the instruction cache is filled with the next 32bit value, i.e. in this case the nop
instruction only. And the 2nd cycle is actually just the CPU being stalled while decoding the next instruction.So in cycles:
bres
clears the pinjra
, pipeline flush, nop
fetchnop
decode, bset
fetchnop
execute, bset
decode, next nop
fetchbset
execute sets the pinnop
, bres
fetchnop
nop
, bres
decodebres
execute clears the pinAccording to this, the pin should stay LOW for 4 cycles and HIGH for 4 cycles, but it’s staying HIGH for 5 cycles.
In any other alignment case, the pin is LOW/HIGH for 4 cycles as expected.
I think, if the PIN stays high for an extra cycle that must mean that the execution pipeline is stalled after the bset
instruction (the nop
s thereafter provide enough time to make sure that bres
later is ready to execute immediately). But according to my understanding nop
(for 6.) would already be fetched in 4.
Any idea how this behavior can be explained? I couldn’t find any hints in the manual.
Upvotes: 3
Views: 908
Reputation: 31
It is explained in section 5.4, which basically says that throughout the programming manual, "a simplified convention providing a good match with reality" will be used. From my experience, this simplified convention is indeed a good approximate for a longer sequence, but unusable for exact per-instruction timing, even if you're working on assembly level and control alignment. Take "SLA addr" as an example. It is documented to use 1 cycle. Put three of them in sequence to implement the C equivalent of "*(addr) << 3", and you'll clock up 5-6 cycles.
Actual cycles used for decoding and execution are undocumented. Apart from the obvious reasons, there is no comprehensive documentation about what causes pipeline stalls. I was able to get some insight into this by configuring TIM2 with a prescaler of /1 and reload values of 0xFFFF while using ST-LINK/V2 to step through my code. You can then keep a watch on TIM2_CNTRL to see cycles consumed (== the aggregate value of executing the previous and decoding the current instruction).
Things to keep an eye on are obviously instructions spanning 32-bit boundaries. There were also cases where loading instructions from the next 32-bit word caused an unexpected additional cycle in a sequence of NOPs, suggesting that any fetch (even if not necessary for the current or next instruction) costs 1 cycle? I've seen CALLs to targets aligned to 32 bit boundaries taking 4-7 cycles, suggesting that the CPU was still busy executing the previous instruction or stalling the call for unknown reason. Modifying the SP (push/pop or direct add/sub) seems to be causing stalls under certain conditions.
Any additional insight appreciated!
Upvotes: 2