rumpel
rumpel

Reputation: 8308

Understanding STM8 pipelining

I’m trying to understand STM8 pipelining to be able to predict how much cycles my code will need.

I have this example, where I toggle a GPIO pin for 4 cycles each. Iff loop is aligned at 4byte-boundary + 3, the pin stays active for 5 cycles (i.e. one more than it should). I wonder why?

// Switches port D2, 5 cycles high, 4 cycles low
void main(void)
{
    __asm
        bset 0x5011, #2 ; output mode
        bset 0x5012, #2 ; push-pull
        bset 0x5013, #2 ; fast switching

        jra _loop
    .bndry 4
        nop
        nop
        nop
    _loop:
        nop
        bset 0x500f, #2
        nop
        nop
        nop
        bres 0x500f, #2
        jra _loop
    __endasm;
}

A bit more context:

So in cycles:

  1. bres clears the pin
  2. jra, pipeline flush, nop fetch
  3. nop decode, bset fetch
  4. nop execute, bset decode, next nop fetch
  5. bset execute sets the pin
  6. nop, bres fetch
  7. nop
  8. nop, bres decode
  9. bres execute clears the pin

According to this, the pin should stay LOW for 4 cycles and HIGH for 4 cycles, but it’s staying HIGH for 5 cycles.

In any other alignment case, the pin is LOW/HIGH for 4 cycles as expected.

I think, if the PIN stays high for an extra cycle that must mean that the execution pipeline is stalled after the bset instruction (the nops thereafter provide enough time to make sure that bres later is ready to execute immediately). But according to my understanding nop (for 6.) would already be fetched in 4.

Any idea how this behavior can be explained? I couldn’t find any hints in the manual.

Upvotes: 3

Views: 908

Answers (1)

Jens
Jens

Reputation: 31

It is explained in section 5.4, which basically says that throughout the programming manual, "a simplified convention providing a good match with reality" will be used. From my experience, this simplified convention is indeed a good approximate for a longer sequence, but unusable for exact per-instruction timing, even if you're working on assembly level and control alignment. Take "SLA addr" as an example. It is documented to use 1 cycle. Put three of them in sequence to implement the C equivalent of "*(addr) << 3", and you'll clock up 5-6 cycles.

Actual cycles used for decoding and execution are undocumented. Apart from the obvious reasons, there is no comprehensive documentation about what causes pipeline stalls. I was able to get some insight into this by configuring TIM2 with a prescaler of /1 and reload values of 0xFFFF while using ST-LINK/V2 to step through my code. You can then keep a watch on TIM2_CNTRL to see cycles consumed (== the aggregate value of executing the previous and decoding the current instruction).

Things to keep an eye on are obviously instructions spanning 32-bit boundaries. There were also cases where loading instructions from the next 32-bit word caused an unexpected additional cycle in a sequence of NOPs, suggesting that any fetch (even if not necessary for the current or next instruction) costs 1 cycle? I've seen CALLs to targets aligned to 32 bit boundaries taking 4-7 cycles, suggesting that the CPU was still busy executing the previous instruction or stalling the call for unknown reason. Modifying the SP (push/pop or direct add/sub) seems to be causing stalls under certain conditions.

Any additional insight appreciated!

Upvotes: 2

Related Questions