Reputation: 2824
I'm writing some assembly code for the Cortex-M4, specifically the STM32F407VG found in the STM32F4DISCOVERY kit.
The code is extremely performance-sensitive, so I'm looking to squeeze every last cycle out of it. I have benchmarked it (using the DWT cycle counter available in the Cortex-M4) and for a certain size of input, it runs at 1494 cycles. The code runs from flash, and the CPU is downclocked to 24 MHz to ensure true zero-wait-state accesses to flash (ART Accelerator disabled). Benchmarking two back-to-back reads of the DWT cycle counter results in a single cycle, so that's the sole overhead related to benchmarking.
The code only reads 5 constant 32-bit words from flash (which might cause bus matrix contention, since both instructions and data would then come from flash); all other data memory accesses go to/from RAM. I've ensured all branch targets are 32-bit aligned, and manually added .W suffixes to certain instructions so that all but two 32-bit instructions are 32-bit aligned rather than only 16-bit aligned -- one of the two doesn't even run for this input size, and the other is the final POP instruction of the function, which obviously doesn't run in a loop. Note that the majority of instructions use the 32-bit encoding: indeed, the average instruction length is 3.74 bytes.
I also made a spreadsheet accounting for every single instruction of my code and how many times it runs (including loop iteration counts), even accounting for whether each branch was taken or not taken, since that affects how many cycles a given instruction takes. I read the Cortex-M4 Technical Reference Manual (TRM) to obtain cycle counts for each instruction, and always used the most conservative estimate: where an instruction's cost depends on a pipeline flush, I assumed the maximum of 3 cycles; I also assumed the worst case for all loads and stores, despite the many special cases discussed in section 3.3.2 of the TRM that might actually reduce these counts. My spreadsheet includes the cost of every instruction between the two reads of the DWT cycle counter.
Thus, I was very surprised to learn that my spreadsheet predicts the code should run in 1268 cycles (recall the actual performance is 1494 cycles). I am at a loss to explain why the code runs 18% slower than the supposedly worst case according to instruction timings. Even fully unrolling the main loop of the code (which should be responsible for ~3/4 of the execution time) only brings it down to 1429 cycles -- and quickly adjusting the spreadsheet indicates that this unrolled version should run in 1186 cycles.
What's interesting is that a fully unrolled, carefully tuned C version of the same algorithm runs in 1309 cycles. It has 1013 instructions in total, whereas the fully unrolled version of my assembly code has 930 instructions. In both cases there is some code that handles a case not exercised by the particular input used for benchmarking, but there should be no significant differences between the C and assembly versions with regard to this unused code. Finally, the average instruction length of the C code is not significantly smaller: 3.59 bytes.
So: what could possibly be causing this non-trivial discrepancy between predicted and actual performance in my assembly code? And what could the C version possibly be doing to run faster, despite having more instructions and a broadly similar (slightly smaller, but not by much) average instruction length and mix of 16- and 32-bit encodings?
As requested, here is a suitably anonymized minimal reproducible example. Because I isolated a single section of code, the error between prediction and actual measurement decreased to 12.5% for the non-unrolled version (and even less, 7.6%, for the unrolled version), but I still consider this a bit high, especially for the non-unrolled version, given the simplicity of the core and the use of worst-case timings.
First, the main assembly function:
// #define UNROLL
.cpu cortex-m4
.arch armv7e-m
.fpu softvfp
.syntax unified
.thumb
.macro MACRO r_0, r_1, r_2, d
ldr lr, [r0, #\d]
and \r_0, \r_0, \r_1, ror #11
and \r_0, \r_0, \r_1, ror #11
and lr, \r_0, lr, ror #11
and lr, \r_0, lr, ror #11
and \r_2, \r_2, lr, ror #11
and \r_2, \r_2, lr, ror #11
and \r_1, \r_2, \r_1, ror #11
and \r_1, \r_2, \r_1, ror #11
str lr, [r0, #\d]
.endm
.text
.p2align 2
.global f
f:
push {r4-r11,lr}
ldmia r0, {r1-r12}
.p2align 2
#ifndef UNROLL
mov lr, #25
push.w {lr}
loop:
#else
.rept 25
#endif
MACRO r1, r2, r3, 48
MACRO r4, r5, r6, 52
MACRO r7, r8, r9, 56
MACRO r10, r11, r12, 60
#ifndef UNROLL
ldr lr, [sp]
subs lr, lr, #1
str lr, [sp]
bne loop
add.w sp, sp, #4
#else
.endr
#endif
stmia r0, {r1-r12}
pop {r4-r11,pc}
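For readers who prefer C, here is a hedged transcription of one MACRO instantiation (my rendering of the assembly above, not the separately tuned C version mentioned earlier); the byte offsets 48/52/56/60 used below correspond to word indices 12 to 15 of the array passed in r0:
#include <stdint.h>
static inline uint32_t ror11(uint32_t x) { return (x >> 11) | (x << 21); }
/* One "MACRO r_0, r_1, r_2, d" instantiation, with d given as a word index i = d/4 */
static inline void macro_c(uint32_t *p, unsigned i,
                           uint32_t *r_0, uint32_t *r_1, uint32_t *r_2)
{
    uint32_t lr = p[i];              /* ldr lr, [r0, #d]                */
    *r_0 &= ror11(*r_1);             /* and r_0, r_0, r_1, ror #11 (x2) */
    *r_0 &= ror11(*r_1);
    lr    = *r_0 & ror11(lr);        /* and lr, r_0, lr, ror #11  (x2)  */
    lr    = *r_0 & ror11(lr);
    *r_2 &= ror11(lr);               /* and r_2, r_2, lr, ror #11 (x2)  */
    *r_2 &= ror11(lr);
    *r_1  = *r_2 & ror11(*r_1);      /* and r_1, r_2, r_1, ror #11 (x2) */
    *r_1  = *r_2 & ror11(*r_1);
    p[i]  = lr;                      /* str lr, [r0, #d]                */
}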
This is the main code (it requires the STM32F4 HAL and outputs data via SWO, which can be read using the ST-Link Utility or the st-trace utility from here, with the command line st-trace -c24):
#include "stm32f4xx_hal.h"
void SysTick_Handler(void) {
HAL_IncTick();
}
void SystemClock_Config(void) {
RCC_OscInitTypeDef RCC_OscInitStruct;
RCC_ClkInitTypeDef RCC_ClkInitStruct;
// Enable Power Control clock
__HAL_RCC_PWR_CLK_ENABLE();
// The voltage scaling allows optimizing the power consumption when the device is
// clocked below the maximum system frequency, to update the voltage scaling value
// regarding system frequency refer to product datasheet.
__HAL_PWR_VOLTAGESCALING_CONFIG(PWR_REGULATOR_VOLTAGE_SCALE2);
// Enable HSE Oscillator and activate PLL with HSE as source
RCC_OscInitStruct.OscillatorType = RCC_OSCILLATORTYPE_HSE;
RCC_OscInitStruct.HSEState = RCC_HSE_ON; // External 8 MHz xtal on OSC_IN/OSC_OUT
RCC_OscInitStruct.PLL.PLLState = RCC_PLL_ON; // 8 MHz / 8 * 192 / 8 = 24 MHz
RCC_OscInitStruct.PLL.PLLSource = RCC_PLLSOURCE_HSE;
RCC_OscInitStruct.PLL.PLLM = 8; // VCO input clock = 8 MHz (HSE) / PLLM = 1 MHz
RCC_OscInitStruct.PLL.PLLN = 192; // VCO output clock = VCO input clock * PLLN = 192 MHz
RCC_OscInitStruct.PLL.PLLP = RCC_PLLP_DIV8; // PLLCLK = VCO output clock / PLLP = 24 MHz
RCC_OscInitStruct.PLL.PLLQ = 4; // USB clock = VCO output clock / PLLQ = 48 MHz
if (HAL_RCC_OscConfig(&RCC_OscInitStruct) != HAL_OK) {
while (1)
;
}
// Select PLL as system clock source and configure the HCLK, PCLK1 and PCLK2 clocks dividers
RCC_ClkInitStruct.ClockType = RCC_CLOCKTYPE_SYSCLK | RCC_CLOCKTYPE_HCLK | RCC_CLOCKTYPE_PCLK1 | RCC_CLOCKTYPE_PCLK2;
RCC_ClkInitStruct.SYSCLKSource = RCC_SYSCLKSOURCE_PLLCLK; // 24 MHz
RCC_ClkInitStruct.AHBCLKDivider = RCC_SYSCLK_DIV1; // 24 MHz
RCC_ClkInitStruct.APB1CLKDivider = RCC_HCLK_DIV1; // 24 MHz
RCC_ClkInitStruct.APB2CLKDivider = RCC_HCLK_DIV1; // 24 MHz
if (HAL_RCC_ClockConfig(&RCC_ClkInitStruct, FLASH_LATENCY_0) != HAL_OK) {
while (1)
;
}
}
void print_cycles(uint32_t cycles) {
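// Print the cycle count as exactly four decimal digits (enough for the counts measured here)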
uint32_t q = 1000, t;
for (int i = 0; i < 4; i++) {
t = (cycles / q) % 10;
ITM_SendChar('0' + t);
q /= 10;
}
ITM_SendChar('\n');
}
void f(uint32_t *);
int main(void) {
uint32_t x[16];
SystemClock_Config();
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
uint32_t before, after;
while (1) {
__disable_irq();
before = DWT->CYCCNT;
f(x);
after = DWT->CYCCNT;
__enable_irq();
print_cycles(after - before);
HAL_Delay(1000);
}
}
I believe this is enough to dump into a project containing the STM32F4 HAL and run the code. The project needs to add a global #define for HSE_VALUE=8000000, since the HAL assumes a 25 MHz crystal rather than the 8 MHz crystal actually fitted to the board.
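A typical way to provide that define, assuming a GCC-based toolchain (the exact mechanism is my example, not part of the original project description):
/* Either pass -DHSE_VALUE=8000000 project-wide on the compiler command line, or
   make sure the definition is seen before any HAL header, e.g.: */
#define HSE_VALUE 8000000U   /* stm32f4xx_hal_conf.h otherwise defaults this to
                                25000000U; a project-wide -D is the safer option,
                                since system_stm32f4xx.c also uses HSE_VALUE */
#include "stm32f4xx_hal.h"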
There is a choice between the unrolled and non-unrolled versions by commenting/uncommenting #define UNROLL at the start of the assembly code.
Running arm-none-eabi-objdump on the main() function and looking at the call site:
80009da: 4668 mov r0, sp
before = DWT->CYCCNT;
80009dc: 6865 ldr r5, [r4, #4]
f(x);
80009de: f7ff fbd3 bl 8000188 <f>
after = DWT->CYCCNT;
80009e2: 6860 ldr r0, [r4, #4]
Thus, the only instruction between the two reads of the DWT cycle counter is the bl that branches into the f() assembly function.
The non-unrolled version runs in 1536 cycles, whereas the unrolled version runs in 1356 cycles.
Here is my spreadsheet for the non-unrolled version (not accounting for the already measured 1-cycle overhead of reading the DWT cycle counter):
Instruction | Loop iters | Macro repeats | Count | Cycle count | Total cycles |
---|---|---|---|---|---|
bl (from main) | 1 | 1 | 1 | 4 | 4 |
push (12 regs) | 1 | 1 | 1 | 13 | 13 |
ldmia (12 regs) | 1 | 1 | 1 | 13 | 13 |
mov | 1 | 1 | 1 | 1 | 1 |
push (1 reg) | 1 | 1 | 1 | 2 | 2 |
ldr | 25 | 4 | 1 | 2 | 200 |
and | 25 | 4 | 8 | 1 | 800 |
str | 25 | 4 | 1 | 2 | 200 |
ldr | 1 | 1 | 1 | 2 | 2 |
subs | 1 | 1 | 1 | 1 | 1 |
str | 1 | 1 | 1 | 2 | 2 |
bne (taken) | 24 | 1 | 1 | 4 | 96 |
bne (not taken) | 1 | 1 | 1 | 1 | 1 |
stmia (12 regs) | 1 | 1 | 1 | 13 | 13 |
pop (11 regs + pc) | 1 | 1 | 1 | 16 | 16 |
Total | | | | | 1364 |
The last column is just the product of the 2nd through 5th columns of the table, and the last row is the sum of the "Total cycles" column. This is the predicted execution time.
Thus, for the non-unrolled version: 1536/(1364 + 1) - 1 = 12.5% error (the + 1 term is to account for the DWT cycle counter overhead).
As for the unrolled version, a few instructions must be removed from the table above: the loop setup (mov and push (1 reg)) and the loop counter update and branch (ldr, subs, str and bne, both taken and not taken). This works out to 105 cycles, so the predicted performance would be 1259 cycles.
For the unrolled version, we have 1356/(1259 + 1) - 1 = 7.6% error.
Upvotes: 10
Views: 1158
Reputation: 2824
I put this problem aside for a while, and after a few hours of looking at it with a fresh outlook, I was able to actually beat the worst-case timing predictions shown in the question (emphasizing that they are worst case, so it is not unexpected that they can be beaten). There are two entirely separate issues at play, and I will treat them one at a time.
First of all, as seen in the comments to the existing answers, one trick that I discovered was to map the stack to CCMRAM. However, this never made sense to me, unless going through the STM32F407's bus matrix introduced delays, something I found no evidence for.
It turns out that my hunch was correct: it is possible to achieve full speed without involving CCMRAM. The key is Figure 1 (the system architecture / bus matrix diagram) in Section 2 of the STM32F407 reference manual, which shows which bus masters connect to which memories.
Additionally, note the following remarks in the Cortex-M4 Technical Reference Manual, Section 2.3.1 ("Bus interfaces"):
System interface
Instruction fetches and data and debug accesses to address ranges 0x20000000 to 0xDFFFFFFF and 0xE0100000 to 0xFFFFFFFF are performed over the 32-bit AHB-Lite bus.
For simultaneous accesses to the 32-bit AHB-Lite bus, the arbitration order in decreasing priority is:
- Data accesses.
- Instruction and vector fetches.
- Debug.
The system bus interface contains control logic to handle unaligned accesses, FPB remapped accesses, bit-band accesses, and pipelined instruction fetches.
Pipelined instruction fetches
To provide a clean timing interface on the system bus, instruction and vector fetch requests to this bus are registered.
This results in an extra cycle of latency because instructions fetched from the system bus take two cycles. This also means that back-to-back instruction fetches from the system bus are not possible.
(Emphasis on the last paragraph is mine.) Note, therefore, that accesses through the system bus (the S-bus in the figure mentioned above) take an extra cycle. Also note, by looking at the bus matrix, that there is no connection between the core's D-bus and SRAM2, only SRAM1. As per the STM32F407 reference manual, SRAM2 corresponds to addresses in the range 0x2001C000-0x2001FFFF, i.e. the last 16 KB of the 128 KB block of regular, non-CCM RAM.
Now combine this with the usual technique for initializing the stack pointer in the linker script (relevant sections quoted from a linker script which, to the best of my recollection, comes directly from ST):
MEMORY
{
CCMRAM (xrw) : ORIGIN = 0x10000000, LENGTH = 64K
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 1024K
}
/* Highest address of the user mode stack */
_estack = ORIGIN(RAM) + LENGTH(RAM); /* end of "RAM" Ram type memory */
As written, this puts the initial stack pointer at 0x20020000, and thus the first 16 KB of stack usage falls directly in SRAM2, which is slower. While this is generally a sound strategy to avoid stack overflows (statically allocated variables start from the lowest RAM address while the stack pointer starts from the highest, creating the largest possible gap between the two), it has serious performance implications here.
Indeed, just by relocating the stack pointer to SRAM1, I was able to reduce the execution time of the MRE in my question from 1536 to 1407 cycles.
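As a quick sanity check for other projects, here is a minimal sketch (my addition, not part of the original firmware) that tests at run time whether the active stack currently sits in SRAM2; it only relies on the CMSIS __get_MSP() intrinsic that ships with the HAL used above:
#include <stdint.h>
#include "stm32f4xx_hal.h"   /* pulls in core_cm4.h for __get_MSP() */

/* Returns 1 if the main stack pointer currently points into SRAM2
   (0x2001C000-0x2001FFFF), the block not reachable from the D-bus. */
static int stack_in_sram2(void)
{
    uint32_t sp = __get_MSP();
    return (sp >= 0x2001C000UL) && (sp < 0x20020000UL);
}
The fix itself is then a one-line change to the linker script, for example _estack = 0x2001C000; so that the stack grows down entirely within SRAM1.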
The implications of this go beyond the toy example in my question; this should affect every STM32F407 firmware based on the default linker script supplied by ST. What ST did here, when factoring in the lack of a connection between the Cortex-M4 D-bus and SRAM2 in the bus matrix, and the default choice of the stack pointer, borders on criminal/grossly negligent. The amount of performance lost/energy wasted worldwide due to this, considering all shipped STM32F407 units (and possibly many other MCUs affected by this issue), is simply unthinkable. Shame on you ST!
In section 3.3.3 ("Load/store timings") of the Cortex-M4 Technical Reference Manual, a series of considerations are made regarding the pairing of load and store instructions. Quoting the first assertion:
STR Rx,[Ry,#imm]
is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete. If the store is to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.
Note that the macro in my MRE starts with a load and ends with a store (and this store uses the exact addressing mode referenced above); given these macros are instantiated in sequence, the store at the end of one instance is followed by a load at the beginning of the next one.
My understanding is that this write buffer is enabled by default: see Section 4.4.1 ("Auxiliary control register (ACTLR)"), bit 1 ("DISDEFWBUF"), of the STM32 Cortex-M4 programming manual, and note that the reset state of this register is all zero bits; for bit 1, that means "Enable write buffer use". Also, I'd like to think the write buffer would drain after a couple of cycles, certainly quicker than the 10+ cycles between one store and the next (from the next instantiation of the macro).
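For completeness, the DISDEFWBUF bit can be inspected, or deliberately set to see how much the write buffer is worth, from C. This is a sketch using the register and bit names from CMSIS's core_cm4.h; the experiment itself is my suggestion, not something taken from the programming manual:
#include "stm32f4xx_hal.h"   /* brings in core_cm4.h (SCnSCB, ACTLR bit masks) */

/* Returns 1 if the write buffer for default-memory stores is enabled
   (DISDEFWBUF = 0 is the reset state, as noted above). */
static int write_buffer_enabled(void)
{
    return (SCnSCB->ACTLR & SCnSCB_ACTLR_DISDEFWBUF_Msk) == 0;
}

/* Disable the write buffer: every STR then stalls until it completes,
   which makes the true cost of each store visible in the cycle counts. */
static void disable_write_buffer(void)
{
    SCnSCB->ACTLR |= SCnSCB_ACTLR_DISDEFWBUF_Msk;
    __DSB();
}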
Regardless of this, I decided to experiment with moving the store instruction earlier in the code stream, so that it is not adjacent to a load from the next instantiation of the macro. That is, I rewrote the macro from the MRE in the question to the following:
.macro MACRO r_0, r_1, r_2, d
ldr lr, [r0, #\d]
and \r_0, \r_0, \r_1, ror #11
and \r_0, \r_0, \r_1, ror #11
and lr, \r_0, lr, ror #11
and lr, \r_0, lr, ror #11
and \r_2, \r_2, lr, ror #11
and \r_2, \r_2, lr, ror #11
str lr, [r0, #\d]
and \r_1, \r_2, \r_1, ror #11
and \r_1, \r_2, \r_1, ror #11
.endm
This version reduced the cycle count from 1407 (after applying the fix to issue #1 above) to 1307. That's exactly 100 cycles, and I don't think it's a coincidence that the change above eliminates 100 instances of an STR immediately followed by an LDR. Most importantly, I have now beaten the original prediction of 1364 cycles from the table in the question, so at least I have reached (and indeed improved upon) the worst case. On the other hand, given the quote above on how STR Rx,[Ry,#imm] should always take one cycle, maybe a better estimate would be 1264 cycles, so there are still 43 cycles of difference left to explain. If anyone can further improve either the predictions or the code to reach this conjectured 1264-cycle bound, I'd be very interested to know.
Finally, this SO question and its answers may contain relevant information. I'll reread it a couple of times over the next few days to see if it provides further insights.
Upvotes: 4
Reputation: 71586
1. You are making assumptions about overall timing based on instruction timing in the documentation. The processor has not been the thing driving performance for a long time now.
2. You have memory accesses in your test.
2a. You have both aligned and unaligned memory accesses in your test.
3. Pretty sure the ART is on; I have tried many times to turn it off. Maybe it was a Cortex-M7 where I could at least get one pass with it off, or something, cannot remember. You need to run from SRAM, not flash. (See the sketch just after this list for the usual attempt at turning it off.)
4. Zero wait states does not mean zero wait states: flash is often a couple of clocks per access, if not more (with zero EXTRA wait states). Difficult to impossible to determine on STM32 parts; on TI parts and others that do not have this flash cache (ART), the real flash performance is much easier to see.
5. And other stuff.
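For reference, the usual attempt to switch the ART accelerator and prefetch off from C looks like the following (a sketch using the ST HAL macros from stm32f4xx_hal_flash.h; whether this genuinely removes the ART's effect on instruction timing is exactly what is being questioned above):
#include "stm32f4xx_hal.h"

static void try_disable_art(void)
{
    __HAL_FLASH_PREFETCH_BUFFER_DISABLE();      /* PRFTEN = 0               */
    __HAL_FLASH_INSTRUCTION_CACHE_DISABLE();    /* ICEN = 0                 */
    __HAL_FLASH_DATA_CACHE_DISABLE();           /* DCEN = 0                 */
    __HAL_FLASH_INSTRUCTION_CACHE_RESET();      /* flush while disabled     */
    __HAL_FLASH_DATA_CACHE_RESET();
}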
I do not know what you mean by nops related to normal thumb instructions and forcing thumb2 extensions. Where are these nops?
Excellent work BTW, I am not dismissing that in any way. Just wanted to add some extra info that I cannot tell if you timed or not, since your tests are definitely touching system timing issues and are beyond the instruction timing.
So, from the ARM ARM and the ARM TRM for the Cortex-M4:
Instruction fetches from Code memory space, 0x00000000 to 0x1FFFFFFC , are performed over the 32-bit AHB-Lite bus.
All fetches are word-wide. The number of instructions fetched per word depends on the code running and the alignment of the code in memory.
Well, instructions are either one halfword or two, so 16 or 32 bits total, and we can use that information to cause a performance hit (especially if you force all instructions to use the thumb2 extensions).
I can provide complete, 100% sources in this answer since I use no libraries in my test. The processor is slow enough to have "zero wait states" on the flash; it runs at 8 MHz off the crystal only to let the UART that prints results be more accurate, otherwise the internal clock is fine. NUCLEO-F411RE, so it should be the same M4 core that they purchased for the F4 discovery. I have some of those original F4 discoveries laying around here somewhere, as well as a few of the cheap clones, but the Nucleo is so much easier and I had it laying nearby.
Most of the time, and certainly in this case, you do not need to mess with the DWT cycle counter, as SysTick gives the same answer. Some implementations (other vendors, if any) may divide the system clock into the SysTick (if there is a SysTick; there might not be a DWT either), but not in this case: I get the same results with either, and SysTick is slightly easier, so...
ldr r2,[r0]
loop:
subs r1,#1
bne loop
ldr r3,[r0]
subs r0,r2,r3
bx lr
Start with a simple loop: pass in the address of the timer's current-value register (SysTick in this case; swap r2 and r3 if using the DWT cycle counter, which counts up rather than down) to measure right around the loop under test.
hexstring(STK_MASK&TEST(STK_CVR,0x1000));
hexstring(STK_MASK&TEST(STK_CVR,0x1000));
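The helper names above (TEST, hexstring, STK_CVR, STK_MASK) come from the author's own test framework and are not shown; a self-contained, CMSIS-flavoured sketch of an equivalent harness (my names and assumptions, not the author's code) could look like this:
#include <stdint.h>
#include "stm32f4xx.h"                 /* CMSIS: SysTick registers and bit masks */

/* The assembly loop shown above: r0 = address of the timer register,
   r1 = iteration count, returns (first read - second read). */
extern uint32_t TEST(volatile uint32_t *timer_reg, uint32_t count);

static uint32_t time_loop(uint32_t count)
{
    SysTick->LOAD = 0x00FFFFFF;                     /* full 24-bit range       */
    SysTick->VAL  = 0;                              /* clear current value     */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk      /* run from the core clock */
                  | SysTick_CTRL_ENABLE_Msk;        /* counting, no interrupt  */
    /* SysTick counts down, so before - after (what the asm returns) is the
       elapsed count; mask to the counter's 24 bits like STK_MASK does. */
    return TEST(&SysTick->VAL, count) & 0x00FFFFFF;
}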
800011e: 6802 ldr r2, [r0, #0]
08000120 <loop>:
8000120: f1b1 0101 subs.w r1, r1, #1
8000124: f47f affc bne.w 8000120 <loop>
8000128: 6803 ldr r3, [r0, #0]
800012a: 1ad0 subs r0, r2, r3
800012c: 4770 bx lr
800012e: bf00 nop
00003001
00003001
thumb2 extensions, the loop itself is aligned (on an 8 word boundary).
800011e: 6802 ldr r2, [r0, #0]
08000120 <loop>:
8000120: 3901 subs r1, #1
8000122: d1fd bne.n 8000120 <loop>
8000124: 6803 ldr r3, [r0, #0]
8000126: 1ad0 subs r0, r2, r3
8000128: 4770 bx lr
800012a: bf00 nop
00003001
00003001
Thumb instructions, it doesn't matter at this point:
8000120: 6802 ldr r2, [r0, #0]
08000122 <loop>:
8000122: 3901 subs r1, #1
8000124: d1fd bne.n 8000122 <loop>
8000126: 6803 ldr r3, [r0, #0]
8000128: 1ad0 subs r0, r2, r3
800012a: 4770 bx lr
00003001
00003001
change the alignment by a halfword, thumb instructions, does not change results
8000120: 6802 ldr r2, [r0, #0]
08000122 <loop>:
8000122: f1b1 0101 subs.w r1, r1, #1
8000126: f47f affc bne.w 8000122 <loop>
800012a: 6803 ldr r3, [r0, #0]
800012c: 1ad0 subs r0, r2, r3
800012e: 4770 bx lr
00004000
00004000
thumb2 extensions unaligned, we see the extra fetch, or assume it is the extra fetch.
I have not been able to turn off the ART in the years since the STM32s came out. The prefetch bit in the flash acr does not affect the results here. Let's run from sram as well as flash.
800011e: 6802 ldr r2, [r0, #0]
08000120 <loop>:
8000120: f1b1 0101 subs.w r1, r1, #1
8000124: f47f affc bne.w 8000120 <loop>
8000128: 6803 ldr r3, [r0, #0]
800012a: 1ad0 subs r0, r2, r3
800012c: 4770 bx lr
00003001 flash
00003001
00005FFF sram
00005FFF
thumb2 extensions, aligned.
8000120: 6802 ldr r2, [r0, #0]
08000122 <loop>:
8000122: f1b1 0101 subs.w r1, r1, #1
8000126: f47f affc bne.w 8000122 <loop>
800012a: 6803 ldr r3, [r0, #0]
800012c: 1ad0 subs r0, r2, r3
800012e: 4770 bx lr
00004000 flash
00004000
00007FFD sram
00007FFD
thumb2 extensions, unaligned, we see what is assumed to be that extra fetch.
8000120: 6802 ldr r2, [r0, #0]
08000122 <loop>:
8000122: 3901 subs r1, #1
8000124: d1fd bne.n 8000122 <loop>
8000126: 6803 ldr r3, [r0, #0]
8000128: 1ad0 subs r0, r2, r3
800012a: 4770 bx lr
00003001
00003001
00005FFD
00005FFD
thumb, unaligned
800011e: 6802 ldr r2, [r0, #0]
08000120 <loop>:
8000120: 3901 subs r1, #1
8000122: d1fd bne.n 8000120 <loop>
8000124: 6803 ldr r3, [r0, #0]
8000126: 1ad0 subs r0, r2, r3
8000128: 4770 bx lr
00003001
00003001
00004001
00004001
thumb aligned, that is very interesting. we will see that later
From the documented instruction timings: subs is 1 cycle, a taken bne is 2 to 4 cycles (1 cycle plus a 1 to 3 cycle pipeline reload), and a not-taken bne is 1 cycle. For 0x1000 passes through the loop:
instruction      count     min total   max total
subs             0x1000    0x1000      0x1000
bne (taken)      0x0FFF    0x1FFE      0x3FFC
bne (not taken)  0x0001    0x0001      0x0001
                           =========   =========
                           0x2FFF      0x4FFD
Your test has a lot of stuff in it that I think was not needed, and you had aligned and unaligned loads and stores mixed in; I separated those out and took a portion of your test...
800021c: b570 push {r4, r5, r6, lr}
800021e: 6802 ldr r2, [r0, #0]
08000220 <loop2>:
8000220: ea04 24f5 and.w r4, r4, r5, ror #11
8000224: ea04 24f5 and.w r4, r4, r5, ror #11
8000228: ea04 2efe and.w lr, r4, lr, ror #11
800022c: ea04 2efe and.w lr, r4, lr, ror #11
8000230: ea06 26fe and.w r6, r6, lr, ror #11
8000234: ea06 26fe and.w r6, r6, lr, ror #11
8000238: ea06 25f5 and.w r5, r6, r5, ror #11
800023c: ea06 25f5 and.w r5, r6, r5, ror #11
8000240: 3901 subs r1, #1
8000242: d1ed bne.n 8000220 <loop2>
8000244: 6803 ldr r3, [r0, #0]
8000246: 1ad0 subs r0, r2, r3
8000248: e8bd 4070 ldmia.w sp!, {r4, r5, r6, lr}
800024c: 4770 bx lr
0000B001
0000B001
00013FFE
00013FFE
your test is all thumb2 extensions (well three register and with rotation no doubt). aligned.
800021c: b570 push {r4, r5, r6, lr}
800021e: 6802 ldr r2, [r0, #0]
08000220 <loop2>:
8000220: ea04 24f5 and.w r4, r4, r5, ror #11
8000224: ea04 24f5 and.w r4, r4, r5, ror #11
8000228: ea04 2efe and.w lr, r4, lr, ror #11
800022c: ea04 2efe and.w lr, r4, lr, ror #11
8000230: ea06 26fe and.w r6, r6, lr, ror #11
8000234: ea06 26fe and.w r6, r6, lr, ror #11
8000238: ea06 25f5 and.w r5, r6, r5, ror #11
800023c: ea06 25f5 and.w r5, r6, r5, ror #11
8000240: 3901 subs r1, #1
8000242: d1ed bne.n 8000220 <loop2>
8000244: 6803 ldr r3, [r0, #0]
8000246: 1ad0 subs r0, r2, r3
8000248: e8bd 4070 ldmia.w sp!, {r4, r5, r6, lr}
800024c: 4770 bx lr
800024e: bf00 nop
0000C001
0000C001
00015FFD
00015FFD
Unaligned, so we do not see an extra fetch (assuming that is what it is) per instruction, just one for the whole loop, which further reinforces that it is an extra fetch due to alignment.
8000220: b570 push {r4, r5, r6, lr}
8000222: 6802 ldr r2, [r0, #0]
08000224 <loop2>:
8000224: ea04 24f5 and.w r4, r4, r5, ror #11
8000228: ea04 24f5 and.w r4, r4, r5, ror #11
800022c: ea04 2efe and.w lr, r4, lr, ror #11
8000230: ea04 2efe and.w lr, r4, lr, ror #11
8000234: ea06 26fe and.w r6, r6, lr, ror #11
8000238: ea06 26fe and.w r6, r6, lr, ror #11
800023c: ea06 25f5 and.w r5, r6, r5, ror #11
8000240: ea06 25f5 and.w r5, r6, r5, ror #11
8000244: 3901 subs r1, #1
8000246: d1ed bne.n 8000224 <loop2>
8000248: 6803 ldr r3, [r0, #0]
800024a: 1ad0 subs r0, r2, r3
800024c: e8bd 4070 ldmia.w sp!, {r4, r5, r6, lr}
8000250: 4770 bx lr
8000252: bf00 nop
0000B001
0000B001
00013FFE
00013FFE
I went ahead and moved it one more halfword, down to only 1-word alignment instead of 8-word. Maybe the ART would be affected, but I was not expecting SRAM to change. Neither was affected. (On bigger processors, like the full-sized ARMs, this would have a different result, as the fetches are like 4 or 8 words at a time and you have a lot of alignment-sensitive plus branch-prediction-sensitive spots that cause multiple different performance numbers for the same machine code.)
You had some loads and stores, and unless I read the code wrong you had an array of 16 words that you did not initialize, yet used. This is not floating point nor multiply/divide, so do not expect any clock savings based on data content. I guess you were not exceeding the stack/this array, as I might have mentioned at the top of this answer...
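A trivially hedged fix on the question's side (my addition, not from either post): give x[] defined contents before the first call to f(); for pure integer and/rotate code the values should not affect timing, it is just hygiene.
for (int i = 0; i < 16; i++)   /* in main(), before the benchmarking loop */
    x[i] = 0;                  /* any known pattern will do */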
8000318: b430 push {r4, r5}
800031a: f04f 5500 mov.w r5, #536870912 ; 0x20000000
800031e: 6802 ldr r2, [r0, #0]
08000320 <loop3>:
8000320: 686c ldr r4, [r5, #4]
8000322: 606c str r4, [r5, #4]
8000324: 3901 subs r1, #1
8000326: d1fb bne.n 8000320 <loop3>
8000328: 6803 ldr r3, [r0, #0]
800032a: 1ad0 subs r0, r2, r3
800032c: bc30 pop {r4, r5}
800032e: 4770 bx lr
00005001
00005001
00008FFE
00008FFE
Nice and pretty and aligned. This is our baseline.
8000318: b430 push {r4, r5}
800031a: f04f 5500 mov.w r5, #536870912 ; 0x20000000
800031e: 6802 ldr r2, [r0, #0]
08000320 <loop3>:
8000320: f8d5 4005 ldr.w r4, [r5, #5]
8000324: f8c5 4005 str.w r4, [r5, #5]
8000328: 3901 subs r1, #1
800032a: d1f9 bne.n 8000320 <loop3>
800032c: 6803 ldr r3, [r0, #0]
800032e: 1ad0 subs r0, r2, r3
8000330: bc30 pop {r4, r5}
8000332: 4770 bx lr
0000A001
0000A001
0000FFFF
0000FFFF
Unaligned by one byte (if the core even supports it, if the trap is disabled, etc., etc.). Takes longer, as expected. Considerably longer; you can start to feel how long the SRAM cycles are from these tests.
8000318: b430 push {r4, r5}
800031a: f04f 5500 mov.w r5, #536870912 ; 0x20000000
800031e: 6802 ldr r2, [r0, #0]
08000320 <loop3>:
8000320: f8d5 4006 ldr.w r4, [r5, #6]
8000324: f8c5 4006 str.w r4, [r5, #6]
8000328: 3901 subs r1, #1
800032a: d1f9 bne.n 8000320 <loop3>
800032c: 6803 ldr r3, [r0, #0]
800032e: 1ad0 subs r0, r2, r3
8000330: bc30 pop {r4, r5}
8000332: 4770 bx lr
00008001
00008001
0000DFFF
0000DFFF
halfword aligned but not word aligned, word cycles. This is very interesting, this is not expected. Have to check the documentation.
The processor provides three primary bus interfaces implementing a variant of the AMBA 3 AHB-Lite protocol
Data and debug accesses to Code memory space, 0x00000000 to 0x1FFFFFFF , are performed over the 32-bit AHB-Lite bus.
So it is 32 bits wide from the ARM side, but the chip vendor can do whatever they want, so maybe their SRAM is built from 16-bit-wide blocks, who knows.
8000318: b430 push {r4, r5}
800031a: f04f 5500 mov.w r5, #536870912 ; 0x20000000
800031e: 6802 ldr r2, [r0, #0]
08000320 <loop3>:
8000320: f8d5 4007 ldr.w r4, [r5, #7]
8000324: f8c5 4007 str.w r4, [r5, #7]
8000328: 3901 subs r1, #1
800032a: d1f9 bne.n 8000320 <loop3>
800032c: 6803 ldr r3, [r0, #0]
800032e: 1ad0 subs r0, r2, r3
8000330: bc30 pop {r4, r5}
8000332: 4770 bx lr
0000A001
0000A001
0000FFFF
0000FFFF
Now as expected, this alignment is also much worse than being properly aligned.
MACRO r1, r2, r3, 48 aligned
MACRO r4, r5, r6, 52 unaligned
MACRO r7, r8, r9, 56 unaligned
MACRO r10, r11, r12, 60 aligned
These unaligned accesses are going to create extra clocks, among possible other things.
8000318: b430 push {r4, r5}
800031a: f04f 5500 mov.w r5, #536870912 ; 0x20000000
800031e: 6802 ldr r2, [r0, #0]
08000320 <loop3>:
8000320: f3af 8000 nop.w
8000324: f3af 8000 nop.w
8000328: 3901 subs r1, #1
800032a: d1f9 bne.n 8000320 <loop3>
800032c: 6803 ldr r3, [r0, #0]
800032e: 1ad0 subs r0, r2, r3
8000330: bc30 pop {r4, r5}
8000332: 4770 bx lr
00005000
00005000
00007FFF
00007FFF
vs
00005001
00005001
00008FFE
00008FFE
NOPs instead of ldr/str. Not necessarily helping us with the measurement of the ldr/str instructions, but I do not see them being a fixed 2 clocks each all of the time.
Now, obviously, compiled code is going to take advantage of the 16-bit thumb instructions when it can, creating a mixture of thumb and thumb2, ideally mostly thumb. So it will be, or can be, fewer fetches for the same number of instructions. Unrolling of course saves you the number of loop passes times some number of clocks, so you will save those extra loop-branching clocks. (Oh, right: I tried BPIALL and saw no effect; I think on the Cortex-M7 you can mess with branch prediction, if there even is any in the M4 or M3. You can definitely see it on the full-sized ARMs and other processors, where, combined with alignment, it again multiplies the number of different performance measurements you can get for the same machine code. Net result: benchmarks are BS, and you have not been able to count instructions and figure out clocks for the last couple of decades or so.) Linear code with no branches, even with extra instructions, is often going to be the fastest.
I am not going to completely repeat your experiment as written. I think I have provided some info to chew on, and I certainly think your ldr/str timing is wrong: I do not believe it is 2 clocks per instruction in all cases. (You are also pushing/popping your loop counter to/from memory, possibly causing an extra uncounted clock or few per loop.) I also think that the ART is on and cannot be turned off, so you are getting some slow flash plus their prefetch/cache thing feeding the core, which makes measurements like this that much more difficult to control and understand. TI and NXP may have purchased different revisions of the M4 (I have not looked in a while to see if ARM even released more than one), and there are always vendor customizations; I do remember that the TI parts do not have a magic flash cache like ST. They may have an actual data cache implemented, which makes the above even more fun, again multiplying the performance measurements for the same machine code. But you may get a feel for what M4s in a different system do compared to your expectations.
I think part of the problem is the expectations, and that in part comes down to the fact that for many platforms we have not been able to count clocks from instructions for decades; the system itself plays a big role in performance over and above the processor. MCUs are cheap and fast enough, and not necessarily high-performance machines (not that our desktops are either); the nature of the modern buses, which are very much not one cycle per anything, combined with a pipeline, means fetching alone often creates unmeasurable chaos. Before others chime in, I will agree that so far on these Cortex-M platforms, a specific binary build, without things like interrupts getting in the way, performs consistently if you do not change any variables. But you can recompile that program with what appears to be a change that has nothing to do with anything (it could be in a file not even related to the code it affects) and see a dramatic performance difference with the next build.
Unaligned ldr/strs alone can easily account for the 200 clock count difference.
Bottom line: the processor is only part of the system, and we are not (completely) processor bound, so its timing alone does not determine performance (you can no longer use or rely on the instruction timing documentation). I think as a result there are some expectation issues, and there are some extra clocks sneaking in here and there: one or two digits percent worth of performance expectation coming from system issues rather than processor issues.
The C compiler, using a mix of thumb and thumb2 extensions, may or may not execute faster even with the same number of instructions, but you do have fewer fetches to bury in the pipe or stall the pipe, compared to forcing one 32-bit instruction per fetch.
Based on your comment, using SYSCFG_MEMRMP (thanks for educating me on this register).
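For anyone else wanting to reproduce that test: a hedged sketch of the remap using ST HAL macro names (the code itself is mine, not from the comment being referenced); SYSCFG needs its clock enabled first, and code copied into SRAM then becomes fetchable at address 0 over the I-bus instead of the system bus:
#include "stm32f4xx_hal.h"

static void remap_sram_to_zero(void)
{
    __HAL_RCC_SYSCFG_CLK_ENABLE();     /* SYSCFG registers need their clock      */
    __HAL_SYSCFG_REMAPMEMORY_SRAM();   /* SYSCFG_MEMRMP: MEM_MODE = embedded SRAM */
    __DSB();
    __ISB();                           /* ensure subsequent fetches see the remap */
}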
A particular test
00003001 flash
00003001 flash
00004001 sram
00004001 sram
00003001 sram through icode bus
00003001 sram through icode bus
So it works, and thanks for the info. I won't go through this whole answer again, but it is good to know for the future.
Upvotes: 5
Reputation: 68013
For maximum performance you need to treat your Cortex micro as the Harvard architecture machine it is: keep code memory and data memory on separate buses so they are accessed without racing each other. That way you will get the maximum performance.
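A hedged sketch of one way to apply this with the ST toolchain quoted earlier (it assumes the linker script provides a .ccmram output section placed in the CCMRAM region from the MEMORY block shown in the accepted answer; GCC attribute syntax):
/* Keep the hot working set in the CCM data RAM, which hangs directly off the
   core's D-bus, while instructions keep coming from flash over the I-bus, so
   the two streams never compete in the bus matrix. Note that CCM cannot be
   reached by DMA, and the startup code must initialize this section if the
   data is not meant to start out undefined. */
static uint32_t work[16] __attribute__((section(".ccmram")));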
BTW, address 0 is just remapped from the memory chosen to boot from. It does not have a specific bus "connected" to it.
Upvotes: 0