Reputation: 2718
I am trying to understand what physically happens on the data bus when an STM32H7 (Cortex-M7) executes an LDRB instruction (assuming the caches are disabled, to keep things simple). Is there a 32-bit access to the memory, with 3 of the 4 bytes discarded? Does it depend on the type of memory? If the code does four LDRB on consecutive addresses, how does that compare (in number of cycles) to a single 32-bit LDR?
Upvotes: 2
Views: 639
Reputation: 1789
Cortex-M7 has a 64-bit AMBA4 AXI interface.
This is only part of the answer since this data bus will connect to a memory of the STM32H7 somewhere, but we can assume that memory has an interface that is at least as wide as the bus. The memory controller will most likely read the full width from the memory (but maybe not at core frequency).
The read data will be returned on the bus, occupying the read channel for however many cycles the handshake takes. For a byte read, the data returned should be a byte.
Performing four separate byte reads might avoid repeated accesses to the memory itself, but it still keeps the bus busy for four transfers. The bus can support multiple outstanding transfers (limited by the chip design, not the processor). Architecturally, the processor is permitted to merge the transfers (but this would naturally be done by the cache, which you have disabled).
To a first-order approximation, you can load eight 32-bit registers in the same number of cycles as performing four byte reads, since the AXI is 64 bits wide. In practice it can be faster still, because you can use a single LDM instruction rather than four LDRB, and instruction fetches share the same bus.
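The arithmetic behind that approximation can be written out as a rough model. This is only a sketch: the function names are made up for illustration, and it assumes that each unmerged byte load costs one bus transfer and that an LDM is aligned and fully packs every 64-bit beat, neither of which is guaranteed on real hardware.

```c
#include <stdint.h>

/* Bus width in bits; 64 matches the AXI width mentioned above. */
#define BUS_WIDTH_BITS 64u

/* Assumption: byte loads are not merged, so each one is its own transfer. */
static uint32_t beats_for_byte_loads(uint32_t n_loads)
{
    return n_loads;
}

/* Assumption: the LDM is aligned and every beat carries a full bus width,
 * so the count is the total bits rounded up to whole beats. */
static uint32_t beats_for_ldm(uint32_t n_regs)
{
    uint32_t bits = n_regs * 32u;
    return (bits + BUS_WIDTH_BITS - 1u) / BUS_WIDTH_BITS;
}
```

Under those assumptions, `beats_for_byte_loads(4)` and `beats_for_ldm(8)` both come out at 4, which is where the "eight registers for the price of four byte reads" figure comes from.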
It should be noted that stores are potentially more complex because it is harder to build the logic to ignore partial write data, and fairly easy to merge writes.
(This is a 'generic' answer rather than a reflection of the M7 micro-architecture; you need to do your own benchmarking to understand the detailed implications of your question.)
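As a starting point for that benchmarking, the two access patterns could be isolated like this. It is a minimal sketch with hypothetical function names, assuming a little-endian target and a 4-byte-aligned word pointer; the `volatile` qualifiers are there so the compiler does not merge or drop the loads.

```c
#include <stdint.h>

/* Four separate byte loads, reassembled in little-endian order.
 * volatile keeps the compiler from fusing them into one word load. */
static uint32_t read_as_bytes(const volatile uint8_t *p)
{
    uint32_t b0 = p[0];
    uint32_t b1 = p[1];
    uint32_t b2 = p[2];
    uint32_t b3 = p[3];
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
}

/* One 32-bit load; p is assumed to be 4-byte aligned. */
static uint32_t read_as_word(const volatile uint32_t *p)
{
    return *p;
}
```

On the target, one would call each function in a loop bracketed by reads of a cycle counter (for instance the DWT cycle counter, if it is enabled), and check the disassembly to confirm that LDRB and LDR are really what the compiler emitted.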
Upvotes: 2