user12599964


Why is accessing non-naturally-aligned memory inefficient?

Let's assume we have a 64-bit CPU that always reads 8 bytes of memory at a time, and I want to store a 4-byte int. By the definition of natural alignment, a 4-byte object is aligned to an address that is a multiple of 4 (e.g. 0x0000, 0x0004). But here is my problem: why can't I store it at address 0x0001, for example? As I understand it, since the CPU always reads 8 bytes of data, a read from address 0x0000 would still fetch the int stored at 0x0001 in one go. So why is natural alignment needed in this case?

Upvotes: 1

Views: 160

Answers (1)

Aki Suihkonen

Reputation: 20027

Modern CPUs (Intel, Arm) will quite happily read from unaligned addresses. They are typically designed to read much more than 8 bytes per cycle, perhaps 16 or 32 bytes, and their deep pipelines manage quite nicely to extract the wanted 8 bytes from arbitrary addresses without any visible penalties.
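As a minimal sketch of what reading from an unaligned address can look like in portable C: the helper name `load_u32` is hypothetical, and the remark about code generation is a typical compiler behaviour rather than a guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from any address, aligned or not.
 * Dereferencing a misaligned uint32_t pointer is undefined behaviour in C,
 * so the bytes are copied with memcpy instead. Optimising compilers for
 * x86-64 and AArch64 typically turn this into a single ordinary load. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical counterpart: write a 32-bit value to any address. */
static void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

For example, calling `store_u32(buf + 1, x)` on a byte buffer and then `load_u32(buf + 1)` returns `x` again, even though `buf + 1` is not a multiple of 4.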

Often, but not always, algorithms can be written without much concern for the alignment of arrays (or of the start of each row of a 2-dimensional array).

A pipelined architecture may read aligned blocks of 16 bytes at a time, meaning that when 8 bytes are read from address 0x0009, the CPU actually has to read two 16-byte blocks, combine them, and extract the middle 8 bytes. Things become even more complicated when the data is not in the first-level cache and a full 64-byte cache line has to be fetched from the next cache level or from main memory.
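The block arithmetic can be sketched in a few lines of C. The function `blocks_touched` is a hypothetical illustration, and the block sizes passed to it are assumptions, since the real fetch width depends on the CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of aligned blocks of `block` bytes that an
 * access of `size` bytes starting at `addr` overlaps.
 * Assumes size >= 1 and block >= 1. */
static unsigned blocks_touched(uintptr_t addr, size_t size, size_t block)
{
    uintptr_t first = addr / block;              /* block holding the first byte */
    uintptr_t last  = (addr + size - 1) / block; /* block holding the last byte  */
    return (unsigned)(last - first + 1);
}
```

With 16-byte blocks, `blocks_touched(0x0009, 8, 16)` gives 2, which matches the case described here. With the 8-byte reads assumed in the question, a 4-byte int at 0x0001 touches one block, but the same int at 0x0006 touches two. A naturally aligned object whose size is a power of two no larger than the block can never straddle a block boundary, so it always takes a single read.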

In my experience (writing and optimising image processing algorithms for SIMD), many Arm64 implementations hide the cost of loading from unaligned addresses almost perfectly for algorithms with simple, linear memory access. Things get worse if the algorithm needs to read heavily from many unaligned addresses, such as when filtering with a kernel of 3x3 or larger, or when calculating high-radix FFTs. This suggests that the CPU's capacity for transferring memory and combining the blocks is soon exhausted.

Upvotes: 1
