Reputation: 7963
Often we hear that aligning our variables to an N-byte boundary in memory can improve performance (by preventing the CPU from having to load two separate words into the cache to read our variable).
On the other hand, we also hear (less often) that aligning a large block of memory (an array or buffer) to a nice, round power-of-two address can be bad, because the mapping of memory addresses to cache sets is no longer uniform (this is sometimes called page alignment).
Therefore, my question is: is there some rule or threshold for when we should deliberately misalign data to avoid the page-alignment problem, and when not to, so as to keep the benefits of standard memory alignment?
Upvotes: 0
Views: 1259
Reputation: 3350
If performance is particularly critical to your application, and your application typically iterates over known datasets (known in type and size), then it is important to know and understand the effects of MMUs, CPU caches, and cache lines. Not because you can really avoid those problems ahead of time, but because you may need to identify them after the fact, while staring at profiling results and trying to puzzle out why something took far longer than it used to or "normally should." And, if you're lucky and the dataset is sufficiently within your control, you can then tweak things to resolve that sort of CPU cache performance issue.
Unfortunately, most applications don't have the luxury of iterating over known datasets or of knowing their target hardware. That is fairly exclusive to game and multimedia development, and to operating systems engineering. For most of the rest of the world's applications, improving the cache profile for one particular dataset of one particular size means worsening it for another.
Finally, even the 'rule of thumb' about "aligning our variables to an N-byte boundary" depends on the underlying hardware. Most newer desktop-grade x86 architectures (roughly those made after 2011) prefer packed data over aligned data, because the cost of loading a word that straddles a cache-line boundary has become cheaper than the cost of loading the extra cache lines needed to represent the same dataset. But on a mobile device running ARM? Alignment is still pretty critical.
More keywords for you to search on, for further education: cache coloring and cache eviction. But again, this is all very dependent on target CPUs and there are unfortunately few (or no) generalizations to be had.
Upvotes: 3
Reputation: 44274
I don't think you can get a general rule for this. It depends on the processor you are using, i.e. on the MMU and cache implementation of the underlying system, and that will differ from system to system. So if you want top performance, you'll need to understand all the low-level details of your particular system. In general, I would expect the benefit of aligning large memory blocks to a power-of-two boundary to be limited.
Upvotes: 2