Reputation: 3072
I'm starting to learn a little bit about SIMD intrinsics. I noticed that for some functions there is an aligned and an unaligned version, for example _mm_store_si128 and _mm_storeu_si128. My question is: do these functions perform differently, and if not, why are there two different versions?
Upvotes: 3
Views: 2142
Reputation: 49289
I'd say "always align (wherever possible)", this way you are covered no matter what. Some platforms do not support unaligned access, others will have substantial performance degradation. If you go for aligned access you will have optimal performance in any case. There might be a small cost of memory on some platforms, but it is well worth it, because if you go SIMD that means you go for performance. I can think of no reason why one should implement unaligned code path. Maybe if you have to deal with some old design, which wasn't built with SIDM in mind, but I'd say the odds of that are slim to none.
I'd say the same applies to scalars as well: proper alignment is proper in any case, and it saves you some trouble when you are chasing optimal performance.
As for why unaligned access might be slower or even unsupported: it comes down to how the hardware works. Say you have a 64-bit integer and a 64-bit memory controller. If the integer is properly aligned, the memory controller can fetch it in a single operation. If it is offset, the controller has to perform two accesses, and the CPU may also need to shift the pieces around to compose the value. Since that is suboptimal, some platforms don't support it at all, as a means of enforcing efficiency.
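As a minimal sketch of the "align up front" approach, assuming C11 and SSE2 (the function name fill42 is just for illustration): alignas covers automatic storage and _mm_malloc/_mm_free cover the heap, so the aligned store intrinsic can always be used.

```c
#include <emmintrin.h>   /* SSE2: __m128i, _mm_set1_epi32, _mm_store_si128 */
#include <stdalign.h>    /* C11 alignas */

void fill42(void) {
    alignas(16) int stack_buf[8];                     /* 16-byte aligned automatic storage */
    int *heap_buf = _mm_malloc(8 * sizeof(int), 16);  /* 16-byte aligned heap storage */
    if (!heap_buf)
        return;

    __m128i v = _mm_set1_epi32(42);
    for (int i = 0; i < 8; i += 4) {
        /* aligned stores: each one lands in a single naturally-aligned 16-byte chunk */
        _mm_store_si128((__m128i *)(stack_buf + i), v);
        _mm_store_si128((__m128i *)(heap_buf + i), v);
    }
    _mm_free(heap_buf);
}
```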
Upvotes: 2
Reputation: 364180
If the data is in fact aligned, an unaligned load/store performs identically to an aligned one.
unaligned ops: Unaligned data will cause a small performance hit, but your program still works.
aligned ops: Unaligned data will cause a fault, letting you detect accidentally-unaligned data instead of silently causing a performance hit.
Modern CPUs have very good support for unaligned loads, but there's still a significant performance hit when a load crosses a cache-line boundary.
When using SSE, aligned loads can be folded into other operations as a memory operand. This improves code size and throughput slightly.
When using AVX, both kinds of loads can be folded into other operations (AVX's default behaviour is to allow unaligned memory operands). If aligned loads don't get folded and produce a movdqa or movaps, they will still fault on unaligned addresses. This applies even to the VEX encoding of 128-bit ops, which you get with the right compile options and no source changes to code using 128-bit intrinsics.
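As a rough illustration of the folding point for the non-AVX case (compiled with something like gcc -O2 -msse2; the function names and the asm in the comments are only what a compiler will typically emit, not guaranteed output):

```c
#include <emmintrin.h>

__m128i add_aligned(const __m128i *p, __m128i v) {
    /* the aligned load can be folded into the add, e.g.  paddd xmm0, [rdi] */
    return _mm_add_epi32(v, _mm_load_si128(p));
}

__m128i add_unaligned(const __m128i *p, __m128i v) {
    /* without AVX, the unaligned load usually stays separate:
       movdqu xmm1, [rdi]  then  paddd xmm0, xmm1 */
    return _mm_add_epi32(v, _mm_loadu_si128(p));
}
```

With AVX enabled, both versions can compile to a single vpaddd with a memory operand, as described above.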
For getting started with intrinsics, I'd suggest always using the unaligned load/store intrinsics (but try to have your data aligned, at least in the common case). Use the aligned versions when performance tuning, if you're worried that unaligned data is causing a problem.
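A minimal sketch of that suggestion, assuming SSE2 (add_one is a hypothetical helper): the loop uses the unaligned intrinsics throughout, so it works for any pointers, and it runs at full speed whenever the caller happens to pass 16-byte-aligned buffers.

```c
#include <emmintrin.h>
#include <stddef.h>

/* dst[i] = src[i] + 1 for n ints; works for any alignment of dst/src */
void add_one(int *dst, const int *src, size_t n) {
    __m128i ones = _mm_set1_epi32(1);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* loadu/storeu accept any address; if src/dst are 16-byte aligned,
           these cost the same as the aligned forms on modern CPUs */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(v, ones));
    }
    for (; i < n; ++i)   /* scalar tail for the last n % 4 elements */
        dst[i] = src[i] + 1;
}
```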
Upvotes: 2
Reputation: 212969
On older CPUs there is a substantial performance difference between aligned and unaligned loads/stores. On more recent CPUs the difference is much less significant, but as a "rule of thumb" you should still prefer the aligned version wherever possible.
Upvotes: 2