Reputation: 16035
I am trying to accelerate my code using SSE, and the following code works well.
Basically, a __m128 variable should point to 4 floats in a row, in order to do 4 operations at once.
This code is equivalent to computing c[i]=a[i]+b[i] with i from 0 to 3.
float *data1, *data2, *data3;
// ... code ... allocating data1-2-3 which are very long.
__m128* a = (__m128*) (data1);
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
However, when I shift the data I use by one element (see below), in order to compute c[i]=a[i+1]+b[i] with i from 0 to 3, it crashes at execution time.
__m128* a = (__m128*) (data1+1); // <-- +1
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
My guess is that it is related to the fact that __m128 is 128 bits and my float data are 32 bits. So, it may be impossible for a 128-bit pointer to point to an address that is not divisible by 128.
Anyway, do you know what the problem is and how I could work around it?
Upvotes: 1
Views: 256
Reputation: 212929
Instead of using implicit aligned loads/stores like this:
__m128* a = (__m128*) (data1+1); // <-- +1
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b);
use explicit aligned/unaligned loads/stores as appropriate, e.g.:
__m128 va = _mm_loadu_ps(data1+1); // <-- +1 (NB: use unaligned load)
__m128 vb = _mm_load_ps(data2);
__m128 vc = _mm_add_ps(va, vb);
_mm_store_ps(data3, vc);
Same amount of code (i.e. same number of instructions), but it won't crash, and you have explicit control over which loads/stores are aligned and which are unaligned.
Note that recent CPUs have relatively small penalties for unaligned loads, but on older CPUs there can be a 2x or greater hit.
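To apply the same idea to the whole array rather than just 4 elements, a loop along these lines might work. This is only a sketch: the function name add_shifted and the length parameter n are hypothetical, and it assumes that b and c are 16-byte aligned and that a holds at least n + 1 floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: c[i] = a[i + 1] + b[i] for i in [0, n).
   Assumes a has at least n + 1 elements and b, c are 16-byte aligned. */
static void add_shifted(const float *a, const float *b, float *c, size_t n)
{
    /* The aligned load/store below require 16-byte aligned b and c. */
    assert((uintptr_t)b % 16 == 0 && (uintptr_t)c % 16 == 0);

    size_t i = 0;
    /* Vector part: 4 floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i + 1);  /* shifted by one float: unaligned load */
        __m128 vb = _mm_load_ps(b + i);       /* aligned load */
        _mm_store_ps(c + i, _mm_add_ps(va, vb));
    }
    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i)
        c[i] = a[i + 1] + b[i];
}
```

Only the load from a needs the unaligned variant here; the accesses to b and c stay aligned as long as the arrays themselves start on a 16-byte boundary.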
Upvotes: 6
Reputation: 11028
Your problem here is that a ends up pointing to something that is not a __m128; it points to something that contains the last 96 bits of an __m128 and 32 bits outside, which can be anything. It may be the first 32 bits of the next __m128, but eventually, when you arrive at the last __m128 in the same memory block, it will be something else. Maybe reserved memory that you cannot access, hence the crash.
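The 32-bit offset described above can also be checked directly from the pointer value. Here is a minimal sketch, assuming the original buffer starts on a 16-byte boundary; the helper names is_aligned16 and load_ps_checked are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: nonzero if p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return (uintptr_t)p % 16 == 0;
}

/* Hypothetical wrapper: fails an assertion in debug builds
   instead of attempting an aligned load from a misaligned address. */
static __m128 load_ps_checked(const float *p)
{
    assert(is_aligned16(p));
    return _mm_load_ps(p);
}
```

If data1 is 16-byte aligned, is_aligned16(data1) and is_aligned16(data1 + 4) are nonzero, while is_aligned16(data1 + 1) is 0, since one float moves the address by only 4 bytes.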
Upvotes: 0