Reputation: 142
I'm trying to use Intel Intrinsics to perform an operation quickly on a float
array. The operations themselves seem to work fine; however, when I try to get the result of the operation into a standard C variable I get a SEGFAULT. If I comment the indicated line below out, the program runs. If I save the result of the indicated line, but do not manipulate it in any way, the program runs fine. It is only when I try to (in any way) interact with the result of _mm_cvtss_f32(C)
that my program crashes. Any ideas?
float proc(float *a, float *b, int n, int c, int width) {
// Operation: SUM: (A - B) ^ 2
__m128 A, B, C;
float total = 0;
for (int d = 0, k = 0; k < c; d += width, k++) {
for (int i = 0; i < n / 4 * 4; i += 4) {
A = _mm_load_ps(&a[i + d]);
B = _mm_load_ps(&b[i + d]);
C = _mm_sub_ps(A, B);
C = _mm_mul_ps(C, C);
C = _mm_hadd_ps(C, C);
C = _mm_hadd_ps(C, C);
total += _mm_cvtss_f32(C); // SEGFAULT HERE
}
for (int i = n / 4 * 4; i < n; i++) {
int diff = a[i + d] - b[i + d];
total += diff * diff;
}
}
return total;
}
Upvotes: 0
Views: 432
Reputation: 11758
Are you sure your program actually crashes at the instruction you cited, or is the compiler just optimizing the rest of the loop away if you remove the _mm_cvtss_f32() line (it doesn't have any other visible side effects)? Potential failure causes would be improper alignment of the a and b arrays since you are using aligned load instructions. Are you sure they are 16-byte aligned? On contemporary Intel hardware, there is very little performance difference between 16-byte aligned and unaligned loads (see the comments on the question above for a discussion of the issue).
I mentioned in my original comment that movaps
has a shorter encoding than movups
. This is not correct. I was thinking instead of movaps
versus movapd
, which do the same memory transfer, only they're labeled as being for single-precision and double-precision data, respectively. In practice, they do the same thing, but movaps
has a shorter encoding.
Upvotes: 1