SSE SIMD Segmentation Fault when using resulting float

Question

I'm trying to use Intel intrinsics to perform an operation quickly on a float array. The operations themselves seem to work fine; however, when I try to get the result of the operation into a standard C variable, I get a SEGFAULT. If I comment out the line indicated below, the program runs. If I save the result of that line but do not manipulate it in any way, the program also runs fine. It is only when I try to interact (in any way) with the result of _mm_cvtss_f32(C) that my program crashes. Any ideas?

float proc(float *a, float *b, int n, int c, int width) {
    // Operation: SUM: (A - B) ^ 2
    __m128 A, B, C;
    float total = 0;
    for (int d = 0, k = 0; k < c; d += width, k++) {
        for (int i = 0; i < n / 4 * 4; i += 4) {
            A = _mm_load_ps(&a[i + d]);
            B = _mm_load_ps(&b[i + d]);
            C = _mm_sub_ps(A, B);
            C = _mm_mul_ps(C, C);
            C = _mm_hadd_ps(C, C);
            C = _mm_hadd_ps(C, C);
            total += _mm_cvtss_f32(C); // SEGFAULT HERE
        }
        for (int i = n / 4 * 4; i < n; i++) {
            float diff = a[i + d] - b[i + d];
            total += diff * diff;
        }
    }
    return total;
}

Jason R · Accepted Answer

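Are you sure your program actually crashes at the line you cited? Without the _mm_cvtss_f32() call, nothing in the SSE loop has a visible side effect, so the compiler may simply be optimizing the rest of the loop away, loads included. That would explain why the crash only appears once you use the result.

A potential cause of the failure is improper alignment of the a and b arrays: _mm_load_ps is an aligned load, and it can fault if its address is not on a 16-byte boundary. Are you sure they are 16-byte aligned? Note that the address actually loaded is &a[i + d], so even if a itself is aligned, d (and therefore width) must be a multiple of 4 floats. If you can't guarantee that, use _mm_loadu_ps instead. On contemporary Intel hardware, there is very little performance difference between 16-byte aligned and unaligned loads (see the comments on the question above for a discussion of the issue).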
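To make the alignment point concrete, here is a minimal sketch, not the asker's code. The helper names is_aligned_16 and sum_sq_diff are hypothetical, and the sketch assumes an x86 target where SSE is available. It shows one way to test a pointer for 16-byte alignment, and a sum-of-squared-differences loop that uses _mm_loadu_ps so that any float pointer is acceptable. It also keeps a vector accumulator and reduces once at the end instead of calling _mm_hadd_ps on every iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE: _mm_loadu_ps, _mm_sub_ps, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper: returns 1 if p lies on a 16-byte boundary,
   which is what _mm_load_ps expects of its argument. */
static int is_aligned_16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: sum of (a[i] - b[i])^2 over n floats.
   Uses unaligned loads, so a and b need no particular alignment. */
static float sum_sq_diff(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);

    __m128 acc = _mm_setzero_ps(); /* four partial sums, one per lane */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }

    /* Horizontal reduction, done once after the loop. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the remaining n % 4 elements, kept in float. */
    for (; i < n; i++) {
        float d = a[i] - b[i];
        total += d * d;
    }
    return total;
}
```

In the question's function, a check such as is_aligned_16(&a[d]) at the top of the outer loop would show quickly whether the aligned loads are the problem for a given width.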

I mentioned in my original comment that movaps has a shorter encoding than movups. That is not correct. I was thinking instead of movaps versus movapd: both perform the same memory transfer and differ only in being labeled for single-precision and double-precision data, respectively. In practice they do the same thing, but movaps has the shorter encoding, since movapd carries an extra 0x66 prefix byte.
