SSE Intrinsics arithmetic error

Question

I've been experimenting with SSE intrinsics and I seem to have run into a weird bug that I can't figure out. I am computing the inner product of two float arrays, 4 elements at a time.

For testing I've set each element of both arrays to 1, so the product should be == size.

It runs correctly, but whenever I run the code with size > ~68000000 the code using the sse intrinsics starts computing the wrong inner product. It seems to get stuck at a certain sum and never exceeds this number. Here is an example run:

joe:~$./test_sse 70000000
sequential inner product: 70000000.000000
sse        inner product: 67108864.000000
sequential          time: 0.417932
sse                 time: 0.274255

Compilation:

gcc -fopenmp test_sse.c -o test_sse -std=c99

This error seems to be consistent amongst the handful of computers I've tested it on. Here is the code, perhaps someone might be able to help me figure out what is going on:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>
#include <omp.h>

#include <xmmintrin.h>

double inner_product_sequential(float * a, float * b, unsigned int size) {

  double sum = 0;

  for(unsigned int i = 0; i < size; i++) {
    sum += a[i] * b[i];
  }

  return sum;

}

double inner_product_sse(float * a, float * b, unsigned int size) {

  assert(size % 4 == 0);

  __m128 X, Y, Z;

  Z = _mm_set1_ps(0.0f);

  float arr[4] __attribute__((aligned(sizeof(float) * 4)));

  for(unsigned int i = 0; i < size; i += 4) {

    X = _mm_load_ps(a+i);
    Y = _mm_load_ps(b+i);

    X = _mm_mul_ps(X, Y);
    Z = _mm_add_ps(X, Z);

  }

  _mm_store_ps(arr, Z);

  return arr[0] + arr[1] + arr[2] + arr[3];

}

int main(int argc, char ** argv) {

  if(argc < 2) {
    fprintf(stderr, "usage: ./test_sse <size>\n");
    exit(EXIT_FAILURE);
  }

  unsigned int size = atoi(argv[1]);

  srand(time(0));

  float *a = (float *) _mm_malloc(size * sizeof(float), sizeof(float) * 4);
  float *b = (float *) _mm_malloc(size * sizeof(float), sizeof(float) * 4);

  for(int i = 0; i < size; i++) {
    a[i] = b[i] = 1;
  }



  double start, time_seq, time_sse;


  start = omp_get_wtime();

  double inner_seq = inner_product_sequential(a, b, size);

  time_seq = omp_get_wtime() - start;


  start = omp_get_wtime();

  double inner_sse = inner_product_sse(a, b, size);

  time_sse = omp_get_wtime() - start;


  printf("sequential inner product: %f\n", inner_seq);
  printf("sse        inner product: %f\n", inner_sse);
  printf("sequential          time: %f\n", time_seq);
  printf("sse                 time: %f\n", time_sse);




  _mm_free(a);
  _mm_free(b);
}

Christoph Freundl · Accepted Answer

You are running into the precision limit of single-precision floating-point numbers. The number 16777216 (2^24), which is the value of each component of the vector Z when it reaches the "limit" inner product, is represented in 32-bit floating point as hexadecimal 0x4b800000 or binary 0 10010111 00000000000000000000000, i.e. the 23-bit mantissa is all zeros (with an implicit leading 1 bit) and the 8-bit exponent field is 151, representing the exponent 151 - 127 = 24. Adding 1 to that value would need a 25th significant bit, but the mantissa provides only 24 (23 stored plus the implicit one), so the added one cannot be represented any longer and the result is rounded back: in single-precision arithmetic 2^24 + 1 = 2^24. With all four components stuck at 2^24, the final sum is 4 * 16777216 = 67108864, which is exactly the value in your output.

You do not see this in your sequential function because there you are using a 64-bit double-precision variable to store the result, and since we are working on an x86 platform, an 80-bit excess-precision register is most probably used internally.

You can force single precision throughout your sequential code by rewriting it as

float sum;

float inner_product_sequential(float * a, float * b, unsigned int size) {
  sum = 0;

  for(unsigned int i = 0; i < size; i++) {
    sum += a[i] * b[i];
  }

  return sum;
}

and you will see 16777216.000000 as the maximum computed value.
