Why is an ARM NEON SIMD sum slower than a serial sum?

Question

I have been benchmarking memory bandwidth on my M2 Mac, and I noticed that when I use ARM NEON SIMD intrinsics, the sum takes longer and reaches a lower memory bandwidth than a plain loop. This is my code:

int arm_neon_simd_sum(const std::vector<int8_t>& arr) {
    // Check if the array is empty
    if (arr.empty()) return 0;

    int32x4_t v_sum = vdupq_n_s32(0); // Initialize NEON vector to hold partial sums
    size_t i = 0;

    // Process the array in chunks of 16 elements
    for (; i + 15 < arr.size(); i += 16) {
        int8x16_t v_data = vld1q_s8(&arr[i]); // Load 16 elements into a NEON register

        // Convert int8x16_t to int16x8_t
        int16x8_t v_data_low = vmovl_s8(vget_low_s8(v_data)); // Lower 8 elements to int16
        int16x8_t v_data_high = vmovl_s8(vget_high_s8(v_data)); // Upper 8 elements to int16

        // Convert int16x8_t to int32x4_t and accumulate
        v_sum = vaddq_s32(v_sum, vpaddlq_s16(v_data_low));
        v_sum = vaddq_s32(v_sum, vpaddlq_s16(v_data_high));
    }

    // Horizontal add the vector to get the sum of all elements
    int32_t sum_array[4];
    vst1q_s32(sum_array, v_sum);

    int sum = sum_array[0] + sum_array[1] + sum_array[2] + sum_array[3];

    // Handle the remaining elements
    for (; i < arr.size(); ++i) {
        sum += arr[i];
    }

    return sum;
}

Here are my test results:

2024-07-12 21:31:51.101 -------- running arm_neon_simd_sum --------
....................................................................................................
2024-07-12 21:32:03.808 Result: 536870900
2024-07-12 21:32:03.808 Total execution time: 12.705522168999996 seconds
2024-07-12 21:32:03.808 Total bytes processed: 107374182400
2024-07-12 21:32:03.808 Throughput: 7.870593484460515 GB/s

In comparison, a naive serial sum is faster:

Code:

int sum_sequential(const std::vector<int8_t> &arr) {
  int sum = 0;
  for (int64_t i = 0; i < kArraySize; i++) {
    sum += arr[i];
  }
  return sum;
}

Result:

2024-07-13 10:28:54.706 -------- running sequential sum --------
....................................................................................................
2024-07-12 21:31:02.750 Result: 536870900
2024-07-12 21:31:02.750 Total execution time: 3.303718871999998 seconds
2024-07-12 21:31:02.750 Total bytes processed: 107374182400
2024-07-12 21:31:02.750 Throughput: 30.268919322261286 GB/s
2024-07-12 21:31:02.750 -------- running random access sum -------

Why is that? Am I doing something wrong in my SIMD code that slows it down?

Note that this is with -O3, so I suspect the serial code is being auto-vectorized. My main concern is: how can I reach that level of performance with my own code?
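
One direction that may be worth trying, offered as a sketch under assumptions rather than a verified fix: the loop above funnels every addition through the single accumulator v_sum and spends several instructions per 16 bytes on widening. The variant below uses vpaddlq_s8 to widen and pairwise-add in one step, vpadalq_s16 to accumulate into 32-bit lanes, and four independent accumulators so consecutive iterations do not wait on each other. The function name neon_sum_unrolled, the unroll factor of 4, and the int8_t element type are assumptions for illustration; whether this matches the auto-vectorized loop on an M2 has not been measured here. A scalar fallback is kept so the code also compiles on non-AArch64 targets.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of the original benchmark): sums int8_t values
// using four independent NEON accumulators, 64 bytes per iteration.
int neon_sum_unrolled(const std::vector<int8_t>& arr) {
    const int8_t* p = arr.data();
    const std::size_t n = arr.size();
    std::size_t i = 0;
    int sum = 0;

#if defined(__ARM_NEON) && defined(__aarch64__)
    int32x4_t acc0 = vdupq_n_s32(0);
    int32x4_t acc1 = vdupq_n_s32(0);
    int32x4_t acc2 = vdupq_n_s32(0);
    int32x4_t acc3 = vdupq_n_s32(0);

    for (; i + 64 <= n; i += 64) {
        // vpaddlq_s8: adds adjacent int8 pairs into 8 int16 lanes.
        // vpadalq_s16: adds adjacent int16 pairs and accumulates into int32 lanes.
        acc0 = vpadalq_s16(acc0, vpaddlq_s8(vld1q_s8(p + i)));
        acc1 = vpadalq_s16(acc1, vpaddlq_s8(vld1q_s8(p + i + 16)));
        acc2 = vpadalq_s16(acc2, vpaddlq_s8(vld1q_s8(p + i + 32)));
        acc3 = vpadalq_s16(acc3, vpaddlq_s8(vld1q_s8(p + i + 48)));
    }

    // Combine the four accumulators, then reduce the 4 lanes to a scalar.
    int32x4_t acc = vaddq_s32(vaddq_s32(acc0, acc1), vaddq_s32(acc2, acc3));
    sum = vaddvq_s32(acc);
#endif

    // Scalar tail (and the whole array when NEON is unavailable).
    for (; i < n; ++i) {
        sum += p[i];
    }
    return sum;
}
```

Comparing the result against sum_sequential on the same input is a quick sanity check before timing it. Like the original, this accumulates into 32-bit integers, so it has the same overflow behaviour for very large inputs.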

Full code can be found here: https://godbolt.org/z/rc5vT9nTM
