Reputation: 7411
I have been benchmarking memory bandwidth on my M2 Mac, and I noticed that when I use ARM NEON SIMD intrinsics, processing is slower and the measured memory bandwidth is lower. This is my code:
int arm_neon_simd_sum(const std::vector<int8_t>& arr) {
    // Check if the array is empty
    if (arr.empty()) return 0;
    int32x4_t v_sum = vdupq_n_s32(0); // Initialize NEON vector to hold partial sums
    size_t i = 0;
    // Process the array in chunks of 16 elements
    for (; i + 15 < arr.size(); i += 16) {
        int8x16_t v_data = vld1q_s8(&arr[i]); // Load 16 elements into a NEON register
        // Convert int8x16_t to int16x8_t
        int16x8_t v_data_low = vmovl_s8(vget_low_s8(v_data)); // Lower 8 elements to int16
        int16x8_t v_data_high = vmovl_s8(vget_high_s8(v_data)); // Upper 8 elements to int16
        // Convert int16x8_t to int32x4_t and accumulate
        v_sum = vaddq_s32(v_sum, vpaddlq_s16(v_data_low));
        v_sum = vaddq_s32(v_sum, vpaddlq_s16(v_data_high));
    }
    // Horizontal add the vector to get the sum of all elements
    int32_t sum_array[4];
    vst1q_s32(sum_array, v_sum);
    int sum = sum_array[0] + sum_array[1] + sum_array[2] + sum_array[3];
    // Handle the remaining elements
    for (; i < arr.size(); ++i) {
        sum += arr[i];
    }
    return sum;
}
And here are my test results:
2024-07-12 21:31:51.101 -------- running arm_neon_simd_sum --------
....................................................................................................
2024-07-12 21:32:03.808 Result: 536870900
2024-07-12 21:32:03.808 Total execution time: 12.705522168999996 seconds
2024-07-12 21:32:03.808 Total bytes processed: 107374182400
2024-07-12 21:32:03.808 Throughput: 7.870593484460515 GB/s
In comparison, a naive serial sum is faster:
Code:
int sum_sequential(const std::vector<int8_t> &arr) {
    int sum = 0;
    for (int64_t i = 0; i < kArraySize; i++) {
        sum += arr[i];
    }
    return sum;
}
Result:
2024-07-13 10:28:54.706 -------- running sequential sum --------
....................................................................................................
2024-07-12 21:31:02.750 Result: 536870900
2024-07-12 21:31:02.750 Total execution time: 3.303718871999998 seconds
2024-07-12 21:31:02.750 Total bytes processed: 107374182400
2024-07-12 21:31:02.750 Throughput: 30.268919322261286 GB/s
2024-07-12 21:31:02.750 -------- running random access sum -------
I am wondering why. Am I doing something wrong in my SIMD code that is slowing it down?
Note that this is with -O3, so I suspect some auto-vectorization is happening for the serial code. My main concern is how I can achieve that level of performance with my own code.
Full code can be found here: https://godbolt.org/z/rc5vT9nTM
Upvotes: 0
Views: 143
Reputation: 52622
Don’t use NEON intrinsics directly. First, they don’t work on Intel Macs. Second, they force the compiler to follow your instructions literally.
Use just the clang short vector types (for example double with two, four or eight values per vector), and let the compiler worry about optimisation. It is more portable, more readable, easier to write and usually faster.
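As a rough sketch of that approach (not taken from the question's code, and not benchmarked), the same int8 sum could be written with the compiler's generic vector extensions. The function name vector_ext_sum and the typedef names i8x16 and i32x16 are made up for illustration. It uses the GCC-style vector_size attribute, which Clang also accepts, plus __builtin_convertvector to widen each chunk. Whether it reaches the throughput of the auto-vectorized loop would have to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical typedef names; vector_size is given in bytes.
typedef int8_t  i8x16  __attribute__((vector_size(16)));  // 16 x int8
typedef int32_t i32x16 __attribute__((vector_size(64)));  // 16 x int32

int vector_ext_sum(const std::vector<int8_t>& arr) {
    i32x16 acc = {};  // all 16 lanes start at zero
    std::size_t i = 0;

    // Process 16 elements per iteration.
    for (; i + 16 <= arr.size(); i += 16) {
        i8x16 chunk;
        // memcpy avoids alignment assumptions; compilers typically turn it into a plain vector load.
        std::memcpy(&chunk, arr.data() + i, sizeof(chunk));
        // Widen int8 -> int32 lane by lane, then accumulate.
        acc += __builtin_convertvector(chunk, i32x16);
    }

    // Horizontal sum of the 16 partial sums.
    int sum = 0;
    for (int lane = 0; lane < 16; ++lane) {
        sum += acc[lane];
    }

    // Handle the remaining elements.
    for (; i < arr.size(); ++i) {
        sum += arr[i];
    }
    return sum;
}
```

The compiler is free to split the 16-lane accumulator across several hardware registers, which gives it independent accumulation chains to schedule, and the same source also builds on an Intel Mac.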
Upvotes: 0