Reputation: 91
I am not familiar with x86_64 intrinsics. I'd like to implement the following operation using 256-bit vector registers. I was using _mm256_maddubs_epi16(a, b); however, this instruction seems to have an overflow issue, since the sum of two adjacent char*char products can exceed the 16-bit maximum value. I also have trouble understanding _mm256_unpackhi_epi32 and related instructions.
Can anyone explain this and point me in the right direction? Thank you!
#include <assert.h>

int sumup_char_arrays(char *A, char *B, int size) {
    assert (size % 32 == 0);
    int sum = 0;
    // Scalar reference: dot product of two char arrays.
    for (int i = 0; i < size; i++) {
        sum += A[i]*B[i];
    }
    return sum;
}
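To make the overflow concern concrete, here is a minimal scalar sketch of what one 16-bit output lane of _mm256_maddubs_epi16 computes, as I understand the instruction: bytes of the first operand are treated as unsigned, bytes of the second as signed, and the sum of the two adjacent products is saturated to the signed 16-bit range. The function name maddubs_lane_model is hypothetical and exists only for illustration.

```c
#include <stdint.h>

/* Hypothetical scalar model of ONE 16-bit lane of _mm256_maddubs_epi16(a, b).
   a0, a1: two adjacent bytes of a, interpreted as unsigned.
   b0, b1: the matching bytes of b, interpreted as signed. */
static int16_t maddubs_lane_model(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    /* Exact pairwise sum, computed in 32 bits so nothing is lost here. */
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    /* The instruction saturates the result to the int16_t range. */
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}
```

For example, with a0 = a1 = 255 and b0 = b1 = 127 the exact sum is 64770, but the model returns 32767. A single signed char product never exceeds 128*128 = 16384 in magnitude, so sign-extending both inputs to 16 bits and accumulating in 32 bits avoids this problem.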
Upvotes: 3
Views: 932
Reputation: 91
I've figured out a solution. Any ideas for improving it, especially the final reduction stage?
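#include <assert.h>
#include <immintrin.h>

int sumup_char_arrays(char *A, char *B, int size) {
    assert (size % 32 == 0);
    int sum = 0;
    // The accumulator must start at zero.
    __m256i sum_tmp = _mm256_setzero_si256();
    for (int i = 0; i < size; i += 32) {
        // Sign-extend 16 bytes to 16-bit lanes (assumes char is signed).
        // Index by i, and use unaligned loads since A and B may not be 16-byte aligned.
        __m256i ma_l = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(A + i)));
        __m256i ma_h = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(A + i + 16)));
        __m256i mb_l = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(B + i)));
        __m256i mb_h = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(B + i + 16)));
        // Multiply 16-bit lanes and add adjacent pairs into 32-bit lanes.
        __m256i mc = _mm256_madd_epi16(ma_l, mb_l);
        mc = _mm256_add_epi32(mc, _mm256_madd_epi16(ma_h, mb_h));
        sum_tmp = _mm256_add_epi32(mc, sum_tmp);
        // Scalar equivalent: sum += A[i]*B[i] for these 32 elements.
    }
    // Horizontal reduction: fold the high 128-bit lane onto the low one,
    // then fold by 8 bytes and by 4 bytes so element 0 holds the total.
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_permute2x128_si256(sum_tmp, sum_tmp, 0x81));
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_srli_si256(sum_tmp, 8));
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_srli_si256(sum_tmp, 4));
    sum = _mm256_extract_epi32(sum_tmp, 0);
    return sum;
}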
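On the final reduction, one possible alternative is to drop to 128-bit registers as early as possible and finish with cheap in-lane shuffles. The sketch below is only one way to do it, not necessarily the fastest. The names AVX2_FN, hsum_epi32_avx2 and dot_i8_avx2 are hypothetical, the target attribute is GCC/Clang-specific, and the code assumes the length is a multiple of 16 and that the CPU supports AVX2.

```c
#include <immintrin.h>
#include <stddef.h>

/* GCC/Clang-specific: enables AVX2 for these functions without -mavx2. */
#define AVX2_FN __attribute__((target("avx2")))

/* Hypothetical helper: horizontal sum of the eight 32-bit lanes of v. */
AVX2_FN static int hsum_epi32_avx2(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);        /* low 128 bits, no instruction needed */
    __m128i hi = _mm256_extracti128_si256(v, 1);   /* high 128 bits */
    __m128i s = _mm_add_epi32(lo, hi);             /* 8 lanes -> 4 lanes */
    /* Swap the two 64-bit halves and add: 4 lanes -> 2 distinct sums. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Swap adjacent 32-bit elements and add: element 0 now holds the total. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

/* Hypothetical dot product of signed bytes; n must be a multiple of 16. */
AVX2_FN int dot_i8_avx2(const signed char *a, const signed char *b, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 16) {
        /* Unaligned 16-byte loads, sign-extended to sixteen 16-bit lanes. */
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        /* Multiply 16-bit lanes and add adjacent pairs into 32-bit lanes. */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    return hsum_epi32_avx2(acc);
}
```

The reduction runs once per call, so for large arrays its cost is small compared with the loop. The 32-bit accumulator lanes can still overflow for very long inputs, just as the int sum in the scalar version can.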
Upvotes: 1