Alex Xie

Reputation: 91

How to use intrinsics to elementwise multiply two char arrays and sum up the multiplications into int?

I am not familiar with x86_64 intrinsics. I'd like to implement the following operation using 256-bit vector registers. I was using _mm256_maddubs_epi16(a, b); however, it seems that this instruction has an overflow issue, since the sum of two adjacent char*char products can exceed the 16-bit maximum value. I also have trouble understanding _mm256_unpackhi_epi32 and related instructions.

Can anyone explain this and point me in the right direction? Thank you!

int sumup_char_arrays(char *A, char *B, int size) {
    assert (size % 32 == 0);
    int sum = 0;
    for (int i = 0; i < size; i++) {
        sum += A[i]*B[i];
    }
    return sum;
}
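
To make the overflow concern concrete, here is a rough scalar model of a single 16-bit output lane of _mm256_maddubs_epi16, next to the exact value the loop above would add for the same two element pairs. The function names maddubs_lane_model and exact_pair_sum are made up for illustration, and the sketch assumes char is signed, as it usually is on x86-64.

```c
#include <stdint.h>

/* Scalar model of ONE 16-bit output lane of _mm256_maddubs_epi16(a, b):
   bytes of a are read as unsigned, bytes of b as signed, and the sum of
   the two adjacent products is saturated to the int16_t range.
   (Hypothetical helper, for illustration only.) */
int16_t maddubs_lane_model(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    int32_t s = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* Exact contribution of the same two pairs in the scalar loop,
   assuming char is signed. (Hypothetical helper.) */
int exact_pair_sum(int8_t a0, int8_t b0, int8_t a1, int8_t b1) {
    return a0 * b0 + a1 * b1;
}
```

Two things show up in this model: the pair sum saturates (255*127 + 255*127 is 64770, but the lane holds 32767), and a char of -1 in the first operand is read as 255, so the product with -1 comes out as -255 instead of 1.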

Upvotes: 3

Views: 932

Answers (1)

Alex Xie

Reputation: 91

I've figured out a solution. Any ideas for improving it, especially the final reduction stage?

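#include <assert.h>
#include <immintrin.h>

int sumup_char_arrays(char *A, char *B, int size) {
    assert (size % 32 == 0);
    int sum = 0;
    // Eight 32-bit partial sums, starting from zero.
    __m256i sum_tmp = _mm256_setzero_si256();
    for (int i = 0; i < size; i += 32) {
        // Load 16 chars at a time at offset i (unaligned loads, since A and B
        // are not guaranteed to be 16-byte aligned) and sign-extend to 16 bits.
        __m256i ma_l = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(A+i)));
        __m256i ma_h = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(A+i+16)));
        __m256i mb_l = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(B+i)));
        __m256i mb_h = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(B+i+16)));
        // madd multiplies 16-bit elements and adds adjacent products into 32-bit lanes.
        __m256i mc = _mm256_madd_epi16(ma_l, mb_l);
        mc = _mm256_add_epi32(mc, _mm256_madd_epi16(ma_h, mb_h));
        // Vector equivalent of: sum += A[j]*B[j] for j in [i, i+32)
        sum_tmp = _mm256_add_epi32(mc, sum_tmp);
    }
    // Fold the high 128-bit lane onto the low one, then reduce 4 -> 2 -> 1 lanes.
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_permute2x128_si256(sum_tmp, sum_tmp, 0x81));
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_srli_si256(sum_tmp, 8));
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_srli_si256(sum_tmp, 4));
    sum = _mm256_extract_epi32(sum_tmp, 0);
    return sum;
}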
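
On the final reduction: one commonly used pattern is to narrow to 128 bits first and finish with in-register shuffles, so the last steps run on xmm registers. The sketch below is only that, a sketch; it has not been benchmarked against the version above. The names hsum_epi32_avx2 and hsum8_i32 are made up for illustration, and the target attribute is a GCC/Clang assumption that lets the snippet build without -mavx2 (it still needs an AVX2-capable CPU at run time).

```c
#include <immintrin.h>
#include <stdint.h>

/* Horizontal sum of the eight 32-bit lanes of v (hypothetical helper). */
__attribute__((target("avx2")))
static inline int hsum_epi32_avx2(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);        /* lanes 0..3 */
    __m128i hi = _mm256_extracti128_si256(v, 1);   /* lanes 4..7 */
    __m128i s  = _mm_add_epi32(lo, hi);            /* 4 partial sums */
    /* Swap the two 64-bit halves and add: 2 partial sums remain. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Swap adjacent 32-bit lanes and add: lane 0 holds the total. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

/* Convenience wrapper for trying the helper on a plain array of 8 ints
   (hypothetical; uses an unaligned load). */
__attribute__((target("avx2")))
int hsum8_i32(const int32_t *p) {
    return hsum_epi32_avx2(_mm256_loadu_si256((const __m256i *)p));
}
```

In sumup_char_arrays, the four lines after the loop could then become a single return hsum_epi32_avx2(sum_tmp);, provided that function is also compiled with AVX2 enabled.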

Upvotes: 1
