Reputation: 51
I am trying to convert some code using ARM NEON intrinsics to use Intel intrinsics instead.
I immediately got stuck and am trying to find the appropriate Intel intrinsics to replace the NEON intrinsics. My first hurdle is to translate the following function:
#include <arm_neon.h>
#include <stdint.h>

void sad_row_8(uint8_t *row1, uint8_t *row2, int *result)
{
    *result = 0;
    uint8x8_t vec1 = vld1_u8(row1);          // load 8 bytes from each row
    uint8x8_t vec2 = vld1_u8(row2);
    uint8x8_t absvec = vabd_u8(vec1, vec2);  // per-lane absolute difference
    *result += vaddlv_u8(absvec);            // widening horizontal add across all lanes
}
In the code above, row1 and row2 are pointers to rows of at least 8 consecutive uint8_t elements. The function computes the sum of absolute differences between the two rows.
When writing the NEON code I used https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon] to find appropriate intrinsics, and I never had much trouble finding what I needed. To translate the code above, I have been searching https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=MMX,SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512&ig_expand=54,6050&cats=Load for intrinsics corresponding to the ones I used in the NEON solution, but without much luck.
What I am looking for is help/advice on how I can better approach this problem, perhaps by pointing out some possibly obvious flaws in my approach.
My processor is an Intel Core i5-11400F, which according to Intel has the instruction set extensions Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512.
Upvotes: 0
Views: 424
Reputation: 20037
Sum of absolute differences is done a bit differently on Intel.
In Neon programming one uses a traditional per-lane abd operation, preferably with widening accumulation, followed by a final horizontal reduction.
On Intel, the intrinsic _mm_sad_epu8 instead performs two abd + horizontal reductions in parallel:
1 2 3 1 2 3 1 2 | 0 1 0 1 4 2 1 0 | <-- register A
0 0 0 0 1 1 1 1 | 2 2 2 2 3 3 3 3 | <-- register B
-----------------------------------
1 2 3 1 1 2 0 1 | 2 1 2 1 1 1 2 3 | <-- Neon vabdq_u8(A,B)
 11 (uint64_t)  |  13 (uint64_t)  | <-- Intel _mm_sad_epu8(A,B)
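For illustration, here is a minimal self-contained sketch that feeds the diagram's registers A and B into _mm_sad_epu8 and reads back both 64-bit partial sums (assuming SSE4.1 is available for _mm_extract_epi64):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The two registers from the diagram above. */
    uint8_t a[16] = { 1, 2, 3, 1, 2, 3, 1, 2,   0, 1, 0, 1, 4, 2, 1, 0 };
    uint8_t b[16] = { 0, 0, 0, 0, 1, 1, 1, 1,   2, 2, 2, 2, 3, 3, 3, 3 };

    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sad = _mm_sad_epu8(va, vb);          /* two 64-bit partial sums */

    long long low  = _mm_cvtsi128_si64(sad);     /* SAD of bytes 0..7  */
    long long high = _mm_extract_epi64(sad, 1);  /* SAD of bytes 8..15 (SSE4.1) */

    printf("%lld %lld\n", low, high);            /* prints: 11 13 */
    return 0;
}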
The corresponding Intel routine would be:
#include <immintrin.h>   // Intel intrinsics (SSE2 and later)
#include <stdint.h>

void sad_row_8(uint8_t *row1, uint8_t *row2, int *result)
{
    *result = 0;
    __m128i vec1 = _mm_loadu_si64(row1);        // load 8 bytes; upper 64 bits zeroed
    __m128i vec2 = _mm_loadu_si64(row2);
    __m128i absvec = _mm_sad_epu8(vec1, vec2);  // abd + horizontal sum per 64-bit lane
    *result += _mm_cvtsi128_si32(absvec);       // read the low lane's sum
}
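As a quick usage sketch (assuming it is compiled together with the routine above), the result can be checked against a plain scalar loop; with the first eight bytes of the diagram's registers it should come out to 11. Note that _mm_loadu_si64 zeroes the upper 64 bits of the register, so the upper SAD lane is zero and _mm_cvtsi128_si32, which reads only the low lane, already returns the full sum.

#include <stdint.h>
#include <stdio.h>

/* Plain scalar reference for comparison. */
static int sad_row_8_scalar(const uint8_t *row1, const uint8_t *row2)
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += row1[i] > row2[i] ? row1[i] - row2[i] : row2[i] - row1[i];
    return sum;
}

int main(void)
{
    uint8_t row1[8] = { 1, 2, 3, 1, 2, 3, 1, 2 };   /* first half of register A */
    uint8_t row2[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };   /* first half of register B */

    int result;
    sad_row_8(row1, row2, &result);                 /* the routine above */

    printf("SIMD: %d, scalar: %d\n", result, sad_row_8_scalar(row1, row2));  /* both print 11 */
    return 0;
}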
Upvotes: 4