terdev

Reputation: 87

Are there ARM64 equivalents for x86-64 SSE2 integer SIMD GCC built-in functions?

I'm trying to use an AMM algorithm (approximate matrix multiplication) on Apple's M1. The algorithm is built entirely around speed and uses the x86 built-in functions listed below. Since running it in an x86 VM slows down several crucial steps of the algorithm, I was wondering whether there is another way to run it on ARM64.

I also could not find suitable documentation for the ARM64 built-in functions, which might help with mapping some of the x86-64 instructions.

Used built-in functions:

__builtin_ia32_vec_init_v2si
__builtin_ia32_vec_ext_v2si
__builtin_ia32_packsswb
__builtin_ia32_packssdw
__builtin_ia32_packuswb
__builtin_ia32_punpckhbw
__builtin_ia32_punpckhwd
__builtin_ia32_punpckhdq
__builtin_ia32_punpcklbw
__builtin_ia32_punpcklwd
__builtin_ia32_punpckldq
__builtin_ia32_paddb
__builtin_ia32_paddw
__builtin_ia32_paddd

Upvotes: 1

Views: 3111

Answers (1)

Peter Cordes

Reputation: 364180

Normally you'd use intrinsics instead of the raw GCC builtin functions, but see https://gcc.gnu.org/onlinedocs/gcc/ARM-C-Language-Extensions-_0028ACLE_0029.html. The __builtin_arm_... and __builtin_aarch64_... functions like __builtin_aarch64_saddl2v16qi don't seem to be documented in the GCC manual the way the x86 ones are, just another sign they're not intended for direct use.

See also https://developer.arm.com/documentation/102467/0100/Why-Neon-Intrinsics- re intrinsics and #include <arm_neon.h>. GCC provides a version of that header, with the documented intrinsics API implemented using __builtin_aarch64_... GCC builtins.
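As a rough illustration of what the intrinsics route looks like, here is a minimal sketch of two of the operations from the question written against `<arm_neon.h>`, assuming the 64-bit operand width that the `v2si` builtins imply. The helper names `unpack_lo_u8` and `add_u16x4` are hypothetical, and the scalar fallback is only there so the file also builds on non-AArch64 targets; it is not how any particular library does it.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: punpcklbw-like interleave of the low four bytes
   of two 8-byte vectors, giving a0 b0 a1 b1 a2 b2 a3 b3. */
void unpack_lo_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
#if defined(__aarch64__)
    /* vzip1_u8 interleaves the lower halves of its two operands. */
    vst1_u8(out, vzip1_u8(vld1_u8(a), vld1_u8(b)));
#else
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
#endif
}

/* Hypothetical helper: paddw-like wrapping add of four 16-bit lanes. */
void add_u16x4(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
#if defined(__aarch64__)
    vst1_u16(out, vadd_u16(vld1_u16(a), vld1_u16(b)));
#else
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)(a[i] + b[i]);   /* wraps modulo 2^16 */
#endif
}
```

In real code you would keep values in `uint8x8_t` / `uint16x4_t` variables between operations rather than loading and storing around every call; the array interface here just keeps the sketch easy to test.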


As for portability libraries, AFAIK nothing maps the raw builtins, but SIMDe (https://github.com/simd-everywhere/simde) has portable implementations of immintrin.h Intel intrinsics like _mm_packs_epi16. Most code should be using that API instead of GNU C builtins, unless you're using GNU C native vectors (__attribute__((vector_size(16)))) for portable SIMD without any ISA-specific stuff. But that's not viable when you want to take advantage of special shuffles and the like.
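For the plain lane-wise adds in the question, the GNU C native vector route might look like the sketch below. It assumes GCC or Clang and an 8-byte vector to match the 64-bit operands of the listed builtins; the type name `v8u8` and the helper `add_u8x8` are made up for illustration. Unsigned lanes are used so that the wraparound is well defined.

```c
#include <stdint.h>
#include <string.h>

/* Eight uint8_t lanes in one 64-bit vector (GNU C extension).
   The type name is hypothetical. */
typedef uint8_t v8u8 __attribute__((vector_size(8)));

/* paddb-like wrapping add. The compiler chooses the instruction for
   whatever target it is compiling for; nothing here is ISA-specific. */
void add_u8x8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    v8u8 va, vb, vr;
    memcpy(&va, a, sizeof va);   /* memcpy avoids aliasing/alignment issues */
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;                /* lane-wise add, modulo 256 per lane */
    memcpy(out, &vr, sizeof vr);
}
```

This covers the adds but not the saturating packs, which is where intrinsics or SIMDe come in.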

And yes, ARM does have saturating narrowing instructions like vqmovn (https://developer.arm.com/documentation/dui0473/m/neon-instructions/vqmovn-and-vqmovun), so SIMDe can efficiently emulate pack instructions. That page documents AArch32, not AArch64, but AArch64 has equivalent instructions (SQXTN and SQXTUN).
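To make the pack case concrete, here is a hedged sketch of a packsswb-style operation on 64-bit operands using the `vqmovn_s16` intrinsic from `<arm_neon.h>`. The helper name `pack_s16_to_s8` is hypothetical, and this is not taken from SIMDe's source; the scalar branch spells out the intended semantics and lets the file build where NEON is unavailable.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
/* Scalar reference: clamp one 16-bit value to the int8_t range. */
static int8_t sat_s16_to_s8(int16_t x)
{
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}
#endif

/* Hypothetical helper with packsswb-like semantics:
   out[0..3] = saturate(a[0..3]), out[4..7] = saturate(b[0..3]). */
void pack_s16_to_s8(const int16_t a[4], const int16_t b[4], int8_t out[8])
{
#if defined(__ARM_NEON)
    /* Join the two 4-lane inputs (a in the low half, b in the high half),
       then narrow every lane to 8 bits with signed saturation. */
    int16x8_t wide = vcombine_s16(vld1_s16(a), vld1_s16(b));
    vst1_s8(out, vqmovn_s16(wide));
#else
    for (int i = 0; i < 4; i++) {
        out[i]     = sat_s16_to_s8(a[i]);
        out[i + 4] = sat_s16_to_s8(b[i]);
    }
#endif
}
```

A packuswb-style variant would presumably use `vqmovun_s16`, which narrows signed 16-bit lanes to unsigned 8-bit with saturation.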

Upvotes: 2
