Reputation: 51

x86_64 Dot Vector Product Intrinsic to ARM64

I am porting a small C routine that uses x86_64 intrinsics to an ARM64 platform. I am unable to find an equivalent ARM64 intrinsic for _mm_dp_pd.

I do have access to the ARM NEON intrinsics.

I am not sure how to replace the x86_64 intrinsic with an ARM64 equivalent.

Any help would be much appreciated.

#ifdef ARM64
    float32x4_t a, b;
#else
    __m128d a, b;
#endif

#ifdef ARM64
    ????
#else
    res = _mm_dp_pd(a, b, mask);
#endif

Upvotes: 1

Answers (1)

Peter Cordes

Reputation: 363980

dppd isn't faster than a vertical multiply / shuffle / add, and in fact decodes to 3 uops on Intel CPUs (https://agner.org/optimize/) which probably do exactly that (with maybe some extra bonus stuff for the mask).

e.g. on Skylake, it's 9c latency with 2 uops for p01 (where the FMA units are) and 1 uop for p5 (where the shuffle unit is).

It's even slower on AMD before Ryzen (e.g. 7 uops on Steamroller), but Ryzen decodes it as 3 uops. (dpps is still slow, though, in case you actually want four 32-bit float elements (float32x4_t) instead of two 64-bit double elements (__m128d)).

Anyway, assuming you want the dot-product result broadcast to both elements of a double vector, do a vertical multiply, then swap one vector and do a vertical add.

Porting this to ARM should be easy:

__m128d prods = _mm_mul_pd(a,b);
__m128d swap  = _mm_shuffle_pd(prods,prods, 0b01);
__m128d dot   = _mm_add_pd(prods, swap);
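As a rough illustration of that port, here is a minimal sketch of the same multiply / swap / add sequence with AArch64 NEON intrinsics, assuming the data really is two doubles (float64x2_t, not the float32x4_t from the question). The helper name dot2_broadcast_arrays is made up for this example, and the x86 branch is only there so the same file builds on both targets.

```c
#if defined(__aarch64__)
#include <arm_neon.h>

/* Hypothetical helper: out[0] = out[1] = a[0]*b[0] + a[1]*b[1]. */
static inline void dot2_broadcast_arrays(const double a[2], const double b[2],
                                         double out[2])
{
    float64x2_t prods = vmulq_f64(vld1q_f64(a), vld1q_f64(b)); /* {a0*b0, a1*b1} */
    float64x2_t swap  = vextq_f64(prods, prods, 1);            /* {a1*b1, a0*b0} */
    float64x2_t dot   = vaddq_f64(prods, swap);                /* sum in both lanes */
    /* vpaddq_f64(prods, prods) should give the same result in one instruction. */
    vst1q_f64(out, dot);
}
#else
#include <emmintrin.h>

/* Same operation with SSE2, as in the snippet above. */
static inline void dot2_broadcast_arrays(const double a[2], const double b[2],
                                         double out[2])
{
    __m128d prods = _mm_mul_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    __m128d swap  = _mm_shuffle_pd(prods, prods, 1);           /* swap the halves */
    _mm_storeu_pd(out, _mm_add_pd(prods, swap));
}
#endif
```

With a = {1.0, 2.0} and b = {3.0, 4.0}, both elements of out end up holding 11.0 on either target.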

Or if you only care about the low element, then you can use a simpler shuffle like movhlps (see the Q&A "Fastest way to do horizontal float vector sum on x86").

If you need the upper element zeroed, like dppd can do, then it might take an extra instruction on AArch64.
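One possible way to get the sum in the low element with the upper element zeroed (what dppd does with an immediate of 0x31) is sketched below. It assumes two-double vectors; the helper name dot2_low_zero_high is invented for illustration, and this is one option among several, not necessarily the cheapest sequence on every core.

```c
#if defined(__aarch64__)
#include <arm_neon.h>

/* Hypothetical helper: out[0] = a[0]*b[0] + a[1]*b[1], out[1] = 0.0. */
static inline void dot2_low_zero_high(const double a[2], const double b[2],
                                      double out[2])
{
    float64x2_t prods = vmulq_f64(vld1q_f64(a), vld1q_f64(b));
    double sum = vaddvq_f64(prods);                 /* horizontal add of both lanes */
    /* Start from an all-zero vector and insert the sum into lane 0. */
    vst1q_f64(out, vsetq_lane_f64(sum, vdupq_n_f64(0.0), 0));
}
#else
#include <emmintrin.h>

static inline void dot2_low_zero_high(const double a[2], const double b[2],
                                      double out[2])
{
    __m128d prods = _mm_mul_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    __m128d high  = _mm_unpackhi_pd(prods, prods);  /* {p1, p1} */
    __m128d sum   = _mm_add_sd(prods, high);        /* {p0 + p1, p1} */
    /* Keep the low element of sum, take a zero for the high element. */
    _mm_storeu_pd(out, _mm_move_sd(_mm_setzero_pd(), sum));
}
#endif
```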

And BTW, if you're doing a lot of DPPD, you might want to look at changing your data layout to a struct-of-arrays, so you can do two dot-products in parallel without any shuffling, with a mul and an FMA. See https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ for a good explanation of designing your data layout / whole approach to be SIMD friendly.
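To make the struct-of-arrays idea concrete, here is a minimal sketch for 2-component vectors stored as separate x and y arrays. The layout, the function name dot2_soa and its parameters are assumptions for illustration only. Each loop iteration produces two dot products with one multiply and one fused multiply-add, and no shuffles.

```c
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#else
#include <immintrin.h>
#endif

/* Hypothetical SoA layout: out[i] = ax[i]*bx[i] + ay[i]*by[i] for i in [0, n). */
static void dot2_soa(const double *ax, const double *ay,
                     const double *bx, const double *by,
                     double *out, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {                    /* two dot products per step */
#if defined(__aarch64__)
        float64x2_t d = vmulq_f64(vld1q_f64(ax + i), vld1q_f64(bx + i));
        d = vfmaq_f64(d, vld1q_f64(ay + i), vld1q_f64(by + i)); /* d + ay*by */
        vst1q_f64(out + i, d);
#else
        __m128d d = _mm_mul_pd(_mm_loadu_pd(ax + i), _mm_loadu_pd(bx + i));
        __m128d y = _mm_loadu_pd(ay + i), w = _mm_loadu_pd(by + i);
#ifdef __FMA__
        d = _mm_fmadd_pd(y, w, d);                  /* needs FMA enabled, e.g. -mfma */
#else
        d = _mm_add_pd(d, _mm_mul_pd(y, w));        /* plain SSE2 fallback */
#endif
        _mm_storeu_pd(out + i, d);
#endif
    }
    for (; i < n; ++i)                              /* scalar tail for odd n */
        out[i] = ax[i] * bx[i] + ay[i] * by[i];
}
```

The FMA and non-FMA paths can differ in the last bit of rounding for general inputs, which is worth keeping in mind when comparing results across targets.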

But horizontal stuff outside an inner loop isn't always bad.

Upvotes: 1
