lulle2007200

Reputation: 927

Efficiently extract single double element from AVX-512 vector

I was wondering what the most efficient way is to extract a single double element from an AVX-512 vector without spilling it, using intrinsics.

Currently I'm doing a masked reduce add:

double extract(int idx, __m512d v){
    __mmask8 mask = _mm512_int2mask(1 << idx);   // mask with only bit idx set
    return _mm512_mask_reduce_add_pd(mask, v);   // horizontal sum of just that one element
}

I can't imagine that this is a good way to do it.

Upvotes: 1

Views: 1422

Answers (1)

Peter Cordes

Reputation: 365267

reduce_add isn't a hardware instruction; it's a helper function that does a whole chain of shuffles and adds!

A vector with the element you want as the low element is the same thing as a scalar double; double _mm512_cvtsd_f64(__m512d) is free, so you really just need a way to get the element you want to the bottom.


If you had a mask to start with, vcompresspd (_mm512_maskz_compress_pd) would be a good option, 2 uops on Intel and Zen 4. (https://uops.info/)
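
A minimal sketch of that, assuming the caller already has a mask with exactly one bit set:

// Sketch of the mask-based version; assumes exactly one bit of `mask` is set.
double extract_by_mask(__mmask8 mask, __m512d v)
{
    __m512d packed = _mm512_maskz_compress_pd(mask, v);  // vcompresspd: selected element lands in lane 0
    return _mm512_cvtsd_f64(packed);                     // free: the low element *is* the scalar
}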


But you have an integer index to start with, so you can use that as a shuffle-control vector to get the element you want to the low element of a vector. E.g. you want the compiler to do something like vmovd xmm1, edi / vpermpd zmm0, zmm1, zmm0 for this function.

As an extension, GNU C allows return v[idx]. Agner Fog's VectorClass library in C++ overloads operator[] to also work that way, generating a shuffle for you to keep MSVC happy.
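
A sketch of the VCL version, assuming VCL 2.x's Vec8d type from vectorclass.h:

#include <vectorclass.h>      // Agner Fog's VCL; Vec8d is the 8-double (ZMM) type

double extract_vcl(int idx, Vec8d v)
{
    return v[idx];            // operator[]: VCL generates a shuffle (or store/reload) for you
}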

You can do it manually, assuming the index isn't a compile-time constant, with vpermpd. (For a known-constant index, other shuffles might be even more efficient, like AVX2 vpermpd ymm, ymm, imm8 if the element number happens to be in the low 4, or a 128-bit-granularity immediate shuffle if the element number is 2, 4, or 6, otherwise valignq dst, same, same, imm8.)

double extract512(int idx, __m512d v)
{
#ifdef __GNUC__
    return v[idx];  // let the compiler optimize it, in case of compile-time constant index or vector
#else
    __m512i shuf = _mm512_castsi128_si512( _mm_cvtsi32_si128(idx) );  // vmovd
    return _mm512_cvtsd_f64( _mm512_permutexvar_pd(shuf, v) );        // vpermpd
#endif
}

Interestingly, gcc and clang both choose to store/reload after aligning the stack, taking advantage of store-forwarding from an aligned 64-byte store to an 8-byte aligned reload. (Godbolt). They do compile the manual shuffle to the expected vmovd + vpermpd (or vpermq for clang because it loves to rewrite shuffles.)
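
If you wanted to write that store/reload strategy yourself rather than relying on v[idx], a minimal sketch (hypothetical helper, not from the compiler output above) could be:

double extract_storeload(int idx, __m512d v)
{
    alignas(64) double tmp[8];   // C++ alignas; in C11 use _Alignas or <stdalign.h>
    _mm512_store_pd(tmp, v);     // aligned 64-byte vector store
    return tmp[idx];             // 8-byte reload; store-forwarding handles this well
}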

Either option is reasonable for throughput (at least after inlining into a loop, so stack alignment doesn't have to get redone for every extract). Modern Intel has a lot of load and store ports (2 each on Ice Lake), but only two vector ALU ports that can be active while running 512-bit uops.

The shuffle is probably lower latency from vector input to scalar output. (And with multiple vectors using the same index, could reuse the same shuffle-control vector.)

Maybe similar latency from index input to scalar output, with vmovd + vpermpd both on the critical path adding up to something close to store-forwarding latency.
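
A sketch of that reuse, with a hypothetical helper that pulls the same lane out of two vectors:

// One vmovd builds the shuffle control, then several vpermpd reuse it.
double sum_same_lane(int idx, __m512d a, __m512d b)
{
    __m512i shuf = _mm512_castsi128_si512( _mm_cvtsi32_si128(idx) );   // one vmovd
    double x = _mm512_cvtsd_f64( _mm512_permutexvar_pd(shuf, a) );     // vpermpd
    double y = _mm512_cvtsd_f64( _mm512_permutexvar_pd(shuf, b) );     // vpermpd, reusing shuf
    return x + y;
}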

For a non-inline function, the extra work of aligning the stack is definitely not worth it. (If they were going to store/reload in this non-inline function, they could align a pointer other than RSP into the red-zone, like lea rax, [rsp-64] / and rax, -64, no need to change RSP at all. But that would be using an unknown 64 bytes of the 128-byte red-zone, not efficient if there were any other locals.) Store-forwarding on most CPUs works fairly well even with misaligned stores, especially when the reload is aligned relative to the store.


Depending on surrounding code, you might choose to use the manual shuffle version. If you aren't sure, try both and look at the asm and/or benchmark. For example:

double extract512(int idx, __m512d v)
{
#ifdef __GNUC__   // let the compiler optimize compile-time-constant shuffles after inlining
    if (__builtin_constant_p(idx)) {
        return v[idx];
    }
    // else use the intrinsics for runtime-variable shuffles.
#endif

    __m512i shuf = _mm512_castsi128_si512( _mm_cvtsi32_si128(idx) );  // vmovd
    return _mm512_cvtsd_f64( _mm512_permutexvar_pd(shuf, v) );        // vpermpd
}

Compile-time constant shuffles

GCC and clang code-gen for compile-time constant idx. GCC generally does a very good job with return v[idx], while clang sometimes mangles it or the vpermpd intrinsic to 2 shuffles.

// optimal is GCC's valignq zmm0,zmm0,zmm0, 7
double test7(__m512d v){         return extract512(7, v); } // using v[7]
double test7_manual(__m512d v){  return extract512_manual(7, v); } // using intrinsics
# clang 16.0 -O3 -march=znver4    same as -O3 -march=x86-64-v4
test7:                          # same code-gen for v[7] and intrinsics
        vextractf32x4   xmm0, zmm0, 3
        vpermilpd       xmm0, xmm0, 1   # quite disappointing
        vzeroupper
        ret

# GCC12.2 -O3 -march=x86-64-v4  make optimal asm for this
test7:                           # return v[7]
        valignq zmm0, zmm0, zmm0, 7
        ret                    # GCC knows the caller passed a ZMM arg and will do its own vzeroupper.

test7_manual:                    # GCC intrinsics with idx=7
        vmovdqa xmm1, XMMWORD PTR .LC0[rip]  # optimized the shuffle constant down from 64 to 16 bytes, but could have used a vmovd 4-byte load
        vpermpd zmm0, zmm1, zmm0
        ret

Shuffles with idx from 0 to 3 present more optimization opportunities, since the data only comes from the low YMM. So we can look at AVX1 and AVX2 instructions. idx=4 and 6 are also interesting, where the element we want is at the bottom of a 128-bit or 256-bit lane.
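
Hand-written versions of a couple of those cases might look like the following sketches (hypothetical helpers, assuming compile-time-constant indices); the compiler output below shows what clang and GCC actually emit for v[idx]:

// idx = 3: stays within the low YMM, so AVX2 vpermpd ymm, ymm, imm8 works.
double extract3_avx2(__m512d v)
{
    __m256d lo = _mm512_castpd512_pd256(v);                    // no instruction, just a cast
    return _mm256_cvtsd_f64( _mm256_permute4x64_pd(lo, 3) );   // vpermpd ymm, ymm, 3
}

// idx = 2: at the bottom of a 128-bit lane, so AVX1 vextractf128 is enough.
double extract2_avx1(__m512d v)
{
    __m128d hi = _mm256_extractf128_pd( _mm512_castpd512_pd256(v), 1 ); // vextractf128
    return _mm_cvtsd_f64(hi);
}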

##### Clang
test4:   # clang - vextractf64x4 ymm0, zmm0, 1  (or 32x8) is much better on Zen 4 but same on Intel
        vextractf32x4   xmm0, zmm0, 2
        vzeroupper
        ret

test3:   # clang - quite inefficient
        vextractf128    xmm0, ymm0, 1
        vpermilpd       xmm0, xmm0, 1           # xmm0 = xmm0[1,0]
        vzeroupper
        ret

test2:   # clang - good
        vextractf128    xmm0, ymm0, 1
        vzeroupper
        ret

test1:  # clang - good, could save 2 bytes of code size with vmovhlps or vunpckhpd but otherwise fine
        vpermilpd       xmm0, xmm0, 1           # xmm0 = xmm0[1,0]
        vzeroupper
        ret

test0:   # clang - good
        vzeroupper
        ret

#### GCC
test4:     # GCC v[4]  : good, but vextractf64x4 ymm0, zmm0, 1  has better throughput (0.25) and latency (1c) on Zen 4
        vextractf64x2   xmm0, zmm0, 2
        ret
test3:     # GCC v[3]  : a YMM shuffle would be more efficient, especially on Zen 4, like vpermpd ymm0, ymm0, 3
        valignq zmm0, zmm0, zmm0, 3
        ret
test2:  # GCC v[2]    : Good but AVX1 vextractf128 saves a byte
        vextractf64x2   xmm0, ymm0, 1
        ret
test1:  # GCC v[1]    : Good, optimal I think.  (or maybe vshufpd for better Ice Lake throughput.)
        vunpckhpd       xmm0, xmm0, xmm0
        ret
test0:  # GCC v[0]    : Obviously good.
        ret

GCC's testn_manual code-gen is always a vmovdqa xmm load and a vpermpd zmm, except for test0_manual, where it used vpxor xmm1, xmm1, xmm1 to zero a shuffle-control register, still feeding a vpermpd zmm.



Upvotes: 3
