Reputation: 927
I was wondering what the most efficient way is to extract a single double element from an AVX-512 vector without spilling it, using intrinsics.
Currently I'm doing a masked reduce-add:
double extract(int idx, __m512d v){
    __mmask8 mask = _mm512_int2mask(1 << idx);
    return _mm512_mask_reduce_add_pd(mask, v);
}
I can't imagine that this is a good way to do it.
Upvotes: 1
Views: 1422
Reputation: 365267
reduce_add isn't a hardware instruction, it's a helper function that does a whole chain of shuffles and adds!

A vector with the element you want as the low element is the same thing as a scalar double; double _mm512_cvtsd_f64(__m512d) is free, so you really just need a way to get the element you want to the bottom.
If you had a mask to start with, vcompresspd (_mm512_maskz_compress_pd) would be a good option: 2 uops on Intel and Zen 4 (https://uops.info/).
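For instance, a minimal sketch of that mask-input case (the function name here is mine, not from the question):

#include <immintrin.h>

double extract_with_mask(__mmask8 one_bit_mask, __m512d v)
{
    // vcompresspd packs the selected element down to the low element
    __m512d packed = _mm512_maskz_compress_pd(one_bit_mask, v);
    return _mm512_cvtsd_f64(packed);   // taking the low element is free
}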
But you have an integer index to start with, so you can use it as a shuffle-control vector to get the element you want into the low element of a vector. e.g. you want the compiler to do something like vmovd xmm1, edi / vpermpd zmm0, zmm1, zmm0 for this function.
As an extension, GNU C allows return v[idx]. Agner Fog's VectorClass library in C++ overloads operator[] to also work that way, generating a shuffle for you to keep MSVC happy.
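As an illustration of that operator[] (a sketch only; it assumes Agner Fog's VCL headers are on the include path, and the function name is mine):

#define MAX_VECTOR_SIZE 512   // make sure the 512-bit classes like Vec8d are enabled
#include "vectorclass.h"

double extract_vcl(int idx, Vec8d v)
{
    return v[idx];   // VCL's operator[] extracts one element, compiling to a shuffle
}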
You can do it manually, assuming the index isn't a compile-time constant, with vpermpd. (For a known-constant index, other shuffles might be even more efficient, like AVX2 vpermpd ymm, ymm, imm8 if the element number happens to be in the low 4, or a 128-bit-granularity immediate shuffle if the element number is 2, 4, or 6, otherwise valignq dst, same, same, imm8.)
double extract512(int idx, __m512d v)
{
#ifdef __GNUC__
    return v[idx];  // let the compiler optimize it, in case of compile-time constant index or vector
#else
    __m512i shuf = _mm512_castsi128_si512( _mm_cvtsi32_si128(idx) );  // vmovd
    return _mm512_cvtsd_f64( _mm512_permutexvar_pd(shuf, v) );        // vpermpd
#endif
}
Interestingly, gcc and clang both choose to store/reload after aligning the stack, taking advantage of store-forwarding from an aligned 64-byte store to an 8-byte aligned reload (Godbolt). They do compile the manual shuffle to the expected vmovd + vpermpd (or vpermq for clang, because it loves to rewrite shuffles).
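Written out by hand, that store/reload strategy looks roughly like this (a sketch, not the compilers' exact output; the function name is mine):

#include <immintrin.h>

double extract_store_reload(int idx, __m512d v)
{
    alignas(64) double tmp[8];
    _mm512_store_pd(tmp, v);   // aligned 64-byte store
    return tmp[idx];           // 8-byte reload; store-forwarding handles this
}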
Either option is reasonable for throughput (at least after inlining into a loop, so stack alignment doesn't have to get redone for every extract). Modern Intel has a lot of load and store ports (2 each on Ice Lake), but only two vector ALU ports that can be active while running 512-bit uops.
The shuffle is probably lower latency from vector input to scalar output. (And with multiple vectors using the same index, you could reuse the same shuffle-control vector.) Maybe similar latency from index input to scalar output, with vmovd + vpermpd both on the critical path adding up to something close to store-forwarding latency.
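To illustrate reusing one shuffle-control vector across several vectors with the same runtime index, a sketch (the loop and names are invented for illustration):

#include <immintrin.h>
#include <stddef.h>

double sum_selected_lane(int idx, const __m512d *vecs, size_t n)  // vecs assumed 64-byte aligned
{
    __m512i shuf = _mm512_castsi128_si512(_mm_cvtsi32_si128(idx));  // vmovd, once
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += _mm512_cvtsd_f64(_mm512_permutexvar_pd(shuf, vecs[i]));  // one vpermpd per vector
    return total;
}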
For a non-inline function, the extra work of aligning the stack is definitely not worth it. (If they were going to store/reload in this non-inline function, they could align a pointer other than RSP into the red-zone, like lea rax, [rsp-64] / and rax, -64, no need to change RSP at all. But that would be using an unknown 64 bytes of the 128-byte red-zone, not efficient if there were any other locals.) Store-forwarding on most CPUs works fairly well even with misaligned stores, especially when the reload is aligned relative to the store.
Depending on surrounding code, you might choose to use the manual shuffle version. If you aren't sure, try both and look at the asm and/or benchmark. For example:
double extract512(int idx, __m512d v)
{
#ifdef __GNUC__   // let the compiler optimize compile-time-constant shuffles after inlining
    if (__builtin_constant_p(idx)) {
        return v[idx];
    }
    // else use the intrinsics for runtime-variable shuffles.
#endif
    __m512i shuf = _mm512_castsi128_si512( _mm_cvtsi32_si128(idx) );  // vmovd
    return _mm512_cvtsd_f64( _mm512_permutexvar_pd(shuf, v) );        // vpermpd
}
GCC and clang code-gen for compile-time constant idx: GCC generally does a very good job with return v[idx], while clang sometimes mangles it or the vpermpd intrinsic into 2 shuffles.
// optimal is GCC's valignq zmm0,zmm0,zmm0, 7
double test7(__m512d v){ return extract512(7, v); } // using v[7]
double test7_manual(__m512d v){ return extract512_manual(7, v); } // using intrinsics
# clang 16.0 -O3 -march=znver4   (same as -O3 -march=x86-64-v4)
test7:                  # same code-gen for v[7] and intrinsics
    vextractf32x4   xmm0, zmm0, 3
    vpermilpd       xmm0, xmm0, 1      # quite disappointing
    vzeroupper
    ret

# GCC 12.2 -O3 -march=x86-64-v4 makes optimal asm for this
test7:                  # return v[7]
    valignq         zmm0, zmm0, zmm0, 7
    ret             # GCC knows the caller passed a ZMM arg and will do its own vzeroupper.

test7_manual:           # GCC intrinsics with idx=7
    vmovdqa         xmm1, XMMWORD PTR .LC0[rip]   # optimized the shuffle constant down from 64 to 16 bytes, but could have used a vmovd 4-byte load
    vpermpd         zmm0, zmm1, zmm0
    ret
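If you wanted to write GCC's optimal constant-index shuffle by hand, a sketch using the valignq intrinsic (the function name is mine):

#include <immintrin.h>

double extract_elem7(__m512d v)
{
    // valignq zmm, zmm, zmm, 7: rotate so element 7 lands in the low element
    __m512i rot = _mm512_alignr_epi64(_mm512_castpd_si512(v), _mm512_castpd_si512(v), 7);
    return _mm512_cvtsd_f64(_mm512_castsi512_pd(rot));
}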
Shuffles with idx from 0 to 3 present more optimization opportunities, since the data only comes from the low YMM. So we can look at AVX1 and AVX2 instructions. idx=4 and 6 are also interesting, where the element we want is at the bottom of a 128-bit or 256-bit lane.
#### Clang
test4:      # clang - vextractf64x4 ymm0, zmm0, 1 (or 32x8) is much better on Zen 4 but same on Intel
    vextractf32x4   xmm0, zmm0, 2
    vzeroupper
    ret

test3:      # clang - quite inefficient
    vextractf128    xmm0, ymm0, 1
    vpermilpd       xmm0, xmm0, 1      # xmm0 = xmm0[1,0]
    vzeroupper
    ret

test2:      # clang - good
    vextractf128    xmm0, ymm0, 1
    vzeroupper
    ret

test1:      # clang - good, could save 2 bytes of code size with vmovhlps or vunpckhpd but otherwise fine
    vpermilpd       xmm0, xmm0, 1      # xmm0 = xmm0[1,0]
    vzeroupper
    ret

test0:      # clang - good
    vzeroupper
    ret
#### GCC
test4:      # GCC v[4] : good, but vextractf64x4 ymm0, zmm0, 1 has better throughput (0.25) and latency (1c) on Zen 4
    vextractf64x2   xmm0, zmm0, 2
    ret

test3:      # GCC v[3] : a YMM shuffle would be more efficient, especially on Zen 4, like vpermpd ymm0, ymm0, 3
    valignq         zmm0, zmm0, zmm0, 3
    ret

test2:      # GCC v[2] : good, but AVX1 vextractf128 saves a byte
    vextractf64x2   xmm0, ymm0, 1
    ret

test1:      # GCC v[1] : good, optimal I think. (Or maybe vshufpd for better Ice Lake throughput.)
    vunpckhpd       xmm0, xmm0, xmm0
    ret

test0:      # GCC v[0] : obviously good.
    ret
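For the idx=4 case, the vextractf64x4 alternative mentioned in the comments above could be written with intrinsics roughly like this (a sketch; the function name is mine):

#include <immintrin.h>

double extract_elem4(__m512d v)
{
    __m256d high = _mm512_extractf64x4_pd(v, 1);  // vextractf64x4: elements 4..7
    return _mm256_cvtsd_f64(high);                // element 4 is now the low element
}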
GCC's testn_manual code-gen is always a vmovdqa xmm load of the shuffle-control constant and a vpermpd zmm, except for test0_manual where it uses vpxor xmm1, xmm1, xmm1 to zero the shuffle-control register, still feeding a vpermpd zmm.
Related, mostly SSE and AVX1/2, not AVX-512:
How to get data out of AVX registers? - for SSE there was a _MM_EXTRACT_FLOAT CPP macro, among other things.
print a __m128i variable - if you need all the elements, store to an array and index it. Compilers often optimize this to a shuffle if you only use one element.
Upvotes: 3