Reputation: 59
how to store low 8 bits in every 32 bits data lane, for example, a zmm register stores 16 32-bit integer, I only need to store low 8 bits data to memory which is a int8_t array?
Upvotes: 0
Views: 324
Reputation: 365517
There's an instruction for that, vpmovdb
(https://www.felixcloutier.com/x86/vpmovdb:vpmovsdb:vpmovusdb). But unfortunately Intel CPUs run it as 2 uops, both for port 5. (Or 3 total with a memory destination; https://uops.info/ and https://agner.org/optimize/). C intrinsics
VPMOVDB __m128i _mm512_cvtepi32_epi8( __m512i a);
VPMOVDB __m128i _mm512_mask_cvtepi32_epi8(__m128i s, __mmask16 k, __m512i a);
VPMOVDB __m128i _mm512_maskz_cvtepi32_epi8( __mmask16 k, __m512i a);
VPMOVDB void _mm512_mask_cvtepi32_storeu_epi8(void * d, __mmask16 k, __m512i a);
Fun fact: it allows a memory destination and requires only AVX512F, not AVX512BW, so it's how KNL Xeon Phi could do byte-granularity masked stores. There are signed and unsigned saturating forms, but you want the truncating form.
Truncation and lane-crossing packing are both new in AVX-512; in AVX2 you'd typically have to use 2x vpackssdw
to feed vpackuswb
and a vpermq
fixup, also requiring masking the 32-bit data to be in the 0..255 unsigned range before packing, which would cost an extra vpand
per input vector.
If you have multiple vectors of data, it could be worthwhile to use AVX-512VBMI (Ice Lake / Zen 4) vpermt2b
to grab bytes from two vectors. That runs as 3 uops (2p5 + p015) on Intel CPUs that support it, so it's 2 cycles per 256-bit vector instead of 2 per 128-bit vector.
It's 1 uop on Zen 4, same as vpmovdb
, so vpermt2b
is ideal there, allowing one 256-bit store per clock, twice as fast as Intel and with fewer front-end uops.
vpack...
in-lane pack instructions are single uop on Intel and do 2:1 packing of 2 registers into one of the same width. With a vpermt2d
shuffle at the end, this could come out ahead for multiple vectors.
So given 4 ZMM registers, 4x vpmovdb
would take 8 cycles of throughput on Intel. 4x vpand
+ 2x vpackssdw
+ 1x vpackuswb
+ vpermq
will generate one ZMM ready to store per 4 clock cycles on Skylake/Cascade Lake or later.
(vpermt2w
is available on Skylake-avx512 so would be usable instead of vpackuswb
+ vpermq
, but is 3 uops there and on later Intel, unfortunately.)
Another idea with AVX2 instructions: how to convert uint32 to uint8 using simd but not avx512? - vpshufb
(_mm256_shuffle_epi8
) to pack and zero, setting up for a 2-input shuffle with wider granularity. But that requires one port5 uop per input vector.
Intel's C/C++ intrinsics guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) has a search which works on asm mnemonics; they're shorter to type and talk about.
Upvotes: 2