Reputation: 319
I want an AVX2 (or earlier) intrinsic that will convert an 8-wide 32-bit integer vector (256 bits total) into an 8-wide 16-bit integer vector (128 bits total), discarding the upper 16 bits of each element. This should be the inverse of _mm256_cvtepi16_epi32. If there is no direct instruction, how should I best do this with a sequence of instructions?
Upvotes: 6
Views: 1575
Reputation: 363882
There is no single-instruction inverse until AVX512F: __m128i _mm256_cvtepi32_epi16(__m256i a) (vpmovdw), also available for 512->256 or 128->low_half_of_128. (The versions with inputs narrower than a 512-bit ZMM register also require AVX512VL, so only Skylake-X, not Xeon Phi KNL.)
There are signed-saturation and unsigned-saturation versions of that AVX512 instruction (vpmovsdw / vpmovusdw), but only AVX512 has a pack instruction that truncates (discarding the upper bytes of each element) instead of saturating.
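For reference, a minimal usage sketch with AVX512VL (v here is a hypothetical __m256i input, not a name from the question):

__m128i trunc = _mm256_cvtepi32_epi16(v);    // vpmovdw: plain truncation
__m128i sat_s = _mm256_cvtsepi32_epi16(v);   // vpmovsdw: signed saturation
__m128i sat_u = _mm256_cvtusepi32_epi16(v);  // vpmovusdw: unsigned saturation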
Or with AVX512BW, you could emulate a lane-crossing 2-input pack using vpermi2w to produce a 512-bit result from two 512-bit input vectors. On Skylake-AVX512 it decodes to multiple shuffle uops, but so does vpmovdw, which is also a lane-crossing shuffle with granularity finer than dword (32-bit). https://agner.org/optimize/ has a spreadsheet of SKX uops / ports, and https://uops.info/ has searchable HTML tables from automated testing, which avoids typos.
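A sketch of that 2-input truncating pack with _mm512_permutex2var_epi16 (the function name and the inputs a / b are hypothetical): word index 2*i selects the low word of dword i, and indices 32..63 take words from the second source operand.

__m512i pack32to16_512(__m512i a, __m512i b) {   // requires AVX512BW
    const __m512i idx = _mm512_set_epi16(62,60,58,56,54,52,50,48,46,44,42,40,38,36,34,32,
                                         30,28,26,24,22,20,18,16,14,12,10, 8, 6, 4, 2, 0);
    return _mm512_permutex2var_epi16(a, idx, b); // vpermt2w / vpermi2w
}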
The SSE2/AVX2 pack instructions like _mm256_packus_epi32 (vpackusdw) do signed or unsigned saturation, as well as operating within each 128-bit lane, unlike the lane-crossing behaviour of vpmovzxwd.
You could _mm256_and_si256 to clear the high 16 bits of each element before packing, though: with the high half zeroed, unsigned saturation can't kick in (e.g. 0x00012345 would otherwise saturate to 0xFFFF instead of truncating to 0x2345). That could be good if you have multiple input vectors, because packus_epi32 takes 2 input vectors and produces a 256-bit output.
a = H G F E | D C B A   32-bit signed elements, shown from high element to low, low 128-bit lane on the right
b = P O N M | L K J I
_mm256_packus_epi32(a, b)   gives 16-bit unsigned elements:
    P O N M H G F E | L K J I D C B A
    (elements from the first operand go to the low half of each lane)
If you can make efficient use of 2x vpand / vpackusdw ymm / vpermq ymm to get a 256-bit vector with all the elements in the right order (sketched below), that's probably best on Intel CPUs: only 2 shuffle uops (4 total uops) per 256 bits of results, and you get them in a single vector.
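A minimal sketch of that sequence (the function name and constant choice are mine, not from any library):

__m256i pack32to16_2x(__m256i a, __m256i b) {
    const __m256i mask = _mm256_set1_epi32(0x0000FFFF);  // keep the low 16 bits of each dword
    __m256i lo_a = _mm256_and_si256(a, mask);            // zero the high halves so unsigned
    __m256i lo_b = _mm256_and_si256(b, mask);            //   saturation can't kick in
    __m256i packed = _mm256_packus_epi32(lo_a, lo_b);    // per lane: a0..a3 b0..b3 | a4..a7 b4..b7
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3,1,2,0));  // 0xD8 fixes the lane interleave
}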
Or you can use SSSE3 / AVX2 vpshufb (_mm256_shuffle_epi8) to extract the bytes you want from a single input, and zero the other half of each 128-bit lane (by setting the shuffle-control byte for that element with its sign bit set). Then use AVX2 vpermq to shuffle data from the two lanes into just the low 128 bits.
// shuffle control: grab bytes 0,1 / 4,5 / 8,9 / 12,13 of each lane; -1 (sign bit set) zeros that output byte
const __m256i shuffle_mask_32_to_16 = _mm256_setr_epi8(0,1,4,5,8,9,12,13, -1,-1,-1,-1,-1,-1,-1,-1,
                                                       0,1,4,5,8,9,12,13, -1,-1,-1,-1,-1,-1,-1,-1);
__m256i trunc_elements = _mm256_shuffle_epi8(res256, shuffle_mask_32_to_16); // words in low 64 bits of each lane
__m256i ordered = _mm256_permute4x64_epi64(trunc_elements, 0x58);  // qwords 0 and 2 -> low 128 bits
__m128i result = _mm256_castsi256_si128(ordered);                  // no asm instructions, just a cast
So this is 2 uops per 128 bits of results, but both of the uops are shuffles that run only on port 5 on mainstream Intel CPUs that support AVX2. That's fine as part of a loop that does plenty of work that can keep port0 / port1 busy, or if you need each 128-bit chunk separately anyway.
For Ryzen/Excavator, lane-crossing vpermq is expensive (because they split 256-bit instructions into multiple 128-bit uops and don't have a real lane-crossing shuffle unit: http://agner.org/optimize/). So you'd want vextracti128 / vpor to combine. Or maybe vpunpcklqdq, so both lanes can use the same shuffle control, loaded with a set1_epi64 broadcast, instead of needing a full 256-bit vector constant to shuffle elements in the upper lane to the upper 64 bits of that lane.
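A hedged sketch of that vpunpcklqdq variant (the function name is mine): _mm256_set1_epi64x broadcasts one 64-bit pshufb control to all four qwords, so the wanted words end up duplicated into both halves of each lane, and vpunpcklqdq then takes the low qword of each 128-bit half.

__m128i pack32to16_no_vpermq(__m256i v) {
    const __m256i ctrl = _mm256_set1_epi64x(0x0D0C090805040100);  // byte indices 0,1,4,5,8,9,12,13
    __m256i shuf = _mm256_shuffle_epi8(v, ctrl);      // per lane: low word of each dword
    __m128i lo = _mm256_castsi256_si128(shuf);
    __m128i hi = _mm256_extracti128_si256(shuf, 1);   // vextracti128
    return _mm_unpacklo_epi64(lo, hi);                // vpunpcklqdq: combine the two lanes
}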
Upvotes: 7