JGL
JGL

Reputation: 158

SIMD: more generic shuffle function

I think the SIMD shuffle fucntion is not real shuffle for int32_t case the left and right part would be shuffled separately.

I want a real shuffle function as following:

Assumed we got __m256i and we want to shuffle 8 int32_t.

__m256i to_shuffle = _mm256_set_epi32(17, 18, 20, 21, 25, 26, 29, 31);

const int imm8 = 0b10101100;

__m256i shuffled _mm256_shuffle(to_shuffle, imm8);

I hope the shuffled = {17, 20, 25, 26, -, -, -, -}, where the - represents the not relevant value and they can be anything. So I hope the int at the position with set bit with 1 would be placed in shuffled.

(In our case: 17, 20, 25, 26 are sitting at the positions with a 1 in the imm8).


Is such function offered by the Intel? How could such function be implemented efficiently?


EDIT: - could be ignored. Only the int with set bit 1 is needed.

Upvotes: 0

Views: 1176

Answers (1)

Peter Cordes
Peter Cordes

Reputation: 364428

(I'm assuming you got your immediate backwards (selector for 17 should be the low bit, not high bit) and your vectors are actually written in low-element-first order).

How could such function be implemented efficiently?

In this case with AVX2 vpermd ( _mm256_permutevar8x32_epi32 ). It needs a control vector not an immediate, to hold 8 selectors for the 8 output elements. So you'd have to load a constant and use that as the control operand.

Since you only care about the low half of your output vector, your vector constant can be only __m128i, saving space. vmovdqa xmm, [mem] zero-extends into the corresponding YMM vector. It's probably inconvenient to write this in C with intrinsics but _mm256_castsi128_si256 should work. Or even _mm256_broadcastsi128_si256 because a broadcast-load would be just as cheap. Still, some compilers might pessimize it to an actual 32-byte constant in memory by doing constant-propagation. If you know assembly, compiler output is frequently disappointing.

If you want to take an actual integer bitmap in your source, you could probably use C++ templates to convert that at compile time into the right vector constant. Agner Fog's Vector Class Library (now Apache-licensed, previously GPL) has some related things like that, turning integer constants into a single blend or sequence of blend instructions depending on the constant and what target ISA is supported, using C++ templates. But its shuffle template takes a list of indices, not a bitmap.

But I think you're trying to ask about why / how x86 shuffles are designed the way they are.


Is such function offered by the Intel?

Yes, in hardware with AVX512F (plus AVX512VL to use it on 256-bit vectors).

You're looking for vpcompressd, the vector-element equivalent of BMI2 pext. (But it takes the control operand as a mask register value, not an immediate constant.) The intrinsic is
__m256i _mm256_maskz_compress_epi32( __mmask8 c, __m256i a);
It's also available in a version that merges into the bottom of an existing vector instead of zeroing the top elements.


As an immediate shuffle, no.

All x86 shuffles use a control operand that has indices into the source, not a bitmap of which elements to keep. (Except vpcompressd/q and vpexpandd/q). Or they use an implicit control, like _mm256_unpacklo_epi32 for example which interleaves 32-bit elements from 2 inputs (in-lane in the low and high halves).

If you're going to provide a shuffle with a control operand at all, it's usually most useful if any element can end up at any position. So the output doesn't have to be in the same order as the input. Your compress shuffle doesn't have that property.

Also, having a source index for each output element is what shuffle hardware naturally wants. My understanding is that each output element is fed by its own MUX (multiplexer), where the MUX takes N input elements and one binary selector to select which one to output. (And is as wide as the element width of course.) See Where is VPERMB in AVX2? for more discussion of building muxers.

Having the control operand in some format other than a list of selectors would require preprocessing before it could be fed to shuffle hardware.

For an immediate, the format is either 2x1-bit or 4x2-bit fields, or a byte-shift count for _mm_bslli_si128 and _mm_alignr_epi8. Or index + zeroing bitmask for insertps. There are no SIMD instructions with an immediate wider than 8 bits. Presumably this keeps the hardware decoders simple.

(Or 1x1-bit for vextractf128 xmm, ymm, 0 or 1, which in hindsight would be better with no immediate at all. Using it with 0 is always worse than vmovdqa xmm, xmm. Although AVX512 does use the same opcode for vextractf32x4 with an EVEX prefix for the 1x2-bit immediate, so maybe this had some benefit for decoder complexity. Anyway, there are no immediate shuffles with selector fields wider than 2 bits because 8x 3-bit would be 24 bits.)

For wider 4x2 in-lane shuffles like _mm256_shuffle_ps (vshufps ymm, ymm, ymm, imm8), the same 4x2-bit selector pattern is reused for both lanes. For wider 2x1 in-lane shuffles like _mm256_shuffle_pd (vshufpd ymm, ymm, ymm, imm8), we get 4x 1-bit immediate fields that still select in-lane.

There are lane-crossing shuffles with 4x 2-bit selectors, vpermq and vpermpd. Those work exactly like pshufd xmm (_mm_shuffle_epi32) but with 4x qword elements across a 256-bit register instead of 4x dword elements across a 128-bit register.


As far as narrowing / only caring about part of the output:

A normal immediate would need 4x 3-bit selectors to each index one of the 8x 32-bit source elements. But much more likely 8x 3-bit selectors = 24 bits, because why design a shuffle instruction that can only ever write half a half-width output? (Other than vextractf128 xmm, ymm, 1).

General the paradigm for more-granular shuffles is to take a control vector, rather than some funky immediate encoding.

AVX512 did add some narrowing shuffles like VPMOVDB xmm/[mem], x/y/zmm that truncate (or signed/unsigned saturate) 32-bit elements down to 8-bit. (And all other combinations of sizes are available).

They're interesting because they're available with a memory destination. Perhaps this is motivated by some CPUs (like Xeon Phi KNL / KNM) not having AVX512VL, so they can only use AVX512 instructions with ZMM vectors. Still, they have AVX1 and 2 so you could compress into an xmm reg and use a normal VEX-encoded store. But it does allow doing a narrow byte-masked store with AVX512F, which would only be possible with AVX512BW if you had the packed data in an XMM register.

There are some 2-input shuffles like shufps that treat the low and high half of the output separately, e.g. the low half of the output can select from elements of the first source, the high half of the output can select from elements of the second source register.

Upvotes: 2

Related Questions