Denis Yaroshevskiy

Reputation: 1347

Compact storage of shuffle vectors: unpacking 4 bytes to shuffle uint32_t elements with a byte-shuffle

I have cross-architecture code that looks up a shuffle by index, for moving uint32_t elements within a vector. A whole vector constant is needed for each shuffle, but there are only 4 bytes of non-redundant information. (Or really 4x 2 bits of information, but that would be more expensive to unpack.)

On SSSE3-SSE4.2 I use _mm_shuffle_epi8, and on ARM it's the table-lookup intrinsics.


However, right now I store the whole shuffle mask, aka control vector. For example, for the identity shuffle of int elements I store: 0x0f0e0d0c0b0a09080706050403020100

I would like to just store 0x03020100, with each unique shuffle control element stored in a single byte / uint8_t.

Is there an efficient way to get from one to the other? convert + multiply seems a bit heavy.

Upvotes: 1

Views: 229

Answers (1)

Peter Cordes

Reputation: 365256

Store your packed LUT with each byte holding the starting byte number, so you don't need to scale them up.
Broadcast each control index into the bytes of the corresponding element (1 fixed shuffle), then add a constant set1_epi32(0x03020100) to offset them.

  __m128i v = _mm_cvtsi32_si128(shuffle_lut[i]);

  v = _mm_shuffle_epi8(v, _mm_set_epi32(0x03030303, 0x02020202, 0x01010101, 0x00000000));  // broadcast each byte to all 4 bytes of its dword
  v = _mm_add_epi8(v, _mm_set1_epi32(0x03020100));   // offset the byte indices within each dword

  // v is your shuffle-control vector, usable with another pshufb,
  // as if you'd just unpacked lut[i]>>2 (per byte) to dwords for vpermilps

The identity shuffle is stored as 0x0c080400. 0x0c + 0x03 = 0x0f in the top byte of the top element.

I guess your LUT in C is actually declared as uint32_t shuffle_lut[], in which case you don't have to worry about doing strict-aliasing-safe dword loads. Intrinsics support for narrow loads is dicey, but _mm_cvtsi32_si128 for movd is easy to use. It takes a value (not an address), so in C terms the memory access happens in pure C. The compiler can still fold the load into a memory operand for movd, though.
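If it helps to see the whole thing in one place, here's a minimal compilable sketch; the function name, LUT contents, and the final apply step are mine, not from the question's code:

  #include <immintrin.h>
  #include <stdint.h>

  static const uint32_t shuffle_lut[] = {
      0x0c080400,   // identity: starting byte offset of each dword
      // ... one packed 4-byte entry per shuffle
  };

  __m128i apply_shuffle(__m128i data, unsigned i) {
      __m128i v = _mm_cvtsi32_si128((int)shuffle_lut[i]);   // movd load of 4 packed indices
      v = _mm_shuffle_epi8(v, _mm_set_epi32(0x03030303, 0x02020202,
                                            0x01010101, 0x00000000));  // broadcast bytes to dwords
      v = _mm_add_epi8(v, _mm_set1_epi32(0x03020100));      // turn starting offsets into full byte indices
      return _mm_shuffle_epi8(data, v);                     // the actual element shuffle
  }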


BTW, I assume you said up to SSE4.2 because AVX1 has _mm_permutevar_ps (vpermilps), so a _mm_cvtepu8_epi32 (pmovzxbd) can unpack a 4-byte load for that without further modification. That uses dword indices, not byte indices, so you would store the identity shuffle as 0x03020100 for this.

Unfortunately, getting the compiler to emit a memory-source vpmovzxbd xmm0, [rdi] instruction from intrinsics code is a pain with compilers other than clang. They often fail to fold the movd or movq load intrinsic into a memory source operand, but you have to use that, not a full __m128i load, if you don't want to read past the end of a buffer in a debug build. See Loading 8 chars from memory into an __m256 variable as packed single precision floats for actual compiler results from a few years ago.
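For reference, the AVX1 path in one piece (a sketch under the caveat above about whether the load folds; dword_lut and the function name are placeholders, not from the question):

  #include <immintrin.h>
  #include <stdint.h>

  static const uint32_t dword_lut[] = {
      0x03020100,   // identity: one dword index per byte
      // ...
  };

  __m128 apply_shuffle_avx(__m128 data, unsigned i) {
      __m128i bytes = _mm_cvtsi32_si128((int)dword_lut[i]);  // movd
      __m128i ctrl  = _mm_cvtepu8_epi32(bytes);              // pmovzxbd: zero-extend bytes to dwords
      return _mm_permutevar_ps(data, ctrl);                  // vpermilps
  }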


AVX2 or BMI2+AVX packing into a single byte

There are really only 2 bits of information per shuffle index, so four indices can be packed into 1 byte (uint8_t).

One way to unpack is with BMI2 integer pdep, i.e. _pdep_u32(lut[i], 0x03030303). Then vmovd / vpmovzxbd / vpermilps. Perhaps the pdep could even be replaced by a multiply by a constant, since vpermilps only cares about the low 2 bits of each dword element.
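A sketch of that idea (hedged: packed_lut is a hypothetical uint8_t array with four 2-bit indices per byte, and this needs BMI2 plus SSE4.1/AVX):

  #include <immintrin.h>
  #include <stdint.h>

  __m128 apply_shuffle_pdep(__m128 data, const uint8_t *packed_lut, unsigned i) {
      uint32_t spread = _pdep_u32(packed_lut[i], 0x03030303);  // deposit each 2-bit field into its own byte
      __m128i ctrl = _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)spread));  // vmovd + vpmovzxbd
      return _mm_permutevar_ps(data, ctrl);                    // vpermilps
  }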

But pdep is very slow on AMD before Zen 3. And even on Intel, bouncing through an integer register first adds a significant amount of latency.

Another option is using an AVX2 variable-shift to bring the appropriate 2 bits to the bottom of each dword element. Start with a broadcast load of the byte. Or, more efficiently in most cases (except cache-line splits), a dword broadcast, which CPUs can do "for free" in a load port, with no separate ALU shuffle uop needed. (https://uops.info/)

It's a pain to avoid strict-aliasing UB for that: e.g. _mm_set1_epi32( *(uint32_t*) &lut[i] ) isn't safe. But there is an intrinsic that takes a pointer, _mm_broadcast_ss.

  // Make sure lut[] doesn't end right at the end of a page,
  // so we can broadcast-load 4 bytes starting at any byte offset in it;
  // i.e. pad it by 3 bytes if needed.
  __m128i v = _mm_castps_si128( _mm_broadcast_ss( (const float*)&lut[i] ));

  // alternative:  __m128i v = _mm_set1_epi8( lut[i] );  // vpbroadcastb is an extra shuffle uop, but a narrower load

  v = _mm_srlv_epi32(v, _mm_set_epi32(6, 4, 2, 0));  // bring each 2-bit index to the bottom of its dword

  // ready for _mm_permutevar_ps:
  // the low 2 bits of each 32-bit element of v are correct
It's not necessary to mask with _mm_and_si128; vpermilps doesn't care about high garbage in the control-vector elements.
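Putting it together, the control vector from the snippet above feeds straight into the shuffle. A minimal usage sketch, assuming elems is the __m128i holding your four uint32_t elements (that name is mine, not from the question):

  __m128 data = _mm_castsi128_ps(elems);   // free reinterpret, no instruction emitted
  __m128i shuffled = _mm_castps_si128(_mm_permutevar_ps(data, v));   // vpermilps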

Note that there's no XMM version of AVX2 vpermd, so even with AVX2 available, vpermilps is still the best choice of variable-control shuffle with 32-bit granularity.

(Unless you want to widen your whole algorithm to 8 elements in a __m256i, then yeah use lane-crossing vpermd aka _mm256_permutexvar_epi32. But then you need 8 x 3 bits of shuffle-control data = 3 bytes not 1. And then there are probably still too many possibilities to make a LUT for.)


Upvotes: 3
