Reputation: 468
Is there an efficient way to convert a larger integer type down to a smaller integer type (with truncation of course) using AVX2 and SSE?
Such as:
I know AVX-512 has the instructions:
Which correspond to intrinsics like:
which handle integer type narrowing conversions but how do you accomplish this in AVX2 and SSE which have no such instructions?
Please note that while there are 128 and 256 bit overloads for the above AVX512 intrinsics, they still require AVX512 at runtime. I'm looking for ways to accomplish the same things using only AVX2 and/or SSE instructions.
I know this is a lot to ask; if you don't want to give me the full answer, I understand. Please just explain, or help me find, an algorithm I can replicate for each conversion.
Also, please specify whether your answer works for both signed and unsigned integers.
Thank you very much.
To clarify: I don't plan on unpacking this data later. I want to use vectorized narrowing conversions in a number of applications such as:
pretty much anything you can do with: int32_t y = some_func(); int8_t x = static_cast<int8_t>(y);
But vectorized. I've measured the performance of our code (sorry NDA, can't show any here) and found that we spend a lot of time on type conversion. Particularly narrowing conversions. The compiler does a pretty good job of vectorizing small->BIG conversions with sign extension using the instructions available with AVX2 but narrowing conversions don't seem to get vectorized much at all.
To clarify further: I want a means of casting that behaves like static_cast, i.e. with truncation.
Upvotes: 0
Views: 1355
Reputation: 21936
One good way is the pshufb instruction, exposed as the _mm_shuffle_epi8 and _mm256_shuffle_epi8 intrinsics. On modern CPUs that instruction is pretty fast: the latency is 1 cycle, and they can run 1 or 2 of them per clock, depending on the microarchitecture.
Here’s an example which converts int32 -> int16, untested. By changing the shuffle magic vector, you can implement the rest of the conversions with the same code.
// Convert int32 lanes into int16 with truncation
inline __m128i cvtepi32_epi16( __m256i x )
{
    // Make a shuffle constant vector
    // If you also need 16-byte version, move this to global variable,
    // to reuse the lower 16 bytes of the same constant
    const __m256i shuffle = _mm256_setr_epi8(
        0, 1, 4, 5, 8, 9, 12, 13,
        -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1,
        0, 1, 4, 5, 8, 9, 12, 13 );
    // Move these bytes
    // Unfortunately that instruction can only move within 16-byte halves.
    // Fortunately, it can selectively zero out bytes, so a bitwise OR is enough to combine
    x = _mm256_shuffle_epi8( x, shuffle );
    // Split into halves
    const __m128i low = _mm256_castsi256_si128( x );
    const __m128i high = _mm256_extracti128_si256( x, 1 );
    // Produce the result
    return _mm_or_si128( low, high );
}
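To illustrate what changing the shuffle constant might look like, here is a sketch of an int32 -> int8 variant built the same way. The names cvtepi32_epi8_sketch, narrow8_i32_to_i8 and the NARROW_AVX2 macro are made up for this example, and the macro is only an assumption that lets GCC/Clang build the snippet without -mavx2. Treat it as a starting point rather than a tuned implementation.

```cpp
#include <immintrin.h>
#include <cstdint>

// Assumption: GCC/Clang target attribute so the snippet builds without -mavx2.
// Drop the macro if AVX2 is already enabled for the whole build.
#if defined(__GNUC__)
#define NARROW_AVX2 __attribute__((target("avx2")))
#else
#define NARROW_AVX2
#endif

// Convert 8 int32 lanes into 8 int8 values with truncation.
// The result sits in the low 8 bytes of the returned vector, the upper 8 bytes are zero.
NARROW_AVX2 inline __m128i cvtepi32_epi8_sketch( __m256i x )
{
    // Low half: bytes 0, 4, 8, 12 go to positions 0..3, everything else is zeroed.
    // High half: bytes 0, 4, 8, 12 go to positions 4..7, everything else is zeroed.
    const __m256i shuffle = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, 0, 4, 8, 12,
        -1, -1, -1, -1, -1, -1, -1, -1 );
    x = _mm256_shuffle_epi8( x, shuffle );
    const __m128i low = _mm256_castsi256_si128( x );
    const __m128i high = _mm256_extracti128_si256( x, 1 );
    // The two halves occupy disjoint bytes, so OR merges them
    return _mm_or_si128( low, high );
}

// Hypothetical array wrapper: narrows src[0..7] into dst[0..7]
NARROW_AVX2 inline void narrow8_i32_to_i8( const int32_t* src, int8_t* dst )
{
    const __m256i v = _mm256_loadu_si256( reinterpret_cast<const __m256i*>( src ) );
    // Store only the low 8 bytes, which hold the 8 narrowed values
    _mm_storel_epi64( reinterpret_cast<__m128i*>( dst ), cvtepi32_epi8_sketch( v ) );
}
```

The only differences from the int32 -> int16 version are the indices in the constant and the fact that just 8 bytes of the result are meaningful.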
For SSE vectors it’s even more efficient than for AVX2: you only need a single _mm_shuffle_epi8 instruction. BTW, for both SSE and AVX, that instruction can load the second argument directly from memory, i.e. even if that code is not inlined, it does not require a separate load to fetch the permutation constant.
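A minimal sketch of that 128-bit case, assuming SSSE3 is available (pshufb is an SSSE3 instruction). The function names and the NARROW_SSSE3 macro are invented for illustration, and the macro is only there so GCC/Clang can build the snippet without -mssse3.

```cpp
#include <tmmintrin.h>  // SSSE3, provides _mm_shuffle_epi8
#include <cstdint>

// Assumption: GCC/Clang target attribute so the snippet builds without -mssse3
#if defined(__GNUC__)
#define NARROW_SSSE3 __attribute__((target("ssse3")))
#else
#define NARROW_SSSE3
#endif

// Convert 4 int32 lanes into 4 int16 values with truncation.
// The result sits in the low 8 bytes, the upper 8 bytes are zeroed by the -1 indices.
NARROW_SSSE3 inline __m128i cvtepi32_epi16_sse_sketch( __m128i x )
{
    const __m128i shuffle = _mm_setr_epi8(
        0, 1, 4, 5, 8, 9, 12, 13,
        -1, -1, -1, -1, -1, -1, -1, -1 );
    return _mm_shuffle_epi8( x, shuffle );
}

// Hypothetical array wrapper: narrows src[0..3] into dst[0..3]
NARROW_SSSE3 inline void narrow4_i32_to_i16( const int32_t* src, int16_t* dst )
{
    const __m128i v = _mm_loadu_si128( reinterpret_cast<const __m128i*>( src ) );
    // Store only the low 8 bytes, which hold the 4 narrowed values
    _mm_storel_epi64( reinterpret_cast<__m128i*>( dst ), cvtepi32_epi16_sse_sketch( v ) );
}
```

No split and OR step is needed here because the whole vector is a single 16-byte half.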
P.S. Signed/unsigned is irrelevant for truncation, the downcasting operation is identical between the two. Signed/unsigned only matters if you want to saturate instead of truncate, but static_cast doesn't saturate anything, it truncates.
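A small scalar check of that claim, as a sketch. The helper name same_bits_after_narrowing is made up. It assumes the usual two's-complement truncation for the signed cast, which C++20 guarantees and mainstream compilers implemented before that.

```cpp
#include <cstdint>
#include <cstring>

// Returns true when narrowing v to int8_t and to uint8_t keeps the same low byte
inline bool same_bits_after_narrowing( int32_t v )
{
    const int8_t s = static_cast<int8_t>( v );
    const uint8_t u = static_cast<uint8_t>( v );
    // Compare the raw byte, ignoring how each type interprets it
    return std::memcmp( &s, &u, 1 ) == 0;
}
```

Because the bytes match, one vector routine can serve both the signed and the unsigned conversion.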
Upvotes: 1
Reputation: 126203
The various PACK instructions convert int16->int8 or int32->int16 from a pair of mmx/xmm regs to a single reg, with either signed or unsigned saturation (for values that are out of range for the target type). There is no int64->int32 version, though.
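Since the question asks for truncation rather than saturation, one possible way to use the pack instructions is to first bring every lane into the range where saturation never triggers. The sketch below does that with SSE2 only. All function names are hypothetical, and this is one workaround rather than the only way to do it.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// 16 x int16 (two vectors) -> 16 x int8, truncating.
// Masking leaves each lane in 0..255, so unsigned saturation cannot change it.
inline __m128i trunc_epi16_epi8_sketch( __m128i a, __m128i b )
{
    const __m128i mask = _mm_set1_epi16( 0x00FF );
    a = _mm_and_si128( a, mask );
    b = _mm_and_si128( b, mask );
    return _mm_packus_epi16( a, b );
}

// 8 x int32 (two vectors) -> 8 x int16, truncating.
// Shift left then arithmetic shift right sign-extends the low 16 bits,
// so every lane fits in int16 and signed saturation cannot change it.
inline __m128i trunc_epi32_epi16_sketch( __m128i a, __m128i b )
{
    a = _mm_srai_epi32( _mm_slli_epi32( a, 16 ), 16 );
    b = _mm_srai_epi32( _mm_slli_epi32( b, 16 ), 16 );
    return _mm_packs_epi32( a, b );
}

// Hypothetical array wrapper: narrows src[0..15] into dst[0..15]
inline void narrow16_i16_to_i8( const int16_t* src, int8_t* dst )
{
    const __m128i a = _mm_loadu_si128( reinterpret_cast<const __m128i*>( src ) );
    const __m128i b = _mm_loadu_si128( reinterpret_cast<const __m128i*>( src + 8 ) );
    _mm_storeu_si128( reinterpret_cast<__m128i*>( dst ), trunc_epi16_epi8_sketch( a, b ) );
}

// Hypothetical array wrapper: narrows src[0..7] into dst[0..7]
inline void narrow8_i32_to_i16( const int32_t* src, int16_t* dst )
{
    const __m128i a = _mm_loadu_si128( reinterpret_cast<const __m128i*>( src ) );
    const __m128i b = _mm_loadu_si128( reinterpret_cast<const __m128i*>( src + 4 ) );
    _mm_storeu_si128( reinterpret_cast<__m128i*>( dst ), trunc_epi32_epi16_sketch( a, b ) );
}
```

Note that the 256-bit AVX2 pack instructions work within each 128-bit half, so a 256-bit version of this would need an extra cross-lane permute to put the results in order.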
Upvotes: 0