Reputation: 468

How do I convert from larger integers to smaller integers using AVX2 and SSE?

Is there an efficient way to convert a larger integer type down to a smaller integer type (with truncation of course) using AVX2 and SSE?

Such as:

int16 -> int8
int32 -> int16 / int32 -> int8
int64 -> int32 / int64 -> int16 / int64 -> int8

I know AVX-512 has the instructions:

vpmovqb
vpmovwb

Which correspond to intrinsics like:

_mm512_cvtepi16_epi8 (AVX512 Byte and Word ISA)
_mm512_cvtepi32_epi8 (AVX512 Foundation)
_mm512_cvtepi32_epi16 (AVX512 Foundation)
_mm512_cvtepi64_epi8 (AVX512 Foundation)
_mm512_cvtepi64_epi16 (AVX512 Foundation)
_mm512_cvtepi64_epi32 (AVX512 Foundation)

These handle narrowing integer conversions, but how do you accomplish the same thing with AVX2 and SSE, which have no such instructions?

Please note that while there are 128 and 256 bit overloads for the above AVX512 intrinsics, they still require AVX512 at runtime. I'm looking for ways to accomplish the same things using only AVX2 and/or SSE instructions.

I know this is a lot to ask; if you don't want to give me the full answer, I understand. Just please explain or help me find an algorithm I can replicate for each conversion.

Also please specify if your answer will work for both signed and unsigned integers or otherwise.

Thank you very much.

To clarify; I don't plan on unpacking this data later. I want to use vectorized narrowing conversions in a number of applications such as:

Code page conversion (e.g. UTF32 -> UTF8) (I know it's not as simple as converting the data type)
Database translation (connecting 2 completely different databases and transferring data between them where the data types of the cells in the tables are different)
Fast Math approximations
and more!

pretty much anything you can do with: int32_t y = some_func(); int8_t x = static_cast<int8_t>(y);

But vectorized. I've measured the performance of our code (sorry NDA, can't show any here) and found that we spend a lot of time on type conversion. Particularly narrowing conversions. The compiler does a pretty good job of vectorizing small->BIG conversions with sign extension using the instructions available with AVX2 but narrowing conversions don't seem to get vectorized much at all.

One more clarification: I want a means of casting that behaves like static_cast, i.e. with truncation.

Upvotes: 0

Answers (2)

Soonts

Reputation: 21956

One good way is the pshufb instruction, exposed as the _mm_shuffle_epi8 and _mm256_shuffle_epi8 intrinsics. On modern CPUs that instruction is fast: the latency is 1 cycle, and many recent cores can run 2 of them per clock.

Here’s an example which converts int32 -> int16, untested. By changing the shuffle magic vector, you can implement the rest of the conversions with the same code.

#include <immintrin.h>

// Convert int32 lanes into int16 with truncation
inline __m128i cvtepi32_epi16( __m256i x )
{
    // Make a shuffle constant vector
    // If you also need 16-byte version, move this to global variable,
    // to reuse the lower 16 bytes of the same constant
    const __m256i shuffle = _mm256_setr_epi8(
        0, 1, 4, 5, 8, 9, 12, 13,
        -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1,
        0, 1, 4, 5, 8, 9, 12, 13 );

    // Move these bytes
    // Unfortunately that instruction can only move within 16-byte halves.
    // Fortunately, it can selectively zero out bytes, so a bitwise OR is enough to combine
    x = _mm256_shuffle_epi8( x, shuffle );

    // Split into halves
    const __m128i low = _mm256_castsi256_si128( x );
    const __m128i high = _mm256_extracti128_si256( x, 1 );

    // Produce the result
    return _mm_or_si128( low, high );
}
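
As a sketch of what changing the shuffle vector could look like, here is an int32 -> int8 variant of the same idea. The names cvtepi32_epi8 and narrow8_i32_to_i8 and the NARROW_AVX2 macro are made up for this illustration. The macro is an assumption about the toolchain: it only exists so GCC/Clang can build the functions without -mavx2, and the CPU still has to support AVX2 at run time.

```cpp
#include <immintrin.h>
#include <cstdint>

// Assumption: GCC/Clang-style per-function target attribute; other compilers get an empty macro.
#if defined(__GNUC__) || defined(__clang__)
#define NARROW_AVX2 __attribute__((target("avx2")))
#else
#define NARROW_AVX2
#endif

// Convert 8 int32 lanes into int8 with truncation.
// The 8 result bytes land in the low half of the returned vector; the high 8 bytes are zero.
NARROW_AVX2 inline __m128i cvtepi32_epi8( __m256i x )
{
    // Low 16-byte half writes result bytes 0..3, high half writes result bytes 4..7.
    // An index of -1 zeroes the destination byte.
    const __m256i shuffle = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, 0, 4, 8, 12,
        -1, -1, -1, -1, -1, -1, -1, -1 );

    x = _mm256_shuffle_epi8( x, shuffle );

    // The two halves occupy disjoint bytes, so OR merges them
    const __m128i low = _mm256_castsi256_si128( x );
    const __m128i high = _mm256_extracti128_si256( x, 1 );
    return _mm_or_si128( low, high );
}

// Hypothetical array helper: reads 8 int32 values, writes 8 int8 values.
NARROW_AVX2 inline void narrow8_i32_to_i8( const int32_t* src, int8_t* dst )
{
    const __m256i v = _mm256_loadu_si256( reinterpret_cast<const __m256i*>( src ) );
    _mm_storel_epi64( reinterpret_cast<__m128i*>( dst ), cvtepi32_epi8( v ) );
}
```

The same pattern should carry over to int64 -> int32 by selecting bytes 0..3 and 8..11 of each half instead.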

For SSE vectors it’s even more efficient than for AVX2: you only need a single _mm_shuffle_epi8 instruction. BTW for both SSE and AVX, that instruction can load the second argument directly from memory, i.e. even if that code is not inlined, it does not require a separate load to fetch the permutation constant.
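
A minimal sketch of the 128-bit case, assuming SSSE3 is available (pshufb is an SSSE3 instruction). The function names and the NARROW_SSSE3 macro are hypothetical; the macro is only there so GCC/Clang can build the code without -mssse3.

```cpp
#include <immintrin.h>
#include <cstdint>

// Assumption: GCC/Clang-style per-function target attribute; other compilers get an empty macro.
#if defined(__GNUC__) || defined(__clang__)
#define NARROW_SSSE3 __attribute__((target("ssse3")))
#else
#define NARROW_SSSE3
#endif

// Truncate 4 int32 lanes to int16. Result is in the low 8 bytes, the high 8 bytes are zero.
NARROW_SSSE3 inline __m128i cvtepi32_epi16_sse( __m128i x )
{
    const __m128i shuffle = _mm_setr_epi8(
        0, 1, 4, 5, 8, 9, 12, 13,
        -1, -1, -1, -1, -1, -1, -1, -1 );
    return _mm_shuffle_epi8( x, shuffle );
}

// Truncate 4 int32 lanes to int8. Result is in the low 4 bytes, the rest is zero.
NARROW_SSSE3 inline __m128i cvtepi32_epi8_sse( __m128i x )
{
    const __m128i shuffle = _mm_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1 );
    return _mm_shuffle_epi8( x, shuffle );
}

// Hypothetical array helper: reads 4 int32 values, writes 4 int16 values.
NARROW_SSSE3 inline void narrow4_i32_to_i16( const int32_t* src, int16_t* dst )
{
    const __m128i v = _mm_loadu_si128( reinterpret_cast<const __m128i*>( src ) );
    _mm_storel_epi64( reinterpret_cast<__m128i*>( dst ), cvtepi32_epi16_sse( v ) );
}

// Hypothetical array helper: reads 4 int32 values, writes 4 int8 values.
NARROW_SSSE3 inline void narrow4_i32_to_i8( const int32_t* src, int8_t* dst )
{
    const __m128i v = _mm_loadu_si128( reinterpret_cast<const __m128i*>( src ) );
    const int32_t packed = _mm_cvtsi128_si32( cvtepi32_epi8_sse( v ) );
    // Copy the 4 result bytes out of the low 32-bit lane
    for ( int i = 0; i < 4; ++i )
        dst[i] = static_cast<int8_t>( static_cast<uint32_t>( packed ) >> ( 8 * i ) );
}
```

Note that one shuffle only fills part of the output register here; to produce a full 16-byte result you would narrow two (or four) inputs and merge them, for example with _mm_unpacklo_epi64.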

P.S. Signed/unsigned is irrelevant for truncation, the downcasting operation is identical between the two. Signed/unsigned only matters if you want to saturate instead of truncate, but static_cast doesn't saturate anything, it truncates.
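
To illustrate the difference in scalar code, here is a small sketch; all function names are made up for this example. It assumes the usual two's complement wrapping behavior for the signed cast, which mainstream compilers implement and which C++20 guarantees (before C++20 an out-of-range conversion to a signed type was implementation-defined).

```cpp
#include <algorithm>
#include <cstdint>

// Truncation keeps the low 8 bits; the bit pattern is the same for int8_t and uint8_t
inline uint8_t truncate_to_byte( int32_t v )
{
    return static_cast<uint8_t>( v );
}

// Saturation clamps to the target range, so the signed and unsigned versions differ
inline int8_t saturate_signed( int32_t v )
{
    return static_cast<int8_t>( std::max<int32_t>( -128, std::min<int32_t>( 127, v ) ) );
}

inline uint8_t saturate_unsigned( int32_t v )
{
    return static_cast<uint8_t>( std::max<int32_t>( 0, std::min<int32_t>( 255, v ) ) );
}

// True when the signed and unsigned truncations produce identical bits
inline bool same_bits( int32_t v )
{
    return static_cast<uint8_t>( static_cast<int8_t>( v ) ) == static_cast<uint8_t>( v );
}
```

For an input of 300, truncate_to_byte gives 44 (0x12C keeps only 0x2C), while saturate_signed gives 127 and saturate_unsigned gives 255.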

Upvotes: 1

Chris Dodd

Reputation: 126526

The various PACK instructions convert int16->int8 or int32->int16 from a pair of mmx/xmm regs to a single reg, with either signed or unsigned saturation (for values that are out of range for the target type). There is no int64->int32 version though.
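
Since the question asks for truncation rather than saturation, one possible workaround is to force every lane into the target range before packing, so the saturation never triggers. The sketch below uses only SSE2 intrinsics; the function names are hypothetical and this is one way to do it, not necessarily the fastest.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// int16 -> int8 with truncation: two vectors of 8 x int16 become one vector of 16 x int8.
inline __m128i trunc_epi16_epi8( __m128i a, __m128i b )
{
    const __m128i lowByte = _mm_set1_epi16( 0x00FF );
    // After masking, every lane is in [0, 255], so the unsigned saturation is a no-op
    return _mm_packus_epi16( _mm_and_si128( a, lowByte ), _mm_and_si128( b, lowByte ) );
}

// int32 -> int16 with truncation: two vectors of 4 x int32 become one vector of 8 x int16.
inline __m128i trunc_epi32_epi16( __m128i a, __m128i b )
{
    // Sign-extend the low 16 bits of each lane; the values then fit in int16,
    // so the signed saturation of packssdw is a no-op
    a = _mm_srai_epi32( _mm_slli_epi32( a, 16 ), 16 );
    b = _mm_srai_epi32( _mm_slli_epi32( b, 16 ), 16 );
    return _mm_packs_epi32( a, b );
}
```

Unlike a single 128-bit pshufb, each call here consumes two input vectors and fills the whole output register. The AVX2 versions of the pack instructions work within each 128-bit half, so a 256-bit variant would need an extra cross-lane permute to restore element order.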

Upvotes: 0
