poypoy

Reputation: 135

Extract 10-bit words from a bitstream

I need to extract all 10-bit words from a raw bitstream which is built as ABACABACABAC...

It already works with a naive C implementation like this:

for(uint8_t *ptr = in_packet; ptr < max; ptr += 5){
    const uint64_t val =
        (((uint64_t)(*(ptr + 4))) << 32) |
        (((uint64_t)(*(ptr + 3))) << 24) |
        (((uint64_t)(*(ptr + 2))) << 16) |
        (((uint64_t)(*(ptr + 1))) <<  8) |
        (((uint64_t)(*(ptr + 0))) <<  0) ;

    *a_ptr++ = (val >>  0);
    *b_ptr++ = (val >> 10);
    *a_ptr++ = (val >> 20);
    *c_ptr++ = (val >> 30);
}

But performance is inadequate for my application so I would like to improve this using some AVX2 optimisations.

I visited the website https://software.intel.com/sites/landingpage/IntrinsicsGuide/# to find functions that could help, but it seems there is nothing that works with 10-bit words, only 8- or 16-bit. That seems logical since 10-bit is not native for a processor, but it makes things hard for me.

Is there any way to use AVX2 to solve this problem?

Upvotes: 2

Views: 907

Answers (1)

Peter Cordes

Reputation: 365267

Your scalar loop does not compile efficiently. Compilers do it as 5 separate byte loads. You can express an unaligned 8-byte load in C++ with memcpy:

#include <stdint.h>
#include <string.h>

// do an 8-byte load that spans the 5 bytes we want
// clang auto-vectorizes using an AVX2 gather for 4 qwords.  Looks pretty clunky but not terrible
void extract_10bit_fields_v2calar(const uint8_t *__restrict src, 
   uint16_t *__restrict a_ptr, uint16_t *__restrict b_ptr, uint16_t *__restrict c_ptr,
   const uint8_t *max)
{
    for(const uint8_t *ptr = src; ptr < max; ptr += 5){
        uint64_t val;
        memcpy(&val, ptr, sizeof(val));

        const unsigned mask = (1U<<10) - 1; // unused in original source!?!
        *a_ptr++ = (val >>  0) & mask;
        *b_ptr++ = (val >> 10) & mask;
        *a_ptr++ = (val >> 20) & mask;
        *c_ptr++ = (val >> 30) & mask;
    }
}

ICC and clang auto-vectorize your 1-byte-at-a-time version, but do a very bad job (lots of insert/extract of single bytes). Here's your original and this function on Godbolt (with gcc and clang -O3 -march=skylake).

None of those 3 compilers are really close to what we can do manually.


Manual vectorization

My current AVX2 version of this answer forgot a detail: there are only 3 kinds of fields (ABAC), not 4 (ABCD) like 10-bit RGBA pixels. So I have a version of this which unpacks to 4 separate output streams (which I'll leave in because of the packed-RGBA use-case, in case I ever add a dedicated version for the ABAC interleave).

The existing version can use vpunpcklwd to interleave the two A parts instead of storing them with separate vmovq instructions; that should work for your case. There might be something more efficient, IDK.
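
An untested sketch of what that could look like, as a drop-in replacement for the store section of the 4-output loop further down in this answer (the function and output-pointer names here are mine):

#include <immintrin.h>
#include <stdint.h>

// UNTESTED sketch: store section for the ABAC layout, given the "res" vector produced
// by the 4-output loop below (16-bit elements, d3..d0 c3..c0 | b3..b0 a3..a0).
// In that naming, the OP's A stream is the interleave of the "a" and "c" fields
// (bits [0..9] and [20..29] of each group), B is "b", and C is "d".
static inline void store_ABAC(__m256i res, uint16_t *a_out, uint16_t *b_out, uint16_t *c_out)
{
    __m128i lo = _mm256_castsi256_si128(res);        // [ b3 b2 b1 b0 | a3 a2 a1 a0 ]
    __m128i hi = _mm256_extracti128_si256(res, 1);   // [ d3 d2 d1 d0 | c3 c2 c1 c0 ]
    __m128i A  = _mm_unpacklo_epi16(lo, hi);         // vpunpcklwd: a0 c0 a1 c1 a2 c2 a3 c3
    _mm_storeu_si128((__m128i*)a_out, A);            // 8 contiguous A values per 4 groups
    _mm_storeh_pi((__m64*)b_out, _mm_castsi128_ps(lo));   // b0..b3
    _mm_storeh_pi((__m64*)c_out, _mm_castsi128_ps(hi));   // d0..d3 = the OP's C stream
}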

BTW, I find it easier to remember and type instruction mnemonics, not intrinsic names. Intel's online intrinsics guide is searchable by instruction mnemonic.


Observations about your layout:

Each field spans one byte boundary, never two, so it's possible to assemble any 4 pairs of bytes in a qword that hold 4 complete fields.

Or with a byte shuffle, to create 2-byte words that each have a whole field at some offset. (e.g. for AVX512BW vpsrlvw, or for AVX2 2x vpsrld + word-blend.) A word shuffle like AVX512 vpermw would not be sufficient: some individual bytes need to be duplicated with the start of one field and the end of another. I.e. the source positions aren't all aligned words, especially when you have 2x 5 bytes inside the same 16-byte "lane" of a vector.

00-07|08-15|16-23|24-31|32-39     byte boundaries  (8-bit)
00...09|10..19|20...29|30..39     field boundaries (10-bit)

Luckily 8 and 10 have a GCD of 2 which is >= 10-8=2. 8*5 = 4*10 so we don't get all possible start positions, e.g. never a field starting at the last bit of 1 byte, spanning another byte, and including the first bit of a 3rd byte.

Possible AVX2 strategy: an unaligned 32-byte load that leaves 2x 5 bytes at the top of the low lane, and 2x 5 bytes at the bottom of the high lane. Then a vpshufb in-lane shuffle to set up for 2x vpsrlvd variable-count shifts, and a blend.

Quick summary of a new idea I haven't expanded yet.

Given an input of xxx a0B0A0C0 a1B1A1C1 | a2B2A2C2 a3B3A3C3 from our unaligned load, we can get a result of
a0 A0 a1 A1 B0 B1 C0 C1 | a2 A2 a3 A3 B2 B3 C2 C3 with the right choice of vpshufb control.
Then a vpermd can put all of those 32-bit groups into the right order, with all the A elements in the high half (ready for a vextracti128 to memory), and the B and C in the low half (ready for vmovq / vmovhps stores).

Use different vpermd shuffles for adjacent pairs so we can vpblendd to merge them for 128-bit B and C stores.


Old version, probably worse than unaligned load + vpshufb.

With AVX2, one option is to broadcast the containing 64-bit element to all positions in a vector and then use variable-count right shifts to get the bits to the bottom of a dword element.

You probably want to do a separate 64-bit broadcast-load for each group (thus partially overlapping with the previous), instead of trying to pick apart a __m256i of contiguous bits. (Broadcast-loads are cheap, shuffling is expensive.)

After _mm256_srlv_epi64, AND to isolate the low 10 bits in each qword.

Repeat that 4 times for 4 vectors of input, then use _mm256_packus_epi32 to do in-lane packing down to 32-bit then 16-bit elements.
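
As a concrete (untested) sketch of that per-group step, with a hypothetical helper name:

#include <immintrin.h>
#include <stdint.h>

// UNTESTED sketch: broadcast the qword containing one 40-bit group, shift a different
// field to the bottom of each 64-bit element, then mask to 10 bits.  Do this for
// 4 groups and pack the four results down to 16-bit elements as described above.
static inline __m256i extract_one_group(const uint8_t *p)
{
    __m256i bcast   = _mm256_castpd_si256(_mm256_broadcast_sd((const double*)p));
    __m256i shifted = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0));
    return _mm256_and_si256(shifted, _mm256_set1_epi64x((1U<<10) - 1));  // one field per qword
}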


That's the simple version. Optimizations of the interleaving are possible, e.g. by using left or right shifts to set up for vpblendd instead of a 2-input shuffle like vpackusdw or vshufps. _mm256_blend_epi32 is very efficient on existing CPUs, running on any port.

This also allows delaying the AND until after the first packing step because we don't need to avoid saturation from high garbage.

Design notes:

shown as 32-bit chunks after variable-count shifts
[0 d0 0 c0 | 0 b0 0 a0]      # after an AND mask
[0 d1 0 c1 | 0 b1 0 a1]

[0 d1 0 c1 0 d0 0 c0 | 0 b1 0 a1 0 b0 0 a0]   # vpackusdw
shown as 16-bit elements but actually the same as what vshufps can do

---------

[X d0 X c0 | X b0 X a0]    even the top element is only garbage right shifted by 30, not quite zero
[X d1 X c1 | X b1 X a1]

[d1 c1 d0 c0 | b1 a1 b0 a0 ]   vshufps  (can't do d1 d0 c1 c0 unfortunately)

---------

[X  d0  X c0 |  X b0  X a0]   variable-count >>  qword
[d1 X  c1  X | b1  X a1  0]   variable-count <<  qword

[d1 d0 c1 c0 | b1 b0 a1 a0]   vpblendd

This last trick extends to vpblendw, allowing us to do everything with interleaving blends, no shuffle instructions at all, resulting in the outputs we want contiguous and in the right order in qwords of a __m256i.

x86 SIMD variable-count shifts can only be left or right for all elements, so we need to make sure that all the data is either left or right of the desired position, not some of each within the same vector. We could use an immediate-count shift to set up for this, but even better is to just adjust the byte-address we load from. For loads after the first, we know it's safe to load some of the bytes before the first bitfield we want (without touching an unmapped page).

# as 16-bit elements
[X X X d0  X X X c0 | ...]    variable-count >> qword
[X X d1 X  X X c1 X | ...]    variable-count >> qword from an offset load that started with the 5 bytes we want all to the left of these positions

[X d2 X X  X c2 X X | ...]    variable-count << qword
[d3 X X X  c3 X X X | ...]    variable-count << qword

[X d2 X d0  X c2 X c0 | ...]   vpblendd
[d3 X d1 X  c3 X c1 X | ...]   vpblendd

[d3 d2 d1 d0   c3 c2 c1 c0 | ...] vpblendw  (Same behaviour in both high and low lane)

Then mask off the high garbage inside each 16-bit word.

Note: this does 4 separate outputs, like ABCD or RGBA->planar, not ABAC.

// potentially unaligned 64-bit broadcast-load, hopefully vpbroadcastq. (clang: yes, gcc: no)
// defeats gcc/clang folding it into an AVX512 broadcast memory source
// but vpsllvq's ymm/mem operand is the shift count, not data
static inline
__m256i bcast_load64(const uint8_t *p) {
    // hopefully safe with strict-aliasing since the deref is inside an intrinsic?
    __m256i bcast = _mm256_castpd_si256( _mm256_broadcast_sd( (const double*)p ) );
    return bcast;
}

// UNTESTED
// unpack 10-bit fields from 4x 40-bit chunks into 16-bit dst arrays
// overreads past the end of the last chunk by 1 byte
// for ABCD repeating, not ABAC, e.g. packed 10-bit RGBA
void extract_10bit_fields_4output(const uint8_t *__restrict src, 
   uint16_t *__restrict da, uint16_t *__restrict db, uint16_t *__restrict dc, uint16_t *__restrict dd,
   const uint8_t *max)
{
  // FIXME: cleanup loop for non-whole-vectors at the end    
  while( src<max ){
    __m256i bcast = bcast_load64(src);  // data we want is from bits [0 to 39], last starting at 30
    __m256i ext0 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0));  // place at bottom of each qword

    bcast = bcast_load64(src+5-2);        // data we want is from bits [16 to 55], last starting at 30+16 = 46
    __m256i ext1 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0));   // place it at bit 16 in each qword element

    bcast = bcast_load64(src+10);        // data we want is from bits [0 to 39]
    __m256i ext2 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32));   // place it at bit 32 in each qword element

    bcast = bcast_load64(src+15-2);        // data we want is from bits [16 to 55], last field starting at 46
    __m256i ext3 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32));   // place it at bit 48 in each qword element

    __m256i blend20 = _mm256_blend_epi32(ext0, ext2, 0b10101010);   // X d2 X d0  X c2 X c0 | X b2 ...
    __m256i blend31 = _mm256_blend_epi32(ext1, ext3, 0b10101010);   // d3 X d1 X  c3 X c1 X | b3 X ...

    __m256i blend3210 = _mm256_blend_epi16(blend20, blend31, 0b10101010);  // d3 d2 d1 d0   c3 c2 c1 c0 
    __m256i res = _mm256_and_si256(blend3210, _mm256_set1_epi16((1U<<10) - 1) );

    __m128i lo = _mm256_castsi256_si128(res);
    __m128i hi = _mm256_extracti128_si256(res, 1);
    _mm_storel_epi64((__m128i*)da, lo);     // movq store of the lowest 64 bits
    _mm_storeh_pi((__m64*)db, _mm_castsi128_ps(lo));       // movhps store of the high half of the low 128.  Efficient: no shuffle uop needed on Intel CPUs

    _mm_storel_epi64((__m128i*)dc, hi);
    _mm_storeh_pi((__m64*)dd, _mm_castsi128_ps(hi));       // clang pessimizes this to vpextrq :(
    da += 4;
    db += 4;
    dc += 4;
    dd += 4;
    src += 4*5;
  }
}

This compiles (Godbolt) to about 21 front-end uops (on Skylake) in the loop per 4 groups of 4 fields. (Including a useless register copy for _mm256_castsi256_si128 instead of just using the low half of ymm0 = xmm0.) This will be very good on Skylake. There's a good balance of uops for different ports, and variable-count shift is 1 uop for either p0 or p1 on SKL (vs. more expensive previously). The bottleneck might be just the front-end limit of 4 fused-domain uops per clock.

Replays of cache-line-split loads will happen because the unaligned loads will sometimes cross a 64-byte cache-line boundary. But that's just in the back-end, and we have a few spare cycles on ports 2 and 3 because of the front-end bottleneck (4 loads and 4 stores per set of results, with indexed stores which thus can't use port 7). If dependent ALU uops have to get replayed as well, we might start seeing back-end bottlenecks.

Despite the indexed addressing modes, there won't be unlamination because Haswell and later can keep indexed stores micro-fused, and the broadcast loads are a single pure uop anyway, not micro-fused ALU+load.

On Skylake, it can maybe come close to 4x 40-bit groups per 5 clock cycles (about 21 fused-domain uops at 4 per clock is ~5.25 cycles per iteration), if memory bandwidth isn't a bottleneck. (e.g. with good cache blocking.) Once you factor in overhead and the cost of cache-line-split loads causing occasional stalls, maybe 1.5 cycles per 40 bits of input, i.e. 6 cycles per 20 bytes of input on Skylake.

On other CPUs (Haswell and Ryzen), the variable-count shifts will be a bottleneck, but you can't really do anything about that. I don't think there's anything better. On HSW it's 3 uops: p5 + 2p0. On Ryzen it's only 1 uop, but it only has 1 per 2 clock throughput (for the 128-bit version), or per 4 clocks for the 256-bit version which costs 2 uops.

Beware that clang pessimizes the _mm_storeh_pi store to vpextrq [mem], xmm, 1: 2 uops, shuffle + store. (Instead of vmovhps: a pure store on Intel, no ALU uop). GCC compiles it as written.


I used _mm256_broadcast_sd even though I really want vpbroadcastq just because there's an intrinsic that takes a pointer operand instead of __m256i (because with AVX1, only the memory-source version existed. But with AVX2, register-source versions of all the broadcast instructions exist). To use _mm256_set1_epi64x, I'd have to write pure C that didn't violate strict aliasing (e.g. with memcpy) to do an unaligned uint64_t load. I don't think it will hurt performance to use an FP broadcast load on current CPUs, though.

I'm hoping _mm256_broadcast_sd allows its source operand to alias anything without C++ strict-aliasing undefined behaviour, the same way _mm256_loadu_ps does. Either way it will work in practice if it doesn't inline into a function that stores into *src, and maybe even then. So maybe a memcpy unaligned load would have made more sense!
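
If you'd rather avoid the aliasing question entirely, an untested memcpy-based version of bcast_load64 could look like this (the function name is mine; whether the compiler folds it back into a single broadcast-load is up to it):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// UNTESTED alternative to bcast_load64: strict-aliasing-safe unaligned load via memcpy,
// then an integer broadcast.
static inline __m256i bcast_load64_memcpy(const uint8_t *p)
{
    uint64_t qword;
    memcpy(&qword, p, sizeof(qword));    // unaligned 8-byte load, no strict-aliasing UB
    return _mm256_set1_epi64x(qword);    // hopefully vpbroadcastq
}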

I've had bad results in the past with getting compilers to emit pmovzxwd xmm0, [mem] from code like _mm_cvtepu16_epi32( _mm_loadu_si64(ptr) ); you often get an actual movq load + reg-reg pmovzx. That's why I didn't try the _mm256_broadcastq_epi64(__m128i) approach.


Old idea; if we already need a byte shuffle we might as well use plain word shifts instead of vpmultishift.

With AVX512VBMI (IceLake, CannonLake), you might want vpmultishiftqb. Instead of broadcasting / shifting one group at a time, we can do all the work for a whole vector of groups after putting the right bytes in the right places first.

You'd still need/want a version for CPUs with some AVX512 but not AVX512VBMI (e.g. Skylake-avx512). Probably vpermd + vpshufb can get the bytes we need into the 128-bit lanes we want.

I don't think we can get away with using only dword-granularity shifts to allow merge-masking instead of a dword blend after the qword shift. We might be able to merge-mask a vpblendw though, saving a vpblendd.

IceLake has 1/clock vpermw and vpermb, single-uop. (It has a 2nd shuffle unit on another port that handles some shuffle uops). So we can load a full vector that contains 4 or 8 groups of 4 elements and shuffle every byte into place efficiently. I think every CPU that has vpermb has it single-uop. (But that's only Ice Lake and the limited-release Cannon Lake).

vpermt2w (to combine 16-bit elements from 2 vectors into any order) has one per 2 clock throughput (InstLatx64 numbers for Ice Lake-Y), so unfortunately it's not as efficient as the one-vector shuffles.

Anyway, you might use it like this:

  • 64-byte / 512-bit load (includes some over-read at the end from 8x 8-byte groups instead of 8x 5-byte groups. Optionally use a zero-masked load to make this safe near the end of an array thanks to fault suppression)
  • vpermb to put the 2 bytes containing each field into desired final destination position.
  • vpsrlvw + vpandq to extract each 10-bit field into a 16-bit word

That's about 4 uops, not including the stores.

You probably want the high half containing the A elements for a contiguous vextracti64x4 and the low half containing the B and C elements for vmovdqu and vextracti128 stores.

Or 2x vpblendd to set up for 256-bit stores. (Use 2 different vpermb control vectors to create 2 different layouts.)

You shouldn't need vpermt2w or vpermt2d to combine adjacent vectors for wider stores.
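
Putting the pieces above together, here's an untested sketch of the AVX512VBMI version for the ABAC layout. The function and pointer names are mine, and the vpermb / shift-count control vectors are generated at runtime just to make the indexing obvious; real code would bake them into constants.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// UNTESTED sketch.  Processes 8 groups (40 bytes) per 64-byte load, so it over-reads
// up to 24 bytes past the last group; ngroups is assumed to be a multiple of 8.
// Per iteration the result vector holds: words 0..7  = B fields of groups 0..7,
// words 8..15 = C fields, words 16..31 = the two A fields of each group in order.
void extract_10bit_fields_ABAC_avx512vbmi(const uint8_t *src,
    uint16_t *a_out, uint16_t *b_out, uint16_t *c_out, size_t ngroups)
{
    uint8_t  perm[64];    // vpermb control: source byte index for each destination byte
    uint16_t shifts[32];  // per-word right-shift count for vpsrlvw
    for (int j = 0; j < 32; j++) {
        int group, field;                       // fields 0 and 2 = A, 1 = B, 3 = C
        if (j < 8)       { group = j;        field = 1; }
        else if (j < 16) { group = j - 8;    field = 3; }
        else             { group = (j-16)/2; field = ((j-16) & 1) ? 2 : 0; }
        int byte = group*5 + (field*10)/8;      // first of the 2 bytes holding the field
        perm[2*j]   = (uint8_t)byte;
        perm[2*j+1] = (uint8_t)(byte + 1);
        shifts[j]   = (uint16_t)((field*10) % 8);  // bit offset inside that 16-bit word: 0,2,4,6
    }
    __m512i vperm  = _mm512_loadu_si512(perm);
    __m512i vshift = _mm512_loadu_si512(shifts);
    __m512i vmask  = _mm512_set1_epi16((1U<<10) - 1);

    for (size_t g = 0; g < ngroups; g += 8) {
        __m512i bytes  = _mm512_loadu_si512(src + g*5);
        __m512i words  = _mm512_permutexvar_epi8(vperm, bytes);                       // vpermb
        __m512i fields = _mm512_and_si512(_mm512_srlv_epi16(words, vshift), vmask);   // vpsrlvw + vpandq

        _mm_storeu_si128((__m128i*)(b_out + g), _mm512_castsi512_si128(fields));          // B: 8 words
        _mm_storeu_si128((__m128i*)(c_out + g), _mm512_extracti32x4_epi32(fields, 1));    // C: 8 words
        _mm256_storeu_si256((__m256i*)(a_out + 2*g), _mm512_extracti64x4_epi64(fields, 1)); // A: 16 words
    }
}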

Without AVX512VBMI, probably a vpermd + vpshufb can get all the necessary bytes into each 128-bit chunk instead of vpermb. The rest of it only requires AVX512BW which Skylake-X has.

Upvotes: 7
