Reputation: 25
Is there any efficient algorithm (or processor instruction) that will help split a number (32-bit and 64-bit) into several numbers, each of which has only a single bit set?
I want to isolate each set bit in a number. For example,
input:
01100100
output:
01000000
00100000
00000100
The only thing that comes to mind is number & mask.
Assembly or C++ is fine.
Upvotes: 0
Views: 450
Reputation: 364118
If you are going to iterate over the single-bit-isolated masks one at a time, generating them one at a time is efficient; see @harold's answer.
But if you truly just want all the masks, x86 with AVX512F can usefully parallelize this. (At least potentially useful depending on surrounding code. More likely this is just a fun exercise in applying AVX512 and not useful for most use-cases).
The key building block is AVX512F vpcompressd: given a mask (e.g. from a SIMD compare) it will shuffle the selected dword elements to contiguous elements at the bottom of a vector.
An AVX512 ZMM / __m512i vector holds 16x 32-bit integers, so we only need 2 vectors to hold every possible single-bit mask. Our input number is a mask that selects which of those elements should be part of the output. (No need to broadcast it into a vector and vptestmd or anything like that; we can just kmov it into a mask register and use it directly.)
See also my AVX512 answer on "AVX2 what is the most efficient way to pack left based on a mask?"
#include <stdint.h>
#include <immintrin.h>
// suggest 64-byte alignment for out_array
// returns count of set bits = length stored
unsigned bit_isolate_avx512(uint32_t out_array[32], uint32_t x)
{
    const __m512i bitmasks_lo = _mm512_set_epi32(
        1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
        1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
        1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
        1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0
    );
    const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16); // compilers actually do constprop and load another 64-byte constant, but this is more readable in the source.

    __mmask16 set_lo = x;
    __mmask16 set_hi = x >> 16;

    int count_lo = _mm_popcnt_u32(set_lo); // doesn't actually cost a kmov, __mmask16 is really just uint16_t
    _mm512_mask_compressstoreu_epi32(out_array, set_lo, bitmasks_lo);
    _mm512_mask_compressstoreu_epi32(out_array + count_lo, set_hi, bitmasks_hi);
    return _mm_popcnt_u32(x);
}
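A quick caller (my untested addition, not from the original answer) to show the expected output for the question's example; note the masks come out lowest bit first. Build with -march=skylake-avx512 (or at least -mavx512f -mpopcnt):

#include <stdio.h>

int main(void)
{
    uint32_t out[32];
    unsigned n = bit_isolate_avx512(out, 0x64);   // 01100100 from the question
    for (unsigned i = 0; i < n; i++)
        printf("%08x\n", out[i]);                 // 00000004, 00000020, 00000040
    return 0;
}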
bit_isolate_avx512 compiles nicely with clang on Godbolt, and with gcc other than a couple of minor sub-optimal choices with mov, movzx, and popcnt, and making a frame pointer for no reason. (It also compiles with -march=knl; it doesn't depend on AVX512BW or DQ.)
# clang9.0 -O3 -march=skylake-avx512
bit_isolate_avx512(unsigned int*, unsigned int):
movzx ecx, si
popcnt eax, esi
shr esi, 16
popcnt edx, ecx
kmovd k1, ecx
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_0] # zmm0 = [1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768]
vpcompressd zmmword ptr [rdi] {k1}, zmm0
kmovd k1, esi
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_1] # zmm0 = [65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648]
vpcompressd zmmword ptr [rdi + 4*rdx] {k1}, zmm0
vzeroupper
ret
On Skylake-AVX512, vpcompressd zmm{k1}, zmm is 2 uops for port 5. Latency from input vector -> output is 3 cycles, but latency from input mask -> output is 6 cycles. (https://www.uops.info/table.html / https://www.uops.info/html-instr/VPCOMPRESSD_ZMM_K_ZMM.html). The memory-destination version is 4 uops: 2p5 + the usual store-address and store-data uops which can't micro-fuse when part of a larger instruction.
It might be better to compress into a ZMM reg and then store, at least for the first compress, to save total uops. The 2nd should probably still take advantage of the masked-store feature of vpcompressd [mem]{k1} so the output array doesn't need padding for it to step on. IDK if that helps with cache-line splits, i.e. whether masking can avoid replaying the store uop for the part with an all-zero mask in the 2nd cache line.
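An untested sketch of that register-compress variant (same headers and constants as the function above; the zero-masking compress leaves zeros above count_lo, which the second store then overwrites):

#include <stdint.h>
#include <immintrin.h>

unsigned bit_isolate_avx512_reg(uint32_t out_array[32], uint32_t x)
{
    const __m512i bitmasks_lo = _mm512_set_epi32(
        1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
        1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
        1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
        1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0);
    const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16);

    __mmask16 set_lo = x;
    __mmask16 set_hi = x >> 16;
    int count_lo = _mm_popcnt_u32(set_lo);

    // vpcompressd into a register (2 uops) + one plain store (1 fused-domain uop)
    // instead of the 4-uop memory-destination compress. This writes zeros up to
    // out_array[15], which is fine because the caller provides 32 elements.
    __m512i lo = _mm512_maskz_compress_epi32(set_lo, bitmasks_lo);
    _mm512_storeu_si512(out_array, lo);

    // keep the masked compress-store for the high half so nothing is written
    // past out_array[count_lo + popcount(set_hi) - 1] by this store
    _mm512_mask_compressstoreu_epi32(out_array + count_lo, set_hi, bitmasks_hi);
    return _mm_popcnt_u32(x);
}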
On KNL, vpcompressd zmm{k1} is only a single uop. Agner Fog didn't test it with a memory destination (https://agner.org/optimize/).
This is 14 fused-domain uops for the front-end on Skylake-X for the real work (e.g. after inlining into a loop over multiple x values, so we could hoist the vmovdqa64 loads out of the loop. Otherwise that's another 2 uops). So front-end bottleneck = 14 / 4 = 3.5 cycles.
Back-end port pressure: 6 uops for port 5 (2x kmov(1) + 2x vpcompressd(2)): 1 iteration per 6 cycles. (Even on Ice Lake (instlatx64), vpcompressd is still 2c throughput, unfortunately, so apparently ICL's extra shuffle port doesn't handle either of those uops. And kmovw k, r32 is still 1/clock, so presumably still port 5 as well.)
(Other ports are fine: popcnt runs on port 1, and that port's vector ALU is shut down when 512-bit uops are in flight, but not its scalar ALU, the only one that handles 3-cycle-latency integer instructions. movzx dword, word can't be eliminated (only movzx dword, byte can be), but it runs on any port.)
Latency: the integer result is just one popcnt (3 cycles). The first part of the memory result is stored about 7 cycles after the mask is ready (kmov -> vpcompressd). The vector source for vpcompressd is a constant so OoO exec can get it ready plenty early unless it misses in cache.
Compacting the 1<<0..15 constant by building it with a shift would be possible: e.g. load 16-byte _mm_setr_epi8(0..15) with vpmovzxbd, then use that with vpsllvd on a vector of set1(1) (which you can get from a broadcast or generate on the fly with vpternlogd + shift). But that's probably not worth it even if you're writing by hand in asm (so it's your choice instead of the compiler's), since this already uses a lot of shuffles, and constant-generation would take at least 3 or 4 instructions (each of which is at least 6 bytes long; EVEX prefixes alone are 4 bytes each).
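For reference, an untested intrinsics sketch of that constant-generation idea (the helper name is my own):

#include <immintrin.h>

static inline __m512i make_bitmasks_lo(void)
{
    // 16-byte 0..15 constant, zero-extended to dwords (vpmovzxbd), then 1 << i per element (vpsllvd)
    __m128i counts8 = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m512i counts  = _mm512_cvtepu8_epi32(counts8);
    return _mm512_sllv_epi32(_mm512_set1_epi32(1), counts);
}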
I would generate the hi part with a shift from lo, instead of loading it separately, though. Unless the surrounding code bottlenecks hard on port 0, an ALU uop isn't worse than a load uop. One 64-byte constant fills a whole cache line.
You could compress the lo constant with a vpmovzxwd load: each element fits in 16 bits. Worth considering if you can hoist that outside of a loop so it doesn't cost an extra shuffle per operation.
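An untested sketch of that, with the hi half derived by a shift as suggested above (the array and function names are my own illustration):

#include <stdint.h>
#include <immintrin.h>

static const uint16_t bitmask_words[16] = {
    1u<<0, 1u<<1, 1u<<2,  1u<<3,  1u<<4,  1u<<5,  1u<<6,  1u<<7,
    1u<<8, 1u<<9, 1u<<10, 1u<<11, 1u<<12, 1u<<13, 1u<<14, 1u<<15
};

static inline void load_bitmasks(__m512i *lo, __m512i *hi)
{
    __m256i words = _mm256_loadu_si256((const __m256i*)bitmask_words); // 32-byte load
    *lo = _mm512_cvtepu16_epi32(words);  // vpmovzxwd zmm, ymm: widen to 1<<0 .. 1<<15 dwords
    *hi = _mm512_slli_epi32(*lo, 16);    // 1<<16 .. 1<<31
}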
If you wanted the result in a SIMD vector instead of stored to memory, you could 2x vpcompressd into registers and maybe use count_lo to look up a shuffle-control vector for vpermt2d. Possibly from a sliding window on an array instead of 16x 64-byte vectors? But the result isn't guaranteed to fit in one vector unless you know your input had 16 or fewer bits set.
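A different way to get it into one vector (assuming at most 16 set bits) is to compress both halves into registers and then vpexpandd the high half into the upper positions; this compress+expand variation is my own untested sketch, not the vpermt2d lookup described above:

#include <stdint.h>
#include <immintrin.h>

// valid only when popcount(x) <= 16; isolated bits end up lowest-first in the result
static inline __m512i bit_isolate_to_vector(uint32_t x, unsigned *count)
{
    const __m512i bitmasks_lo = _mm512_set_epi32(
        1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
        1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
        1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
        1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0);
    const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16);

    __mmask16 set_lo = x;
    __mmask16 set_hi = x >> 16;
    unsigned count_lo = _mm_popcnt_u32(set_lo);

    __m512i lo = _mm512_maskz_compress_epi32(set_lo, bitmasks_lo);
    __m512i hi = _mm512_maskz_compress_epi32(set_hi, bitmasks_hi);

    // vpexpandd: place the compressed hi elements contiguously starting at
    // element count_lo, keeping the lo elements below that position.
    __mmask16 upper = (__mmask16)(0xFFFFu << count_lo);
    *count = _mm_popcnt_u32(x);
    return _mm512_mask_expand_epi32(lo, upper, hi);
}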
Things are much worse for 64-bit integers: 8x 64-bit elements means we need 8 vectors. So maybe not worth it vs. scalar, unless your inputs have lots of bits set.
You can do it in a loop, though, using vpslld by 8 to move bits in vector elements. You'd think kshiftrq would be good, but with 4-cycle latency that's a long loop-carried dep chain. And you need a scalar popcnt of each 8-bit chunk anyway to adjust the pointer, so your loop should use shr / kmov and movzx / popcnt. (Using a counter += 8 and bzhi to feed popcnt would cost more uops.) A rough sketch follows after the dependency list below.
The loop-carried dependencies are all short (and the loop only runs 8 iterations to cover a 64-bit mask), so out-of-order exec should be able to nicely overlap work for multiple iterations. Especially if we unroll by 2 so the vector and mask dependencies can get ahead of the pointer update. The loop-carried chains are:

- vpslld immediate, starting from the vector constant.
- shr r64, 8, starting with x. (Could stop looping when this becomes 0 after shifting out all the bits. This 1-cycle dep chain is short enough for OoO exec to zip through it and hide most of the mispredict penalty, when it happens.)
- lea rdi, [rdi + rax*4], where RAX holds a popcnt result.

The rest of the work is all independent across iterations. Depending on surrounding code, we probably bottleneck on port 5 with the vpcompressd shuffles and kmov.
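Here's a rough, untested C++ sketch of such a loop. I've used 64-bit elements (vpcompressq via _mm512_mask_compressstoreu_epi64, and vpsllq) so that masks for bits 32..63 fit in an output element; the scalar side handles 8 bits of x per iteration as described:

#include <stdint.h>
#include <immintrin.h>

// isolate each set bit of a 64-bit x into uint64_t masks, lowest bit first
unsigned bit_isolate64_avx512(uint64_t out_array[64], uint64_t x)
{
    // 1<<0 .. 1<<7 in qword elements; shifted left by 8 each iteration
    __m512i bitmasks = _mm512_set_epi64(1ULL << 7, 1ULL << 6, 1ULL << 5, 1ULL << 4,
                                        1ULL << 3, 1ULL << 2, 1ULL << 1, 1ULL << 0);
    unsigned total = 0;
    for (int chunk = 0; chunk < 8; chunk++) {
        __mmask8 m = (__mmask8)x;                  // low 8 bits select which masks to store
        _mm512_mask_compressstoreu_epi64(out_array + total, m, bitmasks);
        total += _mm_popcnt_u32(m);                // advance the output position
        x >>= 8;
        if (x == 0)                                // optional early-out once no set bits remain
            break;
        bitmasks = _mm512_slli_epi64(bitmasks, 8); // move to the next 8 bit positions
    }
    return total;
}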
Upvotes: 1
Reputation: 114461
The standard way is
while (num) {
    unsigned mask = num ^ (num & (num - 1)); // This will have just one bit set
    ...
    num ^= mask;
}
For example, starting with num = 2019 you will get, in order:
1
2
32
64
128
256
512
1024
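As a complete little demo (my addition, untested) with the loop body filled in by a printf:

#include <stdio.h>

int main(void)
{
    unsigned num = 2019;
    while (num) {
        unsigned mask = num ^ (num & (num - 1)); // lowest set bit of num
        printf("%u\n", mask);                    // prints 1 2 32 64 128 256 512 1024
        num ^= mask;                             // clear that bit
    }
    return 0;
}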
Upvotes: 1
Reputation: 64904
Yes, in a similar way to Brian Kernighan's algorithm for counting set bits, except instead of counting the bits we extract and use the lowest set bit in every intermediate result:
while (number) {
    // extract lowest set bit in number
    uint64_t m = number & -number;
    // use m
    ...
    // remove lowest set bit from number
    number &= number - 1;
}
In modern x64 assembly, number & -number may be compiled to blsi, and number &= number - 1 may be compiled to blsr, which are both fast, so this would only take a couple of efficient instructions to implement.
Since m is available, resetting the lowest set bit may be done with number ^= m, but that may make it harder for the compiler to see that it can use blsr, which is the better choice because it depends directly only on number and therefore shortens the loop-carried dependency chain.
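If you'd rather not rely on the compiler recognizing those patterns, GCC, clang and MSVC also expose BMI1 intrinsics for these instructions; an untested sketch using _blsi_u64 / _blsr_u64 (compile with -mbmi or a -march that includes BMI1), storing each isolated bit into an array:

#include <stdint.h>
#include <immintrin.h>

static unsigned isolate_bits_bmi(uint64_t out[64], uint64_t number)
{
    unsigned n = 0;
    while (number) {
        out[n++] = _blsi_u64(number); // blsi: number & -number
        number   = _blsr_u64(number); // blsr: number & (number - 1)
    }
    return n;
}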
Upvotes: 2