Reputation: 25
Is there any efficient algorithm (or processor instruction) that will help split a number (32-bit and 64-bit) into several numbers, each of which has only a single bit set?
I want to isolate each set bit in a number. For example,
input:
01100100
output:
01000000
00100000
00000100
The only thing that comes to mind is number & mask.
Assembly or C++ is fine.
Upvotes: 0
Views: 450
Reputation: 364118
If you are going to iterate over the single-bit-isolated masks one at a time, generating them one at a time is efficient; see @harold's answer.
But if you truly just want all the masks, x86 with AVX512F can usefully parallelize this. (At least potentially useful depending on surrounding code. More likely this is just a fun exercise in applying AVX512 and not useful for most use-cases).
The key building block is AVX512F vpcompressd: given a mask (e.g. from a SIMD compare) it will shuffle the selected dword elements to contiguous elements at the bottom of a vector.
An AVX512 ZMM / __m512i vector holds 16x 32-bit integers, so we only need 2 vectors to hold every possible single-bit mask. Our input number is a mask that selects which of those elements should be part of the output. (No need to broadcast it into a vector and vptestmd or anything like that; we can just kmov it into a mask register and use it directly.)
See also my AVX512 answer on "AVX2 what is the most efficient way to pack left based on a mask?"
#include <stdint.h>
#include <immintrin.h>
// suggest 64-byte alignment for out_array
// returns count of set bits = length stored
unsigned bit_isolate_avx512(uint32_t out_array[32], uint32_t x)
{
    const __m512i bitmasks_lo = _mm512_set_epi32(
        1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
        1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
        1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
        1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0
    );
    const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16); // compilers actually do constprop and load another 64-byte constant, but this is more readable in the source.

    __mmask16 set_lo = x;
    __mmask16 set_hi = x >> 16;

    int count_lo = _mm_popcnt_u32(set_lo); // doesn't actually cost a kmov, __mmask16 is really just uint16_t
    _mm512_mask_compressstoreu_epi32(out_array, set_lo, bitmasks_lo);
    _mm512_mask_compressstoreu_epi32(out_array + count_lo, set_hi, bitmasks_hi);
    return _mm_popcnt_u32(x);
}
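A quick caller (my untested addition, not from the original answer) to show the expected output for the question's example; note the masks come out lowest bit first. Build with -march=skylake-avx512 (or at least -mavx512f -mpopcnt):

#include <stdio.h>

int main(void)
{
    uint32_t out[32];
    unsigned n = bit_isolate_avx512(out, 0x64);   // 01100100 from the question
    for (unsigned i = 0; i < n; i++)
        printf("%08x\n", out[i]);                 // 00000004, 00000020, 00000040
    return 0;
}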
bit_isolate_avx512 compiles nicely with clang on Godbolt, and with gcc other than a couple of minor sub-optimal choices with mov, movzx, and popcnt, and making a frame pointer for no reason. (It also compiles with -march=knl; it doesn't depend on AVX512BW or DQ.)
# clang9.0 -O3 -march=skylake-avx512
bit_isolate_avx512(unsigned int*, unsigned int):
movzx ecx, si
popcnt eax, esi
shr esi, 16
popcnt edx, ecx
kmovd k1, ecx
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_0] # zmm0 = [1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768]
vpcompressd zmmword ptr [rdi] {k1}, zmm0
kmovd k1, esi
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_1] # zmm0 = [65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648]
vpcompressd zmmword ptr [rdi + 4*rdx] {k1}, zmm0
vzeroupper
ret
On Skylake-AVX512, vpcompressd zmm{k1}, zmm is 2 uops for port 5. Latency from input vector -> output is 3 cycles, but latency from input mask -> output is 6 cycles. (https://www.uops.info/table.html / https://www.uops.info/html-instr/VPCOMPRESSD_ZMM_K_ZMM.html). The memory-destination version is 4 uops: 2p5 + the usual store-address and store-data uops which can't micro-fuse when part of a larger instruction.
It might be better to compress into a ZMM reg and then store, at least for the first compress, to save total uops. The 2nd should probably still take advantage of the masked-store feature of vpcompressd [mem]{k1} so the output array doesn't need padding for it to step on. IDK if that helps with cache-line splits, i.e. whether masking can avoid replaying the store uop for the part with an all-zero mask in the 2nd cache line.
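An untested sketch of that register-compress variant (same headers and constants as the function above; the zero-masking compress leaves zeros above count_lo, which the second store then overwrites):

#include <stdint.h>
#include <immintrin.h>

unsigned bit_isolate_avx512_reg(uint32_t out_array[32], uint32_t x)
{
    const __m512i bitmasks_lo = _mm512_set_epi32(
        1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
        1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
        1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
        1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0);
    const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16);

    __mmask16 set_lo = x;
    __mmask16 set_hi = x >> 16;
    int count_lo = _mm_popcnt_u32(set_lo);

    // vpcompressd into a register (2 uops) + one plain store (1 fused-domain uop)
    // instead of the 4-uop memory-destination compress. This writes zeros up to
    // out_array[15], which is fine because the caller provides 32 elements.
    __m512i lo = _mm512_maskz_compress_epi32(set_lo, bitmasks_lo);
    _mm512_storeu_si512(out_array, lo);

    // keep the masked compress-store for the high half so nothing is written
    // past out_array[count_lo + popcount(set_hi) - 1] by this store
    _mm512_mask_compressstoreu_epi32(out_array + count_lo, set_hi, bitmasks_hi);
    return _mm_popcnt_u32(x);
}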
On KNL, vpcompressd zmm{k1} is only a single uop. Agner Fog didn't test it with a memory destination (https://agner.org/optimize/).
This is 14 fused-domain uops for the front-end on Skylake-X for the real work (e.g. after inlining into a loop over multiple x values, so we could hoist the vmovdqa64 loads out of the loop. Otherwise that's another 2 uops). So front-end bottleneck = 14 / 4 = 3.5 cycles.
Back-end port pressure: 6 uops for port 5 (2x kmov(1) + 2x vpcompressd(2)): 1 iteration per 6 cycles. (Even on Ice Lake (instlatx64), vpcompressd is still 2c throughput, unfortunately, so apparently ICL's extra shuffle port doesn't handle either of those uops. And kmovw k, r32 is still 1/clock, so presumably still port 5 as well.)
(Other ports are fine: popcnt runs on port 1, and that port's vector ALU is shut down when 512-bit uops are in flight, but not its scalar ALU, the only one that handles 3-cycle-latency integer instructions. movzx dword, word can't be eliminated (only movzx dword, byte can be), but it runs on any port.)
Latency: the integer result is just one popcnt (3 cycles). The first part of the memory result is stored about 7 cycles after the mask is ready (kmov -> vpcompressd). The vector source for vpcompressd is a constant so OoO exec can get it ready plenty early unless it misses in cache.
Compacting the 1<<0..15 constant by building it with a shift would be possible: e.g. load 16-byte _mm_setr_epi8(0..15) with vpmovzxbd, then use that with vpsllvd on a vector of set1(1) (which you can get from a broadcast or generate on the fly with vpternlogd + shift). But that's probably not worth it even if you're writing by hand in asm (so it's your choice instead of the compiler's), since this already uses a lot of shuffles, and constant-generation would take at least 3 or 4 instructions (each of which is at least 6 bytes long; EVEX prefixes alone are 4 bytes each).
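For reference, an untested intrinsics sketch of that constant-generation idea (the helper name is my own):

#include <immintrin.h>

static inline __m512i make_bitmasks_lo(void)
{
    // 16-byte 0..15 constant, zero-extended to dwords (vpmovzxbd), then 1 << i per element (vpsllvd)
    __m128i counts8 = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m512i counts  = _mm512_cvtepu8_epi32(counts8);
    return _mm512_sllv_epi32(_mm512_set1_epi32(1), counts);
}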
I would generate the hi part with a shift from lo, instead of loading it separately, though. Unless the surrounding code bottlenecks hard on port 0, an ALU uop isn't worse than a load uop. One 64-byte constant fills a whole cache line.
You could compress the lo constant with a vpmovzxwd load: each element fits in 16 bits. Worth considering if you can hoist that outside of a loop so it doesn't cost an extra shuffle per operation.
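An untested sketch of that, with the hi half derived by a shift as suggested above (the array and function names are my own illustration):

#include <stdint.h>
#include <immintrin.h>

static const uint16_t bitmask_words[16] = {
    1u<<0, 1u<<1, 1u<<2,  1u<<3,  1u<<4,  1u<<5,  1u<<6,  1u<<7,
    1u<<8, 1u<<9, 1u<<10, 1u<<11, 1u<<12, 1u<<13, 1u<<14, 1u<<15
};

static inline void load_bitmasks(__m512i *lo, __m512i *hi)
{
    __m256i words = _mm256_loadu_si256((const __m256i*)bitmask_words); // 32-byte load
    *lo = _mm512_cvtepu16_epi32(words);  // vpmovzxwd zmm, ymm: widen to 1<<0 .. 1<<15 dwords
    *hi = _mm512_slli_epi32(*lo, 16);    // 1<<16 .. 1<<31
}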
If you wanted the result in a SIMD vector instead of stored to memory, you could 2x vpcompressd into registers and maybe use count_lo to look up a shuffle-control vector for vpermt2d. Possibly from a sliding window on an array instead of 16x 64-byte vectors? But the result isn't guaranteed to fit in one vector unless you know your input had 16 or fewer bits set.
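A different way to get it into one vector (assuming at most 16 set bits) is to compress both halves into registers and then vpexpandd the high half into the upper positions; this compress+expand variation is my own untested sketch, not the vpermt2d lookup described above:

#include <stdint.h>
#include <immintrin.h>

// valid only when popcount(x) <= 16; isolated bits end up lowest-first in the result
static inline __m512i bit_isolate_to_vector(uint32_t x, unsigned *count)
{
    const __m512i bitmasks_lo = _mm512_set_epi32(
        1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
        1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
        1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
        1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0);
    const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16);

    __mmask16 set_lo = x;
    __mmask16 set_hi = x >> 16;
    unsigned count_lo = _mm_popcnt_u32(set_lo);

    __m512i lo = _mm512_maskz_compress_epi32(set_lo, bitmasks_lo);
    __m512i hi = _mm512_maskz_compress_epi32(set_hi, bitmasks_hi);

    // vpexpandd: place the compressed hi elements contiguously starting at
    // element count_lo, keeping the lo elements below that position.
    __mmask16 upper = (__mmask16)(0xFFFFu << count_lo);
    *count = _mm_popcnt_u32(x);
    return _mm512_mask_expand_epi32(lo, upper, hi);
}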
Things are much worse for 64-bit integers: 8x 64-bit elements means we need 8 vectors. So maybe not worth it vs. scalar, unless your inputs have lots of bits set.
You can do it in a loop, though, using vpslld by 8 to move bits in vector elements. You'd think kshiftrq would be good, but with 4-cycle latency that's a long loop-carried dep chain. And you need a scalar popcnt of each 8-bit chunk anyway to adjust the pointer, so your loop should use shr / kmov and movzx / popcnt. (Using a counter += 8 and bzhi to feed popcnt would cost more uops.) A rough sketch follows after the dependency list below.
The loop-carried dependencies are all short (and the loop only runs 8 iterations to cover a 64-bit mask), so out-of-order exec should be able to nicely overlap work for multiple iterations. Especially if we unroll by 2 so the vector and mask dependencies can get ahead of the pointer update. The loop-carried chains are:

- vpslld immediate, starting from the vector constant.
- shr r64, 8, starting with x. (Could stop looping when this becomes 0 after shifting out all the bits. This 1-cycle dep chain is short enough for OoO exec to zip through it and hide most of the mispredict penalty, when it happens.)
- lea rdi, [rdi + rax*4], where RAX holds a popcnt result.

The rest of the work is all independent across iterations. Depending on surrounding code, we probably bottleneck on port 5 with the vpcompressd shuffles and kmov.
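Here's a rough, untested C++ sketch of such a loop. I've used 64-bit elements (vpcompressq via _mm512_mask_compressstoreu_epi64, and vpsllq) so that masks for bits 32..63 fit in an output element; the scalar side handles 8 bits of x per iteration as described:

#include <stdint.h>
#include <immintrin.h>

// isolate each set bit of a 64-bit x into uint64_t masks, lowest bit first
unsigned bit_isolate64_avx512(uint64_t out_array[64], uint64_t x)
{
    // 1<<0 .. 1<<7 in qword elements; shifted left by 8 each iteration
    __m512i bitmasks = _mm512_set_epi64(1ULL << 7, 1ULL << 6, 1ULL << 5, 1ULL << 4,
                                        1ULL << 3, 1ULL << 2, 1ULL << 1, 1ULL << 0);
    unsigned total = 0;
    for (int chunk = 0; chunk < 8; chunk++) {
        __mmask8 m = (__mmask8)x;                  // low 8 bits select which masks to store
        _mm512_mask_compressstoreu_epi64(out_array + total, m, bitmasks);
        total += _mm_popcnt_u32(m);                // advance the output position
        x >>= 8;
        if (x == 0)                                // optional early-out once no set bits remain
            break;
        bitmasks = _mm512_slli_epi64(bitmasks, 8); // move to the next 8 bit positions
    }
    return total;
}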
Upvotes: 1
Reputation: 114461
The standard way is
while (num) {
    unsigned mask = num ^ (num & (num - 1)); // This will have just one bit set
    ...
    num ^= mask;
}
For example, starting with num = 2019 you will get, in order:
1
2
32
64
128
256
512
1024
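As a complete little demo (my addition, untested) with the loop body filled in by a printf:

#include <stdio.h>

int main(void)
{
    unsigned num = 2019;
    while (num) {
        unsigned mask = num ^ (num & (num - 1)); // lowest set bit of num
        printf("%u\n", mask);                    // prints 1 2 32 64 128 256 512 1024
        num ^= mask;                             // clear that bit
    }
    return 0;
}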
Upvotes: 1
Reputation: 64904
Yes, in a similar way to Brian Kernighan's algorithm for counting set bits, except instead of counting the bits we extract and use the lowest set bit in every intermediate result:
while (number) {
    // extract lowest set bit in number
    uint64_t m = number & -number;
    // use m
    ...
    // remove lowest set bit from number
    number &= number - 1;
}
In modern x64 assembly, number & -number may be compiled to blsi, and number &= number - 1 may be compiled to blsr, which are both fast, so this would only take a couple of efficient instructions to implement.
Since m is available, resetting the lowest set bit may be done with number ^= m, but that may make it harder for the compiler to see that it can use blsr, which is the better choice because it depends directly only on number and therefore shortens the loop-carried dependency chain.
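If you'd rather not rely on the compiler recognizing those patterns, GCC, clang and MSVC also expose BMI1 intrinsics for these instructions; an untested sketch using _blsi_u64 / _blsr_u64 (compile with -mbmi or a -march that includes BMI1), storing each isolated bit into an array:

#include <stdint.h>
#include <immintrin.h>

static unsigned isolate_bits_bmi(uint64_t out[64], uint64_t number)
{
    unsigned n = 0;
    while (number) {
        out[n++] = _blsi_u64(number); // blsi: number & -number
        number   = _blsr_u64(number); // blsr: number & (number - 1)
    }
    return n;
}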
Upvotes: 2