Reputation: 23
For example 0x1230560181feab00
has 4 zero nibbles and 0x00000123456780ab
has 6. How can I compute this quickly without doing an naive loop and counting? Any cool bithacks for this?
Upvotes: 2
Views: 489
Reputation:
You might get some speedup by using a precomputed lookup-table of 256 or 65536 entries (1 byte or 1 short), telling you how many nibbles are zero. This costs you 256 or 64k bytes. A larger LUT would be unreasonable I guess.
A sheer speedup can be achieved if you allow SSE instructions, thanks to the magical _mm_movemask_epi8 operation, which packs the 16 MSb of 16 bytes to 16 bits.
You will need to mask out every other nibble, then compare the bytes to zero (_mm_cmpeq_epi8), then pack the bytes to bits, and use a precomputed LUT of 65536 entries telling how many zero bits there are.
Upvotes: 1
Reputation: 64903
Yes, first "gather the zeroness" of the whole nibble (the OR of all 4 bits):
x |= x >> 1;
x |= x >> 2;
Remove the junk:
x &= 0x1111111111111111UL;
Then just popcnt
that any way you want, if you have the instruction available that's great. Of course that gives the number of nibbles that isn't zero, but that's the same information, just subtract it from 16.
If you're going to use a fallback, some of them can be optimized because you know that every nibble is either 0 or 1, so for example in the typical approach that starts with ulong result = value - ((value >> 1) & 0x5555555555555555UL);
, you can skip two phases of reduction.
So you can use this:
x = (x + (x >> 4)) & 0xF0F0F0F0F0F0F0FUL;
count = (x * 0x101010101010101UL) >> 56
to do the final count.
Unfortunately trying to use that the nibbles are small there and therefore trying to remove the second to last reduction step by using a different multiplier only just doesn't work out - the final sum can be 16 but using (x * 0x1111111111111111UL) >> 60
cannot result in 16.
Upvotes: 4