Reputation: 115
In my code, I found that the processor spends most of its time in the function shown below. The goal of the loop is to find the Val1 value that satisfies the condition inside the loop. The variables Val1 and a are of type long long int (64 bit), and both are local non-static variables declared inside the function.
long long int findval(long long int x)
{
    long long int Val1, a = x;
    for (Val1 = 63; Val1 > 22; Val1--)
    {
        if (((a >> Val1) & 1) == 1)
            break;
    }
    return Val1;
}
Is there any other simple/optimized way to find out the Val1 value?
Upvotes: 1
Views: 2454
Reputation: 321
If you can use a GCC intrinsic then you could try something like this.
Please note that this assumes that x is not 0, because the result of __builtin_clzll()
is not defined when x is 0.
#include <limits.h>
#include <stddef.h>

long long int findval(long long int x)
{
    // Width of long long in bits
    size_t llsize = sizeof(long long) * CHAR_BIT;
    // Bit index of the most significant set bit:
    // the width minus one, minus the count of leading zeros
    return llsize - 1 - __builtin_clzll(x);
}
Upvotes: 1
Reputation: 364288
For some reason I thought this was tagged x86 and/or x86-64 at some point. My GNU C answer works with any ISA, but I focused on x86-specific intrinsics for MSVC, and how it compiles for x86 with GCC/clang. Unfortunately there isn't a fully portable way to do this efficiently, so it's definitely worth doing some #ifdef to take advantage of HW support for this operation on targets you care about.
It seems you want max(22, 63 - clz(x)), where clz is some count-leading-zeros function, e.g. in GNU C, __builtin_clzll(). 63 - clz(x) is the position of the MSB, when long long = int64_t like it does on x86.
Your Val1 > 22 loop condition becomes false at Val1 = 22, so that's the non-break way out of the loop if no set bit is found by then.
__builtin_clzll has undefined behaviour when its input is zero (so it can compile to 63 - a bsr instruction on x86). We can handle this and the lower bound of 22 by setting that bit in the input before running a bit-scan.
#include <limits.h>

inline int MSB_position_clamped (long long x)
{
    int maxpos = CHAR_BIT * sizeof(x) - 1;
    x |= 1LL << 22;        // avoid x==0 UB and make the result at least 22
    return maxpos - __builtin_clzll(x);
}
For MSVC, you'd want _BitScanReverse64 (slower on AMD) or 63 - _mm_lzcnt_u64 (requires BMI1). The _mm intrinsic version is available on all x86-64 compilers.
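For example, a minimal MSVC sketch of the same clamped idea (MSB_position_clamped_msvc is a name made up here for illustration; _BitScanReverse64 comes from <intrin.h> and is only available when targeting x64):

#include <intrin.h>

static inline int MSB_position_clamped_msvc(long long x)
{
    unsigned long idx;
    // OR in bit 22 so the input is never zero and the result is clamped at 22
    _BitScanReverse64(&idx, (unsigned long long)x | (1ULL << 22));
    return (int)idx;
}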
(As Mike points out, shift counts only need to be int. Wider shift counts are not helpful, especially when compiling for 32-bit machines where long long takes 2 registers).
This compiles efficiently for x86-64, especially with clang (Godbolt). We'd also expect it to inline efficiently to these 2 instructions.
# clang 9.0 -O3 for x86-64 System V
MSB_position_clamped:
        or      rdi, 4194304
        bsr     rax, rdi
        ret
(x86 legacy bit-scan instructions find the bit-position directly, like you want. BMI1 lzcnt is faster on AMD, but actually counts leading zeros so you do need to subtract it from the type width. Even when GCC uses BSR, it fails to optimize 63 - clz back into just BSR; it flips it twice.)
Note that negative 2's complement integers have their MSB set even though the only significant bits are lower. Are you sure you want a signed type for this?
If so, are you sure you don't want GNU C __builtin_clrsbll? (It returns the number of leading redundant sign bits in x, i.e. the number of bits following the most significant bit that are identical to it.) There's no single x86 instruction for it, but I assume it does it efficiently with a bit-scan on ~x and combines the results somehow.
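If that is what you want, a tiny sketch of how the builtin could be used (highest_differing_bit is a name made up here; for x == 0 or x == -1 there is no bit differing from the sign bit and the formula yields -1):

// Position of the highest bit that differs from the sign bit
// (one possible use of the builtin, not taken from the question).
static inline int highest_differing_bit(long long x)
{
    return 62 - __builtin_clrsbll(x);
}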
Also, if your original code was intended to be fully portable to all ISO C implementations, I'm not sure it's guaranteed that the sign bit shifts to lower bit positions. I wouldn't expect it for signed right shifts on a sign/magnitude C implementation. (ISO C leaves it up to the implementation whether right shifts on signed integer types are logical or arithmetic; sane / good quality implementations pick arithmetic. With 2's complement integers your code would work either way; you don't care whether it shifts in zeros or copies of the sign bit.)
Many CPUs (not just x86) have bit-scan instructions that do this in one hardware instruction, but it's AFAIK not possible to write portable C that will compile to such an instruction. ISO C hasn't bothered to add standard functions that can use such instructions when they exist. So the only good option is compiler-specific extensions. (Some compilers do recognize popcount loops, but with your loop stopping at 22 instead of 0, it's unlikely that it would fit the pattern for CLZ recognition if any compilers even look for that.) Some languages are better at this than C; notably Rust has very well designed integer primitives that include bit-scans.
GNU C __builtin_clzll() compiles to a hardware instruction on ISAs that have one, or falls back to calling a library function if not. (IDK how efficient the fallback is; it might use a byte or nibble at a time LUT instead of naive shifting.)
On 32-bit x86, __builtin_clzll uses bsr on the low and high halves and combines the results with cmov or a branch. The pure intrinsics like _BitScanReverse64 and _mm_lzcnt_u64 aren't available in 32-bit mode so you'd have to do that yourself if you use intrinsics instead of GNU C "portable" builtin functions.
32-bit code is not as nice as 64-bit code, but it's still non-looping. (And your loop gets very inefficient; GCC doesn't "think of" trying the high 32 bits in a separate loop before the low 32 bits, so it has to shrd / sar and then cmov based on a bit-test for the shift count being above 32 (Godbolt).) Clang still fully unrolls, and does take advantage of only testing the relevant half of the number.
Since you tagged this SIMD, x86 AVX512CD actually has an instruction for lzcnt on 2, 4, or 8x int64_t elements in one vector register: vplzcntq. The intrinsic is __m512i _mm512_lzcnt_epi64(__m512i a).
All real CPUs with any AVX512 support have AVX512CD.
On Skylake-X and Ice Lake, it decodes to a single uop with 4 cycle latency, 0.5 clock throughput. (https://uops.info/). (It looks like it runs on the same ports as FMA/mul/add FP instructions, probably using the same hardware that normalizes floating-point mantissas, an operation that also requires finding the MSB.)
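A rough sketch of what a manual-intrinsics version could look like (the function name is made up here; requires AVX512F + AVX512CD, e.g. -mavx512cd, and mirrors the clamping trick from MSB_position_clamped above):

#include <immintrin.h>

static inline __m512i MSB_position_clamped_x8(__m512i v)
{
    // OR in bit 22 of every element: avoids lzcnt(0) and clamps the result at 22
    v = _mm512_or_si512(v, _mm512_set1_epi64(1LL << 22));
    __m512i lz = _mm512_lzcnt_epi64(v);
    // 63 - lzcnt = bit index of the MSB for each int64_t element
    return _mm512_sub_epi64(_mm512_set1_epi64(63), lz);
}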
So hopefully GCC and clang can auto-vectorize code that uses __builtin_clzll when you compile with -march=skylake-avx512, or with -march=native on such machines.
Upvotes: 3
Reputation:
Warning: I wrote my answer the wrong way (first bit on the right), sorry. Anyway, the approaches can be easily adapted to the MSb.
You can shortcut the process by means of a lookup table. You precompute the index of the rightmost bit for all numbers from 0 to 2^k-1. You will process your number in slices of k bits at a time, and try the slices from right to left, until the slice is nonzero.
An interesting option is to map your long long to an array of eight bytes; the bytes correspond to a lookup-table of 256 entries. This way, you benefit from direct by-byte addressing.
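A rough sketch of the byte-wise LUT idea, adapted to the MSB and to the lower bound of 22 from the question (msb_lut, init_msb_lut and findval_lut are names made up for illustration):

static unsigned char msb_lut[256];   // msb_lut[b] = index of highest set bit in b (for b != 0)

static void init_msb_lut(void)
{
    for (int b = 1; b < 256; b++) {
        int m = 0;
        while ((b >> m) > 1)
            m++;
        msb_lut[b] = (unsigned char)m;
    }
}

static long long findval_lut(long long x)
{
    unsigned long long u = (unsigned long long)x;
    for (int byte = 7; byte >= 2; byte--) {          // bytes covering bits 63..16
        unsigned b = (u >> (byte * 8)) & 0xFF;
        if (b) {
            long long pos = byte * 8 + msb_lut[b];
            return pos > 22 ? pos : 22;              // clamp like the original loop
        }
    }
    return 22;                                       // no set bit above bit 22
}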
Processing by shorts is also possible, at the expense of a LUT of 65536 (64K) entries. The optimum might lie in between; there are cache effects.
Another useful approach is by dichotomy: mask out the 32 high order bits (or load the low int) and test for zero. Then with the nonzero part, mask out the 16 high order bits, and so on. In the end, use the LUT trick. With just 3 steps, you reduce from 64 bits to 8.
This is appropriate if the distribution of the bit index is uniform. If it is biased towards small values, sequential search can be better anyway.
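And a rough sketch of the dichotomy, reusing msb_lut from the sketch above (again, the names are made up for illustration):

static long long findval_dichotomy(long long x)
{
    unsigned long long u = (unsigned long long)x;
    int base = 0;
    if (u >> 32) { base += 32; u >>= 32; }      // keep whichever half is nonzero
    if (u >> 16) { base += 16; u >>= 16; }
    if (u >> 8)  { base += 8;  u >>= 8;  }
    long long pos = base + msb_lut[u & 0xFF];   // LUT finishes the last 8 bits
    return pos > 22 ? pos : 22;                  // clamp like the original loop
}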
Upvotes: 1
Reputation: 61993
First of all, keep in mind that just because you have found that the processor is spending most of the time on that function snippet, it does not mean that there is a problem with the snippet. Maybe you should try and find out why your code is invoking that snippet so often.
Secondly, since you have come here asking for help, you might as well show us everything you have, instead of a subset of what you have which you believe should be enough for us to figure out what is wrong. Most importantly, you really should show us exactly how your variables are declared and also exactly where they are declared. Are they function-local? Are they static? Could it be that you have declared something as volatile? Nothing is irrelevant, everything counts.
In any case, if we are to assume that the snippet can be optimized, then I'd say the following:
Your Val1 should not be a long long int, because its values are only in the range between 22 and 63. So, it should be an int instead.
(If for some reason Val1 must be calculated as a long long int, then copy it into another variable of type int before the loop, and use that variable in the loop.)
If you try that, then the compiler might be able to figure out that what you are trying to do is find the first non-zero bit within a range of bits, and replace your entire loop with a single machine instruction.
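For example, a sketch of the loop with just that one change (whether a given compiler actually performs this replacement is not guaranteed):

long long int findval(long long int x)
{
    long long int a = x;
    int Val1;                          // int is plenty for values in 22..63
    for (Val1 = 63; Val1 > 22; Val1--)
    {
        if (((a >> Val1) & 1) == 1)
            break;
    }
    return Val1;
}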
Upvotes: 2