Dcoder
Dcoder

Reputation: 379

Need help understanding bit reset function

I am new to C and find it difficult to comprehend the assembly level operations here. Can someone please help with that?

/**
 * Input: bitmap - u32bits*
 *        bitpos - position of the bit to be reset (range 0-31)
 * return: old value of the bit (0 if unset, 1 if set)
 **/
static inline u32bits resetbit(u32bits *bitmap, u32bits bitpos)
{
     u32bits oldbit;
    __asm__ __volatile__ (
         "btr    %2, (%1)\n" /* bit test and reset */
         "sbbl   %0, %0\n"   /* return the previous value*/
        : "=r"(oldbit)      /* "0" output  parameter */
        :                   /* input parameters */
           "0"(bitmap),      /* "1" */
           "r"(bitpos)       /* "2" */
        : "%cc", "memory"   /* clobbered registers */
        );
    return oldbit;
}

Upvotes: 1

Views: 372

Answers (2)

Peter Cordes
Peter Cordes

Reputation: 365257

This is https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

This needs volatile and the "memory" clobber because bitmap is only specified as a pointer input in a register, not a "+m"(*bitmap) memory operand. But note that bitpos can index outside *bitmap, so it's actually accessing the u32 at bitmap[bitpos>>5].

The "cc" (Condition Codes) clobber is unnecessary; x86 inline asm already implicitly clobbers condition codes. The "0"(bitmap) input constraint is a "matching" constraint that means the bitmap input must be in the same register as operand 0, the "=r"(oldbit) output. That seems unnecessary, gcc could already do that if it wanted to just using "r"(bitmap). But perhaps they wanted to avoid a false dependency on the output register: only some AMD CPUs treat sbb same,same as depending only on CF, not on the register value.

You could write this without volatile like so, by using a read/write memory operand that tells the compiler the whole array is an input and might be modified. (But not any other memory, so it can still keep other globals in registers.) Casting to a pointer-to-array and dereferencing is kind of a hack, but (officially I think) supported by GCC.

We can also use GCC6 syntax for condition-code outputs from asm, instead of inlining the SBB. Often you just want to branch on this, or use it as a predicate for cmovcc or setcc, so a 0/-1 result would have to be turned back into EFLAGS with a compiler-generated test instruction.

typedef unsigned u32bits;

#ifndef __GCC_ASM_FLAG_OUTPUTS__
#error flag outputs unsupported
#endif

static inline 
u32bits resetbit(u32bits *bitmap, u32bits bitpos)
{
     u32bits oldbit;
    __asm__ (
         "btrl   %[bitidx], %[base]\n"  /* bit test and reset */
        : "=@ccc"(oldbit)      // output = c condition (carry set)
          ,[base]   "+m"( *(u32bits(*)[])bitmap )   // whole "array" of arbitrary length is a read/write input
        :  [bitidx] "Kr"(bitpos)           // signed imm8 or register
        :
        );
    oldbit = -oldbit;  // if you insist on a 0 / -1 result
    return oldbit;
}

If I had used [base] "+m"( *bitmap ) for the memory operand, gcc would assume that bitmap[4] = tmp; ahead of the asm was not an input, and could delay that until after the asm if it wanted to. Casting to pointer-to-array and dereferencing tells the compiler the whole array, or anything reachable through that pointer, is an input to the asm. (Even with a "memory" clobber, this is important for not optimizing away without asm volatile. But it also lets us avoid a "memory" clobber.) Exactly describing the inputs and outputs of an asm statement to the compiler is almost always better than using asm volatile, because sometimes it can optimize.

Note the "Kr" constraint that allows the bit-position to be an immediate. This leads to much better code (but still maybe sub-optimal vs. not using inline asm at all) if the bit-index is a small compile time constant after inlining. I had to use btrl to make the operand-size explicit in case neither operand is a register.

Compiles with gcc8.2 -O3 to a somewhat clunky version. (But would be fine if inlining into something like if(resetbit(a,b)) where the -oldbit could optimize away.)

# gcc8.2 -O3 for x86-64 System V
resetbit(unsigned int*, unsigned int):
        btr   %esi, (%rdi)          # inline asm

        setc    %al                 # compiler generated
        movzbl  %al, %eax
        negl    %eax
        ret

But clang7.0 doesn't support __GCC_ASM_FLAG_OUTPUTS__, but clang9.0 trunk does, and chooses on its own to use SBB when it needs to return an integer this way.

# clang version 9.0.0 (trunk 354240) -O3 for x86-64 System V
resetbit(unsigned int*, unsigned int):
        btrl    %esi, (%rdi)        # inline asm

        sbbl    %eax, %eax          # compiler generated
        retq

A test caller like this can inline it. !! booleanizes back to 0 / 1, removing the need for neg.

int test_smallconst(u32bits *bitmap) { return !!resetbit(bitmap, 125);  }
# clang9 trunk -O3
test_smallconst(unsigned int*):
        xorl    %eax, %eax        # compiler generate setup for setcc
        btrl    $125, (%rdi)      # inline asm

        setb    %al               # compiler generated
        retq

Note that memory-destination btr mem, reg is very slow

That's because of the very unusual behaviour of treating the memory operand as a base address for a bit-index, instead of the actual dword of memory to be addressed. (memory-destination bt* instructions don't mask a register bit-index like they do for a register destination.) On modern x86, it can be faster to emulate btr with instructions that calculate the bit index and mask with a shift. So writing it with pure C is probably a win.

Upvotes: 1

user4520
user4520

Reputation: 3457

The BTR instruction will store the selected bit in EFLAGS.CF and the clear that bit in the source operand (bitmap).

Next, SBB is executed, and oldbit is specified as both source and destination. SBB will subtract the source and the value of the CF flag from the destination operand. Essentially, here, we're doing: oldbit = oldbit - oldbit - CF. As you can see, if CF is 0 (and remember, CF is set by the BTR instruction depending on whether the specified bit was set in the bitmap), oldbit will be set to 0, since the operation will be effectively: oldbit = oldbit - oldbit - 0.

Otherwise oldbit will be set to -1 (since CF = 1), which has all the bits set and signals that the original bit was set too.

Upvotes: 2

Related Questions