Reputation: 379
I am new to C and find it difficult to comprehend the assembly level operations here. Can someone please help with that?
/**
 * Input:  bitmap - u32bits*
 *         bitpos - position of the bit to be reset (range 0-31)
 * return: old value of the bit (0 if unset, 1 if set)
 **/
static inline u32bits resetbit(u32bits *bitmap, u32bits bitpos)
{
    u32bits oldbit;
    __asm__ __volatile__ (
        "btr %2, (%1)\n"        /* bit test and reset */
        "sbbl %0, %0\n"         /* return the previous value */
        : "=r"(oldbit)          /* "0" output parameter */
        :                       /* input parameters */
          "0"(bitmap),          /* "1" */
          "r"(bitpos)           /* "2" */
        : "%cc", "memory"       /* clobbered registers */
    );
    return oldbit;
}
Upvotes: 1
Views: 372
Reputation: 365257
This is GNU C extended inline asm: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
This needs volatile and the "memory" clobber because bitmap is only specified as a pointer input in a register, not as a "+m"(*bitmap) memory operand. But note that bitpos can index outside *bitmap, so it's actually accessing the u32 at bitmap[bitpos>>5].
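That addressing can be modelled in plain C. The sketch below (the function name is illustrative, not from the answer) shows which dword and which bit within it a memory-destination bit-test actually touches:

```c
#include <stdint.h>

/* Illustrative C model of the CF result of "btr %reg, (mem)": the bit
 * lives in the dword at bitmap[bitpos >> 5], at position bitpos & 31
 * within that dword. */
static uint32_t testbit_c(const uint32_t *bitmap, uint32_t bitpos)
{
    return (bitmap[bitpos >> 5] >> (bitpos & 31)) & 1;
}
```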
The "cc" (Condition Codes) clobber is unnecessary; x86 inline asm already implicitly clobbers condition codes. The "0"(bitmap) input constraint is a "matching" constraint that means the bitmap input must be in the same register as operand 0, the "=r"(oldbit) output. That seems unnecessary; gcc could already do that if it wanted to, just using "r"(bitmap). But perhaps they wanted to avoid a false dependency on the output register: only some AMD CPUs treat sbb same,same as depending only on CF, not on the register value.
You could write this without volatile like so, by using a read/write memory operand that tells the compiler the whole array is an input and might be modified. (But not any other memory, so it can still keep other globals in registers.) Casting to a pointer-to-array and dereferencing is kind of a hack, but (officially, I think) supported by GCC.

We can also use GCC6 syntax for condition-code outputs from asm, instead of inlining the SBB. Often you just want to branch on this, or use it as a predicate for cmovcc or setcc, so a 0/-1 result would have to be turned back into EFLAGS with a compiler-generated test instruction.
typedef unsigned u32bits;

#ifndef __GCC_ASM_FLAG_OUTPUTS__
#error flag outputs unsupported
#endif

static inline
u32bits resetbit(u32bits *bitmap, u32bits bitpos)
{
    u32bits oldbit;
    __asm__ (
        "btrl %[bitidx], %[base]\n"            /* bit test and reset */
        : "=@ccc"(oldbit)                      // output = c condition (carry set)
        , [base] "+m"( *(u32bits(*)[])bitmap ) // whole "array" of arbitrary length is a read/write operand
        : [bitidx] "Kr"(bitpos)                // signed imm8 or register
    );
    oldbit = -oldbit;   // if you insist on a 0 / -1 result
    return oldbit;
}
If I had used [base] "+m"( *bitmap ) for the memory operand, gcc would assume that a bitmap[4] = tmp; ahead of the asm was not an input, and could delay that store until after the asm if it wanted to. Casting to pointer-to-array and dereferencing tells the compiler the whole array, or anything reachable through that pointer, is an input to the asm. (Even with a "memory" clobber, this is important for not optimizing away without asm volatile. But it also lets us avoid a "memory" clobber.) Exactly describing the inputs and outputs of an asm statement to the compiler is almost always better than using asm volatile, because it sometimes lets the compiler optimize.
Note the "Kr" constraint that allows the bit-position to be an immediate. This leads to much better code (but still maybe sub-optimal vs. not using inline asm at all) if the bit-index is a small compile-time constant after inlining. I had to use btrl to make the operand-size explicit in case neither operand is a register.

This compiles with gcc8.2 -O3 to a somewhat clunky version. (But it would be fine if inlining into something like if(resetbit(a,b)), where the -oldbit could optimize away.)
# gcc8.2 -O3 for x86-64 System V
resetbit(unsigned int*, unsigned int):
    btr     %esi, (%rdi)    # inline asm
    setc    %al             # compiler generated
    movzbl  %al, %eax
    negl    %eax
    ret
clang7.0 doesn't support __GCC_ASM_FLAG_OUTPUTS__, but clang9.0 trunk does, and it chooses on its own to use SBB when it needs to turn the flag into an integer this way.
# clang version 9.0.0 (trunk 354240) -O3 for x86-64 System V
resetbit(unsigned int*, unsigned int):
    btrl    %esi, (%rdi)    # inline asm
    sbbl    %eax, %eax      # compiler generated
    retq
A test caller like this can inline it. !! booleanizes back to 0 / 1, removing the need for neg.

int test_smallconst(u32bits *bitmap) { return !!resetbit(bitmap, 125); }
# clang9 trunk -O3
test_smallconst(unsigned int*):
    xorl    %eax, %eax      # compiler-generated setup for setcc
    btrl    $125, (%rdi)    # inline asm
    setb    %al             # compiler generated
    retq
btr mem, reg is very slow. That's because of the very unusual behaviour of treating the memory operand as a base address for a bit-index, instead of as the actual dword of memory to be accessed. (Memory-destination bt* instructions don't mask a register bit-index like they do for a register destination.) On modern x86, it can be faster to emulate btr with instructions that calculate the bit index and mask with a shift. So writing it with pure C is probably a win.
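Along those lines, a pure-C version might look like the sketch below (my code, not the answer's; it returns 0/1 per the question's doc comment rather than the 0/-1 of the sbb version, and like the btr memory form it indexes beyond *bitmap when bitpos >= 32):

```c
#include <stdint.h>

typedef uint32_t u32bits;

/* Pure-C equivalent of the asm resetbit: read the dword containing bit
 * `bitpos`, clear that bit, and return the old bit value (0 or 1). */
static inline u32bits resetbit_c(u32bits *bitmap, u32bits bitpos)
{
    u32bits *word = bitmap + (bitpos >> 5);
    u32bits mask  = (u32bits)1 << (bitpos & 31);
    u32bits old   = (*word & mask) ? 1 : 0;
    *word &= ~mask;
    return old;
}
```

Compilers can turn this into plain shift/and/and-not sequences without ever emitting a memory-destination btr.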
Upvotes: 1
Reputation: 3457
The BTR instruction will store the selected bit in EFLAGS.CF and then clear that bit in the source operand (bitmap).

Next, SBB is executed, with oldbit specified as both source and destination. SBB subtracts the source operand and the value of the CF flag from the destination operand. Essentially, here, we're doing: oldbit = oldbit - oldbit - CF.

As you can see, if CF is 0 (and remember, CF is set by the BTR instruction depending on whether the specified bit was set in the bitmap), oldbit will be set to 0, since the operation is effectively: oldbit = oldbit - oldbit - 0.

Otherwise oldbit will be set to -1 (since CF = 1), which has all the bits set and signals that the original bit was set too.
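That identity is easy to check in plain C. In the sketch below (names are mine), cf stands in for the carry flag:

```c
#include <stdint.h>

/* Models "sbb %r, %r" after btr: x - x - CF is 0 when CF == 0, and -1
 * (all 32 bits set) when CF == 1, regardless of the old value x. */
static uint32_t sbb_same_same(uint32_t x, uint32_t cf)
{
    return x - x - cf;   /* same as (uint32_t)(0 - cf) */
}
```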
Upvotes: 2