Reputation: 171
I want to write a function that adds two 16-bit integers with overflow detection. I have a generic variant written in portable C, but it is not optimal for an x86 target, because the CPU already computes the overflow flag internally when it executes ADD/SUB/etc. Of course, there is __builtin_add_overflow(), but in my case it generates some boilerplate.
So I wrote the following code:
#include <cstdint>

struct result_t
{
    uint16_t src;
    uint16_t dst;
    uint8_t  of;
};

static void add_u16_with_overflow(result_t& r)
{
    char of, cf;
    asm (
        " addw %[dst], %[src] "
        : [dst] "+mr"(r.dst)//, "=@cco"(of), "=@ccc"(cf)
        : [src] "imr" (r.src)
        : "cc"
        );
    asm (" seto %0 " : "=rm" (r.of) );
}

uint16_t test_add(uint16_t a, uint16_t b)
{
    result_t r;
    r.src = a;
    r.dst = b;
    add_u16_with_overflow(r);
    add_u16_with_overflow(r);
    return (r.dst + r.of); // use r.dst and r.of so that they are not discarded
}
I tried it on https://godbolt.org/g/2mLF55 (gcc 7.2 -O2 -std=c++11), and the result is:
test_add(unsigned short, unsigned short):
        seto    %al
        movzbl  %al, %eax
        addw    %si, %di
        addw    %si, %di
        addl    %esi, %eax
        ret
So seto %0 is reordered. It seems gcc thinks there is no dependency between the two consecutive asm() statements, and the "cc" clobber has no effect on the flags dependency.
I can't use volatile, because seto %0 or the whole function can (and should) be optimized out if the result (or some part of it) is not used.
I can add a dependency on r.dst: asm (" seto %0 " : "=rm" (r.of) : "rm"(r.dst) ); and then the reordering does not happen. But that is not the "right thing", and the compiler can still insert code that changes the flags (but not r.dst) between the add and seto statements.
Is there a way to say "this asm() statement changes some CPU flags" and "this asm() statement uses some CPU flags", to create a dependency between the statements and prevent reordering?
Upvotes: 3
Views: 691
Reputation: 365457
I haven't looked at gcc's output for __builtin_add_overflow, but how bad is it? @David's suggestion to use it, along with https://gcc.gnu.org/wiki/DontUseInlineAsm, is usually good advice, especially if you're worried about how this will optimize. asm defeats constant propagation and some other optimizations.
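For comparison, here is a minimal sketch of the builtin route. The names checked_sum and add_i16_checked are made up for this example and do not come from the question; it assumes a compiler that provides __builtin_add_overflow (gcc 5 or later, or a recent clang). Because no asm is involved, the compiler stays free to constant-propagate through it or drop it when the result is unused.

```cpp
#include <cstdint>

// Hypothetical result type for this sketch (not from the question).
struct checked_sum {
    int16_t sum;       // a + b wrapped to 16 bits
    bool    overflow;  // true if the exact sum does not fit in int16_t
};

// __builtin_add_overflow computes a + b as if with unlimited precision,
// stores the result converted to the type of *res, and returns true when
// the stored value differs from the exact sum.
static inline checked_sum add_i16_checked(int16_t a, int16_t b) {
    checked_sum r;
    r.overflow = __builtin_add_overflow(a, b, &r.sum);
    return r;
}
```

Comparing the code generated for this against the hand-written asm on the same compiler and flags is probably the quickest way to see whether the "boilerplate" actually costs anything.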
Also, if you are going to use asm, note that AT&T syntax uses add %[src], %[dst] operand order (destination last). See the tag wiki for details, unless you're always going to build your code with -masm=intel.
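A small sketch that makes the operand order visible. It uses sub instead of add, because swapping the operands of a subtraction changes the value as well as the destination. The function name sub_u16 is hypothetical, and the non-x86 branch is only there so the sketch still builds on other targets.

```cpp
#include <cstdint>

// Returns dst - src, wrapped to 16 bits.
static inline uint16_t sub_u16(uint16_t dst, uint16_t src) {
#if defined(__x86_64__) || defined(__i386__)
    // AT&T syntax: "subw SRC, DST" computes DST -= SRC (destination last).
    asm("subw %[src], %[dst]"
        : [dst] "+r"(dst)   // read-write operand, kept in a register
        : [src] "rm"(src)   // register or memory source
        : "cc");
#else
    // Portable fallback for non-x86 builds (assumption of this sketch).
    dst = static_cast<uint16_t>(dst - src);
#endif
    return dst;
}
```

This template assumes the default AT&T dialect; built with -masm=intel, the same string would have its operands the wrong way round.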
Is there a way to say "this asm() statement changes some CPU flags" and "this asm() statement uses some CPU flags", to create a dependency between the statements and prevent reordering?
No. Put the flag-consuming instruction (seto) inside the same asm block as the flag-producing instruction. An asm statement can have as many input and output operands as you like (GCC allows up to 30 in total), limited in practice by register-allocation difficulty (but multiple memory outputs can use the same base register with different offsets). Anyway, an extra write-only output on the statement containing the add isn't going to cause any inefficiency.
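A minimal sketch of that approach, with add and seto in a single non-volatile asm statement. It reuses the question's function name but takes and returns values instead of a struct reference; the add_result type is made up for this example, and the non-x86 branch is a fallback added only so the sketch builds elsewhere.

```cpp
#include <cstdint>

// Hypothetical result type for this sketch.
struct add_result {
    uint16_t sum;  // dst + src, wrapped to 16 bits
    uint8_t  of;   // 1 if the add set OF (signed overflow), else 0
};

static inline add_result add_u16_with_overflow(uint16_t dst, uint16_t src) {
    add_result r;
#if defined(__x86_64__) || defined(__i386__)
    // Both instructions live in one statement, so the compiler cannot
    // schedule anything between the flag producer and the flag consumer.
    // No volatile: the whole statement can be dropped if r is unused.
    asm("addw %[src], %[dst]\n\t"
        "seto %[of]"
        : [dst] "+r"(dst), [of] "=qm"(r.of)  // "q": byte-addressable register
        : [src] "rm"(src)
        : "cc");
    r.sum = dst;
#else
    // Portable fallback for non-x86 builds (assumption of this sketch).
    int16_t s;
    r.of = __builtin_add_overflow(static_cast<int16_t>(dst),
                                  static_cast<int16_t>(src), &s);
    r.sum = static_cast<uint16_t>(s);
#endif
    return r;
}
```

The destination is constrained to a register here so that the two operands of addw can never both end up in memory.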
I was going to suggest that if you want multiple flag outputs from one instruction, use LAHF to Load AH from FLAGS. But that doesn't include OF, only the other condition codes. This is often inconvenient and seems like a bad design choice, because there are some unused reserved bits in the low 8 of EFLAGS/RFLAGS, so OF could have been in the low 8 along with CF, SF, ZF, PF, and AF. Since that isn't the case, setc + seto are probably better than pushf / reload, though that is worth considering.
Even if there were syntax for flag inputs (like there is for flag outputs), there would be very little to gain from letting gcc insert some of its own non-flag-modifying instructions (like lea or mov) between your two separate asm statements.
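For reference, a sketch of the flag-output syntax mentioned above (the "=@cco" / "=@ccc" operands that are commented out in the question's code). The names flags_result and add_u16_flags are hypothetical. It assumes a compiler that defines __GCC_ASM_FLAG_OUTPUTS__ (gcc 6 or later on x86) and otherwise falls back to the builtin.

```cpp
#include <cstdint>

// Hypothetical result type for this sketch.
struct flags_result {
    uint16_t sum;  // dst + src, wrapped to 16 bits
    bool     of;   // overflow flag after the add (signed overflow)
    bool     cf;   // carry flag after the add (unsigned overflow)
};

static inline flags_result add_u16_flags(uint16_t dst, uint16_t src) {
    flags_result r;
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    // The compiler reads OF and CF itself after the asm, so no setcc is
    // written by hand and it may branch on the flags directly.
    // No "cc" clobber is listed next to the flag outputs.
    asm("addw %[src], %[dst]"
        : [dst] "+r"(dst), "=@cco"(r.of), "=@ccc"(r.cf)
        : [src] "rm"(src));
    r.sum = dst;
#else
    // Fallback when flag outputs are unavailable (assumption of this sketch).
    int16_t s;
    r.of = __builtin_add_overflow(static_cast<int16_t>(dst),
                                  static_cast<int16_t>(src), &s);
    r.cf = __builtin_add_overflow(dst, src, &r.sum);
#endif
    return r;
}
```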
You don't want them reordered or anything, so putting them in the same asm statement makes by far the most sense. Even on an in-order CPU, add is low latency, so it's not a big bottleneck to put a dependent instruction right after it.
And BTW, a jcc might be more efficient if overflow is an error condition that doesn't happen normally. But unfortunately GNU C asm goto doesn't support output operands (at least not in the gcc 7 used in the question). You could take a pointer input and modify dst in memory (and use a "memory" clobber), but forcing a store/reload sucks more than using setc or seto to produce an input for a compiler-generated test/jnz.
If you didn't also need an output, you could put C labels on a return true and a return false statement, which (after inlining) would turn your code into a jcc to wherever the compiler wanted to lay out the branches of an if(). E.g. see how Linux does it (with extra complicating factors in the two examples I found): setting up to patch the code after checking a CPU feature once at boot, or something with a section for a jump table in arch_static_branch.
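A rough sketch of the label idea for the case where only the yes/no answer is needed. The function name add_i16_would_overflow is made up, and the sum is computed in a clobbered scratch register (AX) precisely because this asm goto has no outputs. The non-x86 branch is a fallback added only so the sketch builds elsewhere.

```cpp
#include <cstdint>

// Returns true if a + b overflows int16_t. The sum itself is discarded.
static inline bool add_i16_would_overflow(int16_t a, int16_t b) {
#if defined(__x86_64__) || defined(__i386__)
    // asm goto is implicitly volatile and takes no output operands here,
    // so the add is done in AX, which is declared as clobbered.
    asm goto("movw %[a], %%ax\n\t"
             "addw %[b], %%ax\n\t"
             "jo %l[overflow]"   // jump to the C label if OF is set
             : /* no outputs */
             : [a] "r"(a), [b] "r"(b)
             : "ax", "cc"
             : overflow);
    return false;
overflow:
    return true;
#else
    // Portable fallback for non-x86 builds (assumption of this sketch).
    int16_t unused;
    return __builtin_add_overflow(a, b, &unused);
#endif
}
```

After inlining into an if(), the compiler can lay out the two return paths as the two sides of the branch, so the jo becomes the only conditional jump.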
Upvotes: 3