Reputation: 1
This simple code should copy the string "c" into the string "d", changing only the first char to 'x':
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char c[5] = "abcd", d[5];
__asm__(
"leal %1, %%ebx\n"
"leal %0, %%ecx\n"
"movb $'x', (%%ecx)\n"
"movb 1(%%ebx), %%al\n"
"movb %%al, 1(%%ecx)\n"
"movb 2(%%ebx), %%al\n"
"movb %%al, 2(%%ecx)\n"
"movb 3(%%ebx), %%al\n"
"movb %%al, 3(%%ecx)\n"
"movb $0, 4(%%ecx)\n"
:"=m"(c)
:"m"(d)
:"%ebx", "%ecx", "%eax"
);
printf("%s\n", d);
return 0;
}
But it gives a segmentation fault error. I believe my problem is with the constraints, but I can't figure out how to fix this.
What is the right way, and how can I change this code to work?
Upvotes: 0
Views: 3037
Reputation: 244732
Yes, the input/output operands are wrong. The format is this:
__asm__("<instructions>\n\t"
: OutputOperands
: InputOperands
: Clobbers);
You have the inputs and outputs backwards. You have c as an output, when it should be an input (since you're reading from it). You have d as an input, when it should be an output (since you're writing c to it).
Thus, your inline assembly should be written as follows:
__asm__("leal %1, %%ebx\n\t"
"leal %0, %%ecx\n\t"
"movb $'x', (%%ecx)\n\t"
"movb 1(%%ebx), %%al\n\t"
"movb %%al, 1(%%ecx)\n\t"
"movb 2(%%ebx), %%al\n\t"
"movb %%al, 2(%%ecx)\n\t"
"movb 3(%%ebx), %%al\n\t"
"movb %%al, 3(%%ecx)\n\t"
"movb $0, 4(%%ecx)"
: "=m" (d)
: "m" (c)
: "%ebx", "%ecx", "%eax"
);
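As a smaller illustration of the operand order, here is a minimal sketch (not part of the original code). The helper name load_byte is made up for this example, and it assumes GCC or Clang targeting x86; on other targets it falls back to plain C. The thing the asm writes goes in the first (output) section, the thing it reads goes in the second (input) section.

```c
/* Hypothetical helper: read one byte from memory via inline asm. */
static char load_byte(const char *p)
{
    char out;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("movb %[in], %[out]"
            : [out] "=q" (out)   /* outputs: written by the asm ("q" = byte-addressable register) */
            : [in]  "m"  (*p)    /* inputs: read by the asm */
            : /* no clobbers */);
#else
    out = *p;                    /* plain C fallback on non-x86 targets */
#endif
    return out;
}
```

Calling load_byte(&c[1]) with c[] = "abcd" should hand back 'b'. Swapping the two operand sections would tell the compiler the opposite of what the instruction does, which is the mistake in the question.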
But you are also not making the most efficient use of the operands. You have several manual load operations (lea) that you've written in assembly. You don't need to write these; that's the whole point of the GNU extended inline assembly syntax: the compiler will generate the necessary load and store instructions for you. Not only does that make the code simpler and easier to write and maintain, but it also makes it more efficient, because the compiler can better schedule/arrange the loads and stores within surrounding code, and skip the lea instructions entirely.
Making these modifications to use operands more efficiently, as well as using names for the operands to make the code easier to read, you would have:
__asm__("movb $'x', (%[d])\n\t"
"movb 1(%[c]), %%al\n\t"
"movb %%al, 1(%[d])\n\t"
"movb 2(%[c]), %%al\n\t"
"movb %%al, 2(%[d])\n\t"
"movb 3(%[c]), %%al\n\t"
"movb %%al, 3(%[d])\n\t"
"movb $0, 4(%[d])"
: "=m" (d) // dummy arg: tell the compiler we write all of d[]
: [c] "r" (c)
, [d] "r" (d)
, "m" (c) // unused dummy arg: tell the compiler we read all of c[]
: "%eax"
);
We're asking the compiler for the pointers to be in registers with the r constraint, so we can still choose the addressing mode (reg+displacement) ourselves, as in your original. This causes the compiler to implicitly generate the two required lea instructions. Not only does this make the code simpler to write, but it also lets the compiler choose which registers it wants to use, which can make the code more efficient. (For example, it needs d in %rdi as an arg for printf. Compiler-generated setup for asm statements is optimized along with normal code, so it doesn't have to repeat this work like it would if you wrote the lea explicitly in asm. Leave as much as possible to the compiler, so it can optimize away when possible.)
Note that asking for a pointer with an r constraint doesn't imply that you dereference it. Thus, we use "m" and "=m" dummy memory operands to tell the compiler what memory is read and written, so it will ensure that the contents match program-order even in a more complex case where your function is inlined into another function that modifies c[] and d[] before and after. This works well because c[] and d[] are true C arrays, with static size. It wouldn't work if they were just pointers that you got from function args. In that case, "=m" (d) would tell the compiler that the asm writes a pointer value into a memory location, not the pointed-to contents. "=m" (*d) would tell the compiler that the asm writes the first byte. As the official docs point out, you could write something ugly using a GNU C statement-expression like:
{"m"( ({ const struct { char x[5]; } *p = (const void *)c ; *p; }) )}
Or, you could instead use a "memory" clobber to tell the compiler that all memory must be in sync. With no output operands at all, the asm block would be implicitly __volatile__, which also prevents reordering. But if you had one unused dummy output to let the compiler choose a scratch register (see below), and didn't manually use __volatile__, then the compiler would prove to itself that you never use the results and optimize out the entire block of inline assembly! (It's better to tell the compiler in as much detail as possible how your asm interacts with C variables, rather than relying on __volatile__.)
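For the pointer case, here is a hedged sketch of how the dummy memory operands could look. It is not from the original answer: copy_with_x is a hypothetical wrapper, it assumes both pointers refer to buffers of at least 5 bytes, and it assumes GCC or Clang on x86 (with a plain C fallback elsewhere). Instead of the statement-expression, it uses a pointer-to-array cast, which as far as I know is the form the current GCC documentation suggests for fixed-size regions.

```c
#include <string.h>

/* Hypothetical wrapper: dst and src are plain pointers to 5-byte buffers. */
static void copy_with_x(char *dst, const char *src)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    char tmp;  /* scratch byte register chosen by the compiler */
    __asm__("movb $'x', (%[d])\n\t"
            "movb 1(%[c]), %[tmp]\n\t"
            "movb %[tmp], 1(%[d])\n\t"
            "movb 2(%[c]), %[tmp]\n\t"
            "movb %[tmp], 2(%[d])\n\t"
            "movb 3(%[c]), %[tmp]\n\t"
            "movb %[tmp], 3(%[d])\n\t"
            "movb $0, 4(%[d])"
            : "=m" (*(char (*)[5]) dst)        /* dummy: all 5 bytes of dst are written */
            , [tmp] "=&q" (tmp)                /* early-clobber: must not share a pointer register */
            : [d] "r" (dst)
            , [c] "r" (src)
            , "m" (*(const char (*)[5]) src)); /* dummy: up to 5 bytes of src are read */
#else
    dst[0] = 'x';                              /* plain C fallback on non-x86 targets */
    memcpy(dst + 1, src + 1, 3);
    dst[4] = '\0';
#endif
}
```

Because the asm has a real memory output that the caller reads afterwards, it does not need __volatile__ or a "memory" clobber to survive optimization.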
Letting the compiler choose the addressing mode will work fine for us. It avoids an extra compiler-generated lea instruction ahead of the asm block, and it simplifies the constraints because we actually use the memory operands instead of separately asking for pointers in registers. (The compiler could still have avoided an lea in the other version if it put c[] or d[] at esp+0, so the pointer-register operand could be esp.)
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %%al\n\t"
"movb %%al, 1 + %[d]\n\t"
"movb 2 + %[c], %%al\n\t"
"movb %%al, 2 + %[d]\n\t"
"movb 3 + %[c], %%al\n\t"
"movb %%al, 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d) // not sure if early-clobber is needed,
// e.g. if the compiler would otherwise be allowed to put an output memory operand at the same address as an input operand.
// It's an error with gcc 4.7 and earlier, but gcc that old also doesn't accept "m"(c) as an input memory operand
: [c] "m" (c)
: "%eax"
);
See also "Looping over arrays with inline assembly" for more discussion of picking addressing mode yourself vs. using "m" constraints to let the compiler pick. (If you don't want to get into that level of optimization, you probably shouldn't be using inline asm in the first place.)
The compiler will turn 3 + %[c] into something like 3 + 6(%rsp), which the assembler will evaluate the same as 9(%rsp). Fortunately, it's not a syntax error if the substitution ends up producing 3 + (%rdi). (You do get a warning, though: Warning: missing operand; zero assumed.)
It would also be correct to use an "o" constraint to request an "offsettable" memory operand, but all x86 addressing modes are offsettable (you can add a compile-time-constant displacement and they're still valid), so "m" should always work. (It would be nice if "o" would add an explicit 0 to avoid the assembler warning, but it doesn't.)
But we're not done with possible optimizations yet. We're still forcing the compiler to clobber the eax register when we don't actually need to use that one; any general-purpose register will do. So, we introduce another output, this time a write-only (but early-clobber) temporary stored in a register:
char temp;
__asm__("movb $'x', %[d]\n\t"
"movb 1 + %[c], %[temp]\n\t"
"movb %[temp], 1 + %[d]\n\t"
"movb 2 + %[c], %[temp]\n\t"
"movb %[temp], 2 + %[d]\n\t"
"movb 3 + %[c], %[temp]\n\t"
"movb %[temp], 3 + %[d]\n\t"
"movb $0, 4 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
: // no clobbers
);
The early-clobber is necessary to stop the compiler from choosing a register that is also used in the addressing modes for c or d. The asm syntax is designed to efficiently wrap a single instruction which reads all its inputs before writing any of its outputs, so without the early-clobber the compiler assumes it may reuse an input register for an output.
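To see why the early-clobber matters in a multi-instruction block, here is a small sketch (add_loaded is a hypothetical function, not from the answer; it assumes GCC or Clang on x86 and falls back to C elsewhere). The first instruction already writes the output, while the second one still has to read an input through its address register.

```c
/* Hypothetical example: sum two ints loaded from memory. */
static int add_loaded(const int *a, const int *b)
{
    int sum;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("movl %[a], %[sum]\n\t"   /* writes the output first ...                */
            "addl %[b], %[sum]"       /* ... while the address of *b is still needed */
            : [sum] "=&r" (sum)       /* '&' keeps sum out of the registers that address *a and *b */
            : [a] "m" (*a)
            , [b] "m" (*b)
            : "cc");                  /* addl modifies the flags */
#else
    sum = *a + *b;                    /* plain C fallback on non-x86 targets */
#endif
    return sum;
}
```

With a plain "=r" instead of "=&r", the compiler would be allowed to put sum in the same register that holds the address of *b, and the movl would destroy that address before the addl used it.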
Okay, we've made the interface between the inline assembly block and the surrounding compiler-generated code pretty much optimal—but let's look at the actual assembly language instructions we're using inside of it. These are far from optimal: we're writing one byte at a time when we could be writing four bytes at a time! (And, on a 64-bit build, we could be writing eight bytes at a time, but that wouldn't help us here.) So, let's just do:
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl 1 + %[c], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d)
, [temp] "=&r" (temp)
: [c] "m" (c)
:
);
This writes the first byte (an 'x' character) into d, and then copies 4 bytes starting at c[1] into d, starting at d[1]. That will include the terminating NUL character from c (automatically appended to string literals by a C or C++ compiler), so the string in d is already NUL-terminated without needing to append an additional byte.
Shorter and faster, except for the store-forwarding stall from reading the last 4 bytes of c[] right after the compiler-generated initialization code stored the first 4 bytes and then a separate byte store of the terminating 0. You wouldn't have this problem if you used static const char c[] = "abcd"; (because then it would be in static storage instead of stored to the stack with mov-immediate every time the function runs), or if c[] was a function arg that probably wasn't just written. Out-of-order execution can hide the store-forwarding stall latency, so it's probably worth it if c[] is usually not just-written.
Notice that we are not reading from the first character of c; we just offset it as part of the movl instruction. We could tell the compiler about that to allow it to optimize by moving stores to c[0] across the asm statement. We could even ask for a [cplus1] "r" (&c[1]) input operand, which would be good if we needed the address in a register. (See the original version of this answer for that.)
Since it's exactly 4 bytes, we can cast to a 4-byte integer type, rather than defining a struct with a char[4] member or something. Remember that a memory operand refers to a value in memory, so you have to dereference a pointer. Arrays are a special case: "m" (c) references the 5-byte contents of c[], not the 4 or 8-byte pointer value. But as soon as we start casting, we just have a pointer. Even a function argument like int foo(const char c[static 5]) works like a char*, not a char [5]. Anyway, the *(const uint32_t*)&c[1] is 4 bytes in memory from c[1] to c[4]. GCC warns about strict-aliasing with that cast, so maybe a struct { char c[4]; } would be better. (gcc8-snapshot 20170628 doesn't warn. Maybe the code is fine, or maybe the warning is broken in that unstable gcc version.)
// tightest constraints possible: 4 byte input memory operand, 5 byte output operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl %[cplus1], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d) // references the contents of the whole array, not the pointer-value or just d[0]
, [temp] "=&r" (temp)
: [cplus1] "m" (*(const uint32_t*)&c[1])
:
);
The code is looking pretty good now. Here's the code for the full function, as generated by GCC 6.3 on the Godbolt Compiler Explorer (with -O3 -m32 to generate 32-bit code like in the question):
subl $40, %esp
movl $1684234849, 18(%esp) # store 'abcd' into c
movb $0, 22(%esp) # NUL-terminate c
# begin inline-asm block
movb $'x', 23(%esp) # write initial 'x' into d[0]
movl 19(%esp), %eax # get 4 characters starting at c[1]
movl %eax, 1 + 23(%esp) # write those 4 characters into d, starting at d[1]
# end inline-asm block
leal 23(%esp), %eax # load address of d[0] into EAX register
pushl %eax # push address of d[0] onto stack
call puts # call 'puts' to output string. printf("%s\n", d) optimizes to this.
xorl %eax, %eax
addl $44, %esp
ret
gcc decides to save a register by delaying the lea until after the asm block. With -m64, it does lea before the asm, but it still uses a stack-pointer address instead of the register it just set up. That lets the loads/stores run without waiting for the lea to execute, but it also wastes code-size. Since lea is fast, it's not what I'd do if writing by hand.
The "r" constraint version uses two separate subl instructions to reserve stack space: subl $28, %esp before initializing c[], and subl $12, %esp right before the asm block. This is just a missed optimization by the compiler, unlike the extra lea which is unavoidable.
Notice that this is much much worse than the asm you'd get from the much more sensible:
d[0] = 'x';
memcpy(&d[1], &c[1], 4);
In that case, c[] optimizes away entirely and you get almost the same code that char d[] = "xbcd"; would produce. (See test_memcpy() in the Godbolt link above.) The inline-asm version is only useful as an example or template for wrapping other memory-to-memory instruction sequences.
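For reference, the plain C alternative can be written as a small self-contained function. This is only a sketch: copy_with_x_c is a made-up name (I don't have the exact test_memcpy() source from the Godbolt link), and because it takes its buffers as parameters, the "c[] optimizes away" effect only shows up once it is inlined into a caller with a constant source.

```c
#include <string.h>

/* Hypothetical pure-C version: d gets 'x' followed by c[1]..c[4]. */
static void copy_with_x_c(char d[5], const char c[5])
{
    d[0] = 'x';
    memcpy(&d[1], &c[1], 4);  /* copies c[1]..c[4], including the terminating NUL */
}
```

With char c[5] = "abcd", d[5]; a call to copy_with_x_c(d, c) leaves "xbcd" in d, and the compiler is free to fold the whole thing into constant stores.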
So how do we test that we got all the constraints right, allowing the compiler to optimize as far as correctness allows but no further? In this case, storing into c[] and d[] before and after the asm statement provides a good check. Recent gcc versions really will combine those stores into a single store either before or after if the constraints allow it. (clang won't, though.)
int optimize_test(void) {
// static // const
char c[5] = "abcd";
char d[5];
c[3] = 'O'; // **not** optimized away: part of the 32-bit input memory operand
c[0] = '0'; // merged with the c[0]='1' after the asm, because the asm doesn't read this part of c[]
d[3] = 'E'; // optimized away because the whole d[] is an output-only operand
unsigned int temp;
__asm__("movb $'x', %[d]\n\t"
"movl %[cplus1], %[temp]\n\t"
"movl %[temp], 1 + %[d]"
: [d] "=&m" (d) // references the contents of the whole array, not the pointer-value or just d[0]
, [temp] "=&r" (temp)
: [cplus1] "m" (*(const uint32_t*)&c[1])
:
);
c[0] = '1'; // these dead stores into c[] are not optimized away, for some reason. (Even with memcpy instead of an asm statement.)
c[3] = 'M';
d[3] = 'D';
printf("%s\n", d);
return 0;
}
There are a couple of additional tweaks that you could do with the inline assembly. For example, the early-clobber on temp tells the compiler that it cannot re-use one of the input registers for the temp register, but it actually could. But these are all pretty subtle. If you actually cared about getting the best possible code from the compiler, you'd write the above code in C like I just showed.
There are many reasons not to use inline assembly, including performance: you'll probably just defeat the compiler's ability to optimize. If the compiler isn't doing a good job somewhere (for a specific compiler version for a specific target architecture), often you can coax it into making better assembly by just changing the C source, without resorting to inline asm. (Although it's often possible for an expert that really knows what they're doing to beat the compiler, this often requires writing the entire loop in asm and requires a significant investment in time. And if you don't know what you're doing, you can easily make it slower.)
If you're interested in learning assembly language, you should be using an assembler to write the code, not a C compiler. This is all just busy-work! It took me way too long to write this answer, and I had to get help from other experts to ensure that I got all of the constraints and clobbers precisely correct so as to cause optimal code to be generated, and I know what I'm doing! This would have been a 2-minute task in assembly:
lea eax, DWORD PTR [d]
lea edx, DWORD PTR [c+1]
mov BYTE PTR [eax], 'x'
mov edx, DWORD PTR [edx]
mov DWORD PTR [eax+1], edx
…and you can easily verify that it is correct!
Extra notes from @PeterCordes: If we can assume that these strings are constants/literals, then this would actually be much better:
mov DWORD PTR [d], 'xbcd' ; 0x64636278
mov BYTE PTR [d+4], 0
where d can be any addressing mode, for example [esp+6]. If we just want to pass the string to a function, writing in pure asm lets us do things like this that the compiler wouldn't, giving excellent code size and performance:
push 0 ; includes 3 extra bytes of 0 padding, but gcc was leaving more than that already
push 'xbcd' ; ESP is now pointing to the string data we just pushed
push esp ; pushes the old value. (push stack-pointer costs 1 extra uop on Intel CPUs, and AMD Ryzen, but the LEA or MOV we avoid would also be a uop).
call puts
Making the compiler store into c[] and then reloading that inside the asm statement is just silly. You could avoid that by passing in the data as a 4-byte integer with an "ri" constraint. Or maybe using if (__builtin_constant_p(data)) { } else { } to branch on whether the data was a compile-time constant or not.
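A rough sketch of the "ri" idea, under stated assumptions: store_x_then and its parameter names are hypothetical, the caller packs the 4 bytes that follow 'x' into an integer in little-endian order, d points to at least 5 writable bytes, and the compiler is GCC or Clang targeting x86 (with a C fallback elsewhere that relies on a little-endian target).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: d[0] = 'x', d[1..4] = the 4 bytes packed in tail. */
static inline void store_x_then(char *d, uint32_t tail)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("movb $'x', (%[d])\n\t"
            "movl %[tail], 1(%[d])"
            : "=m" (*(char (*)[5]) d)   /* dummy: all 5 bytes of d are written */
            : [d] "r" (d)
            , [tail] "ri" (tail));      /* register, or an immediate if tail is a constant */
#else
    d[0] = 'x';                         /* fallback; assumes a little-endian target */
    memcpy(d + 1, &tail, 4);
#endif
}
```

For the string in the question, the tail is 'b', 'c', 'd', NUL, i.e. 0x00646362, so store_x_then(d, 0x00646362u) should leave "xbcd" in d without ever materializing c[] in memory.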
If the contents of c[] aren't supposed to be a compile-time constant, and if we can assume an offset load from c[] won't cause a store-forwarding stall, the general idea of Cody's final version is good:
lea rdi, [d] ; or "mov edi, OFFSET d" if you don't need a 64-bit RIP-relative LEA for PIC code
mov edx, DWORD PTR [c+1] ; load before store to avoid any potential false dependency
mov BYTE PTR [rdi], 'x'
mov DWORD PTR [rdi+1], edx
The lea is only worth it if we need d's address in a register afterwards (which we do in this case for printf / puts). Otherwise it's better to just use [d] and [d+1], even if the addressing mode needs a 32-bit displacement. (It doesn't in this case, since c and d are both on the stack.)
Or, if there's padding after d[], and targeting 64-bit, we could load 8 bytes from c (if you know that the load won't cross into another page; a cache-line split on the load or store might also make this not worth it for perf reasons):
lea rdi, [d]
mov rdx, QWORD PTR [c]
mov QWORD PTR [rdi], rdx
mov BYTE PTR [rdi], 'x' ; overlapping store: rewrite the first byte
On some CPUs, e.g. Intel since Ivy Bridge, this will be good even if c[] was just written (avoids the store-forwarding stall):
mov edx, DWORD PTR [c]
mov dl, 'x' ; modify the low byte. reading edx later will cause a partial-reg stall on older Intel CPUs
mov BYTE PTR [d+4], 0
mov DWORD PTR [d], edx
There are other ways to replace the first byte, e.g. AND and OR, which avoid problems on older Intel CPUs.
This has the advantage that reading multiple bytes at once from the start of d[] won't suffer a store-forwarding stall, since the first 4 bytes are written with a store aligned to the start of d[].
Combining both previous ideas:
mov rdx, QWORD PTR [c]
mov dl, 'x'
mov QWORD PTR [d], rdx
As usual, the optimal choice strongly depends on context (surrounding code), and on target CPU microarchitecture (Nehalem vs. Skylake vs. Silvermont vs. Bulldozer vs. Ryzen ...)
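The "load the whole word, patch the low byte with AND and OR, store it back" variant mentioned above can also be expressed in portable-looking C. This is a sketch, not a recommendation: copy_with_x_word is a hypothetical name, and the byte-patching assumes a little-endian target such as x86, where the low byte of the loaded word is c[0].

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical word-at-a-time version; assumes a little-endian target. */
static void copy_with_x_word(char d[5], const char c[5])
{
    uint32_t w;
    memcpy(&w, c, 4);                 /* one 4-byte load from the start of c[] */
    w = (w & ~(uint32_t)0xFF) | 'x';  /* AND/OR replaces the low byte, i.e. c[0] on little-endian */
    memcpy(d, &w, 4);                 /* one 4-byte store to the start of d[] */
    d[4] = '\0';                      /* terminate, like the separate byte store in the asm */
}
```

The memcpy calls are the usual way to do unaligned, aliasing-safe word accesses in C; compilers typically turn each into a single mov on x86, though the exact output depends on the compiler and flags.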
Upvotes: 2
Reputation: 64
First of all, your string-copy code did not raise an exception when I built it with gcc and ran it on my Windows PC. However, the string copy was not happening, because your code appears to assume that register ecx points to variable d, when it actually points to variable c. The following code copies the contents of c to d, then replaces the first character of d with x. Try compiling it with gcc.
#include <stdio.h>
#include <stdlib.h>
int main(void) {
char c[5] = "abcd", d[5];
__asm__(
"leal %1, %%ebx\n"
"leal %0, %%ecx\n"
"movb (%%ecx), %%al\n"
"movb %%al, (%%ebx)\n"
"movb 1(%%ecx), %%al\n"
"movb %%al, 1(%%ebx)\n"
"movb 2(%%ecx), %%al\n"
"movb %%al, 2(%%ebx)\n"
"movb 3(%%ecx), %%al\n"
"movb %%al, 3(%%ebx)\n"
"movb 4(%%ecx), %%al\n"
"movb %%al, 4(%%ebx)\n"
"movb $'x', (%%ebx)\n"
:"=m"(c)
:"m"(d)
:"%ebx", "%ecx", "%eax"
);
printf("String d is: %s\n", d);
printf("String c remains: %s\n", c);
return 0;
}
When using the MinGW gcc compiler on a Windows PC, the following output is produced:
> gcc testAsm.c
> .\a.exe
String d is: xbcd
String c remains: abcd
Upvotes: 0