Reputation: 4767
Following up on my two previous questions, "Import constants in x86 with gas" (about importing constants) and "Why does this program loop?" (about importing functions), I was wondering whether the following example accurately summarizes how to do imports in assembly using as:
# constants.s
SYS_EXIT = 60
SYS_WRITE = 1
STDOUT_FILENO = 1
# utils.s
.include "constants.s"
# Global function
.globl print_string
print_string:
call get_string_length
mov %eax, %edx
mov %rdi, %rsi
mov $1, %edi
mov $SYS_WRITE, %eax
syscall
ret
# Local function (for now)
get_string_length:
mov $0, %eax # string length goes in rax
L1_get_string_length:
cmpb $0, (%rdi,%rax) # byte compare; the explicit size suffix avoids an ambiguous operand size
je L2_get_string_length
inc %eax
jmp L1_get_string_length
L2_get_string_length:
ret
# file.s
.include "constants.s"
.data
str: .string "Hellllloooo"
.text
.globl _start
_start:
mov $str, %rdi
call print_string
mov $0, %edi
mov $SYS_EXIT, %eax
syscall
If my understanding is correct, then:
To import functions, they need to be declared .globl to be accessible to other object files during linking. Both object files need to be linked together, for example: ld file.o utils.o -o file
To import constants, use .include "filename". This essentially copies and pastes the contents of the included file at the point of the directive. We do not need to link that file, or do anything else beyond writing the .include statement.
Does it matter if multiple files use the same include statement?
Does .include take a standard Unix path? For example, could I write .include "../constants.s" or .include "/home/constants.s"?
Upvotes: 3
Views: 1319
Reputation: 58162
Here are four possible ways to "import constants from a file".
1. .include and = (uses gas only)
constants.inc:
ANSWER_TO_LIFE = 0x42
code.s:
.include "constants.inc"
mov $ANSWER_TO_LIFE, %eax
add $ANSWER_TO_LIFE, %ebx # best encoding
mov $(ANSWER_TO_LIFE+17), %ecx
mov $(ANSWER_TO_LIFE*ANSWER_TO_LIFE), %edx
Building:
as -o code.o code.s # or gcc -c code.s
ld -o prog code.o code2.o # or gcc -o prog code.o code2.o
This is the most straightforward approach, using only the features of the GNU assembler itself. I have named the include file .inc instead of .s to indicate that it is meant to be included into other assembly source files, not assembled on its own (doing so would produce an object file containing nothing). You can include it into as many different files as need the constant, and both relative and absolute paths are supported (.include "../include/constants.inc" and .include "/usr/share/include/constants.inc" both work).
Since the assembler knows the value of the constant, it can choose the best instruction encodings. For instance, the x86 add $imm, %reg32 instruction has two possible encodings: a 6-byte encoding with a 32-bit immediate operand (opcode 0x81), and a smaller 3-byte encoding with an 8-bit sign-extended immediate operand (opcode 0x83). Since 0x42 fits in 8 bits, the latter is available here, so add $0x42, %ebx can be encoded in three bytes as 83 c3 42. The example also shows that we can perform arbitrary arithmetic on the constant at assembly time.
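To make the encoding choice concrete, here is a minimal C sketch of the decision described above for add $imm, %reg32. The function name encode_add_imm_r32 and its register-number argument are hypothetical, made up for illustration. This is not how gas is implemented, and it ignores details such as the shorter accumulator-only form that exists for %eax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode `add $imm, %reg32` for a 32-bit register
   numbered 0-7 (eax=0, ecx=1, edx=2, ebx=3, ...).
   Writes the bytes to out and returns the instruction length. */
size_t encode_add_imm_r32(unsigned reg, int32_t imm, uint8_t out[6])
{
    assert(reg < 8); /* this sketch does not handle REX-extended registers */

    /* ModRM: mod=11 (register direct), reg field /0 selects ADD, rm=reg */
    uint8_t modrm = (uint8_t)(0xC0 | reg);

    if (imm >= -128 && imm <= 127) {
        /* Short form: opcode 0x83 with a sign-extended 8-bit immediate */
        out[0] = 0x83;
        out[1] = modrm;
        out[2] = (uint8_t)imm;
        return 3;
    }

    /* Long form: opcode 0x81 with a 32-bit little-endian immediate */
    out[0] = 0x81;
    out[1] = modrm;
    uint32_t u = (uint32_t)imm;
    for (int i = 0; i < 4; i++)
        out[2 + i] = (uint8_t)(u >> (8 * i));
    return 6;
}
```

Under these assumptions, encode_add_imm_r32(3, 0x42, buf) produces the three bytes 83 c3 42 for %ebx, while an immediate such as 0x1234 falls through to the six-byte 0x81 form.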
2. #include and #define (uses the C preprocessor)
constants.h:
#define ANSWER_TO_LIFE 0x42
code.S:
#include "constants.h"
mov $ANSWER_TO_LIFE, %eax
add $ANSWER_TO_LIFE, %ebx # also gets best encoding
mov $(ANSWER_TO_LIFE*ANSWER_TO_LIFE), %ecx
Building:
gcc -c code.S # can't use as by itself here
ld -o prog code.o code2.o # or gcc if you prefer
In this approach, you run the C preprocessor cpp on the source file before giving it to the assembler. The gcc command will do this for you if you name the source file with .S (note case sensitivity). Then C-style #include and #define directives are expanded, so the assembler only sees mov $0x42, %eax without any indication that the constant ever had a name.
This approach has the advantage that the file constants.h can equally well be included into C code, which is helpful in the very common situation where your project mixes C and assembly source. As such, it is the approach I've most commonly seen "in the wild". (Practically no real-life programs are written entirely in assembly.)
In your original use case, where the constant in question was a Linux system call number, this approach is best because the relevant include file has already been written by the kernel developers, and you can get it with #include <asm/unistd.h>. This file defines all the system call numbers with macro names of the form __NR_exit.
3. Global symbol defined in a separate object file (resolved by the linker)
constants.s:
.global ANSWER_TO_LIFE
ANSWER_TO_LIFE = 0x42
code.s:
mov $ANSWER_TO_LIFE, %eax
add $ANSWER_TO_LIFE, %ebx # not the optimal encoding
mov $(ANSWER_TO_LIFE+17), %ecx
#mov $(ANSWER_TO_LIFE*ANSWER_TO_LIFE), %edx # error
Building:
as -o constants.o constants.s # or gcc -c constants.s
as -o code.o code.s # etc
ld -o prog constants.o code.o code2.o # or gcc
This is the approach mentioned by @fuz in comments. It treats the symbol ANSWER_TO_LIFE like a label that happens to be located at absolute address 0x42. The assembler treats it as it would any other label; it doesn't know its address at assembly time, so it leaves it as an unresolved reference in the object file code.o, which the linker will eventually resolve.
The only real benefit of this approach that I can see is that if we want to change the value of the constant, say to 0x43, we don't have to re-run the assembler on all our source files code.s, code2.s, and so on; we only have to re-assemble constants.s and re-link. So we save a little bit of build time, but not much, because assembling code is usually pretty fast anyway. (It might make a difference if we referenced the symbol from C or C++ code, for which compilation is slower, but see below.)
But there are some notable disadvantages:
Since the assembler doesn't know the value of the constant, it has to assume it could be of the largest size valid for each instruction in which it is used. In particular, in add $ANSWER_TO_LIFE, %ebx, it can't assume that the 8-bit 0x83 encoding will be usable, so it has to select the larger 32-bit encoding. The instruction therefore has to be assembled as 81 c3 00 00 00 00, where the 00 00 00 00 is replaced by the linker with the correct value 42 00 00 00. But we end up using 6 bytes on an instruction that ideally could have been encoded using 3 bytes.
On the flip side of this, immediate mov into a 64-bit register also has two encodings: one taking a sign-extended 32-bit immediate, mov $imm32, %reg64 (opcode c7 with REX.W prefix), which is 7 bytes, and another taking a full 64-bit immediate, mov $imm64, %reg64 (opcodes b8-bf with REX.W), which is 10 bytes. The assembler will by default select the 32-bit form, because the 64-bit one is really long and rarely needed. But if it turns out that your symbol has a value that doesn't fit in 32 bits, you'll get an error at link time ("relocation truncated to fit"), and you'll have to go back and force the 64-bit encoding by using the mnemonic movabs. If you had used approaches 1 or 2, the assembler would have known the value of your constant and would have selected the appropriate encoding in the first place.
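The choice an assembler can make when it does know the value might be sketched in C as follows. The names fits_simm32 and encode_mov_imm_r64 are hypothetical, and the sketch only models the two encodings discussed here (REX.W c7 and REX.W b8+reg); a real assembler has further options and checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if value can be represented as a sign-extended 32-bit immediate. */
bool fits_simm32(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}

/* Hypothetical helper: encode an immediate move into a 64-bit register
   numbered 0-7 (rax=0, rcx=1, ...). Returns the instruction length. */
size_t encode_mov_imm_r64(unsigned reg, int64_t value, uint8_t out[10])
{
    assert(reg < 8); /* registers r8-r15 need extra REX bits, not modeled here */

    uint64_t u = (uint64_t)value;
    out[0] = 0x48; /* REX.W prefix: 64-bit operand size */

    if (fits_simm32(value)) {
        /* 7-byte form: c7 /0 with a sign-extended 32-bit immediate */
        out[1] = 0xC7;
        out[2] = (uint8_t)(0xC0 | reg);
        for (int i = 0; i < 4; i++)
            out[3 + i] = (uint8_t)(u >> (8 * i));
        return 7;
    }

    /* 10-byte form (movabs): b8+reg with a full 64-bit immediate */
    out[1] = (uint8_t)(0xB8 + reg);
    for (int i = 0; i < 8; i++)
        out[2 + i] = (uint8_t)(u >> (8 * i));
    return 10;
}
```

With the value 0x42 the sketch picks the 7-byte form; with 0x100000000, which does not fit in a sign-extended 32-bit immediate, it needs the 10-byte form. When the value is only known to the linker, that decision has already been made by the time the value turns up.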
If we want to do build-time arithmetic on the constant, we're limited to whatever arithmetic can be represented as relocations in the object file. Constant offsets work, so mov $(ANSWER_TO_LIFE+17), %ecx is okay; the object file tells the linker to fill in the relevant bytes with the value of the symbol ANSWER_TO_LIFE plus the constant 17. (For actual labels, you'd want this for something like accessing a member of a static struct.) But more general operations like multiplication aren't supported, because people wouldn't normally want to do those on addresses, so mov $(ANSWER_TO_LIFE*ANSWER_TO_LIFE), %edx results in an error from the assembler. If we need the square of the answer to life, we have to write a mul instruction to compute it at run time, which would be no fun if this is code that is called frequently and needs to be fast.
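As a rough illustration of why "symbol plus constant" works but a product does not, here is a small C sketch of what a linker does for a 32-bit absolute relocation: it writes symbol value plus addend into a placeholder field. The function apply_abs32 is a hypothetical name, and the model is simplified (one relocation kind, unsigned 32-bit range check only).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: patch the 32-bit little-endian field at code[offset]
   with symbol_value + addend. Returns false when the result does not fit in
   an unsigned 32-bit field, loosely mirroring "relocation truncated to fit".
   The only arithmetic available is one symbol plus one constant; there is
   nowhere to record "symbol times symbol". */
bool apply_abs32(uint8_t *code, size_t offset, uint64_t symbol_value, int64_t addend)
{
    assert(code != NULL);

    uint64_t result = symbol_value + (uint64_t)addend;
    if (result > UINT32_MAX)
        return false;

    for (int i = 0; i < 4; i++)
        code[offset + i] = (uint8_t)(result >> (8 * i));
    return true;
}
```

Patching the placeholder in 81 c3 00 00 00 00 at offset 2 with the symbol value 0x42 and addend 0 gives 81 c3 42 00 00 00, matching the add example. With addend 17 the field becomes 0x53, which corresponds to the ANSWER_TO_LIFE+17 case.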
The constant can also be accessed from C code linked into our project, but it has to be treated like a label (address of a variable), which makes it look weird. We have to write something like
extern void *ANSWER_TO_LIFE;
printf("The answer is %lu\n", (unsigned long)&ANSWER_TO_LIFE);
If we try to write something more natural-looking like
extern unsigned long ANSWER_TO_LIFE;
printf("The answer is %lu\n", ANSWER_TO_LIFE);
the program will try to fetch the value from memory address 0x42, which will crash.
(Also, even in the first example, the assembly output by the compiler uses the mov mnemonic, which again leads to the assembler selecting a 32-bit move. If ANSWER_TO_LIFE were larger than 2^32, then linking would fail, and this time it's not as easy to fix. AFAIK you'd need to give gcc an appropriate option to tell it to change its code model, which would cause every address load to use the less efficient 64-bit form, and you'd have to do this for your entire program.)
4. Variable stored in memory
constants.s:
.section .rodata
.global answer_to_life
answer_to_life:
.int 0x42
code.s:
mov answer_to_life, %eax
add answer_to_life, %ebx
# mov answer_to_life+17, %ecx # wrong: loads from address answer_to_life+17, no instruction adds 17 to the loaded value
mov answer_to_life, %ecx
add $17, %ecx # needs two instructions
# mov answer_to_life*answer_to_life, %edx # not valid
mov answer_to_life, %eax
mul %eax # clobbers %edx
Building:
as -o constants.o constants.s
as -o code.o code.s
ld -o prog constants.o code.o code2.o
This approach is the equivalent of having const int answer_to_life = 0x42; in a C program (though C++ is different). The value 0x42 is stored in our program's memory, and whenever we need to access it, we need an instruction that reads from memory; we can no longer encode it as an immediate within each instruction. This will typically be slower to execute. If we need to do any arithmetic on it, we have to write code to load it into a register and execute appropriate instructions at run time, which takes cycles and code space.
I've changed the name here to lower case to match the convention for variables located in memory, as opposed to "compile time" constants, which this no longer is. Also note the different syntax in the instructions; mov answer_to_life, %eax, without the $ sign, is a load from memory instead of an immediate move. $answer_to_life in this example gives you the address of the variable instead (which by a happy coincidence is 0x402000 in my test program). If you want to be able to build a position-independent executable, which is the norm for modern Linux programs, you need to write answer_to_life(%rip) instead.
For the reasons noted above, this approach is not ideal for numerical constants that truly are known at compile time, but I include it for completeness and because you asked about it in comments.
Upvotes: 5