Proper way to do imports with gas

Question

Following up on my two previous questions, Import constants in x86 with gas (about importing constants) and Why does this program loop? (about importing functions), I was wondering whether the following example accurately summarizes how to do imports in assembly using as:

# constants.s
SYS_EXIT        = 60
SYS_WRITE       = 1
STDOUT_FILENO   = 1

# utils.s
.include "constants.s"

# Global function
.globl print_string
print_string:

    call get_string_length
    mov %eax, %edx
    mov %rdi,       %rsi
    mov $1,         %edi
    mov $SYS_WRITE, %eax
    syscall
    ret

# Local function (for now)
get_string_length:
    mov $0, %eax # string length goes in rax
  L1_get_string_length:
    cmpb $0, (%rdi, %rax)   # byte compare; a plain cmp has an ambiguous operand size here
    je L2_get_string_length
    inc %eax
    jmp L1_get_string_length
  L2_get_string_length:
    ret

# file.s
.include "constants.s"

.data
str:    .string "Hellllloooo"

.text
.globl _start
_start:
    mov $str,   %rdi
    call print_string
    mov $0, %edi
    mov $SYS_EXIT, %eax
    syscall

If my understanding is correct then:

Functions need to be made .globl to be accessible to other object files during linking. Both those object files need to be linked together, for example: ld file.o utils.o -o file.
Definitions or macros can be imported/included using .include "filename". What this does is essentially copy/paste the contents of that included file into where that directive is. We do not need to link -- or do anything additional beyond -- the .include statement of that file. Does it matter if multiple files use that same include statement?
Any other things I may be missing or tips with imports, includes, etc? Does .include take a standard unix path, for example I could do: .include "../constants.s" or .include "/home/constants.s"?

Nate Eldredge · Accepted Answer

Here are four possible ways to "import constants from a file".

1. Using `.include` and `=` (uses gas only)

constants.inc:

    ANSWER_TO_LIFE = 0x42

code.s:

    .include "constants.inc"
    mov $ANSWER_TO_LIFE, %eax
    add $ANSWER_TO_LIFE, %ebx   # best encoding
    mov $(ANSWER_TO_LIFE+17), %ecx
    mov $(ANSWER_TO_LIFE*ANSWER_TO_LIFE), %edx

Building:

as -o code.o code.s         # or gcc -c code.s
ld -o prog code.o code2.o   # or gcc -o prog code.o code2.o

This is the most straightforward approach using only the features of the GNU assembler itself. I have named the include file .inc instead of .s to indicate that it is meant to be included into other assembly source files, but not assembled on its own (as it would produce an object file containing nothing). You can include it into as many different files as need to use the constant, and relative or absolute paths are supported (.include "../include/constants.inc" and .include "/usr/share/include/constants.inc" both work).

Since the assembler knows the value of the constant, it can choose the best instruction encodings. For instance, the x86 add $imm, %reg32 instruction has two possible encodings: a 6-byte encoding with a 32-bit immediate operand (opcode 0x81), and a smaller 3-byte encoding with an 8-bit sign-extended immediate operand (opcode 0x83). Since 0x42 fits in 8 bits, the latter is available here, so add $0x42, %ebx can be encoded in three bytes as 83 c3 42. The example also shows that we can perform arbitrary arithmetic on the constant at assembly time.
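As a small sketch of that encoding choice (the immediates -1 and 0x80 are extra illustrative values, not part of the example above), the bytes in the comments are what the assembler emits for each form; running objdump -d on the object file shows them:

```asm
    .text
    add $0x42, %ebx     # 83 c3 42            imm8 form: 0x42 fits in a signed byte
    add $-1,   %ebx     # 83 c3 ff            imm8 form: 0xff sign-extends to -1
    add $0x80, %ebx     # 81 c3 80 00 00 00   imm32 form: 128 is outside the signed 8-bit range
```

The last line is the case to watch: an immediate only qualifies for the short form if it fits in a signed byte, so 0x80 already needs the 6-byte encoding.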

2. Using the C preprocessor (most common in practice)

constants.h:

#define ANSWER_TO_LIFE 0x42

code.S:

#include "constants.h"
    mov $ANSWER_TO_LIFE, %eax
    add $ANSWER_TO_LIFE, %ebx   # also gets best encoding
    mov $(ANSWER_TO_LIFE*ANSWER_TO_LIFE), %ecx

Building:

gcc -c code.S              # can't use as by itself here
ld -o prog code.o code2.o  # or gcc if you prefer

In this approach, you run the C preprocessor cpp on the source file before giving it to the assembler. The gcc command will do this for you if you name the source file with .S (note case sensitivity). Then C-style #include and #define directives are expanded, so the assembler only sees mov $0x42, %eax without any indication that the constant ever had a name.

This approach has the advantage that the file constants.h can equally well be included into C code, which is helpful in the very common situation when your project mixes C and assembly source. As such, it is the approach I've most commonly seen "in the wild". (Practically no real-life programs are written entirely in assembly.)

In your original use case, where the constant in question was a Linux system call number, this approach is best because the relevant include file has already been written by the kernel developers, and you can get it with #include <asm/unistd.h>. This file defines all the system call numbers with macro names of the form __NR_exit.
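A minimal sketch of this, assuming x86-64 Linux, that the header meant here is <asm/unistd.h>, and a hypothetical file name exit.S:

```asm
/* exit.S (hypothetical name); build with: gcc -c exit.S && ld -o exit exit.o */
#include <asm/unistd.h>     /* provides __NR_exit, __NR_write, ... */

    .text
    .globl _start
_start:
    xor %edi, %edi          /* exit status 0 */
    mov $__NR_exit, %eax    /* the preprocessor substitutes the system call number */
    syscall
```

The comments use C-style /* */ because a # at the start of a line in a .S file can be mistaken for a preprocessor directive.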

3. As a symbol resolved at link time (somewhat awkward)

constants.s:

    .global ANSWER_TO_LIFE
    ANSWER_TO_LIFE = 0x42

code.s:

    mov $ANSWER_TO_LIFE, %eax
    add $ANSWER_TO_LIFE, %ebx   # not the optimal encoding
    mov $(ANSWER_TO_LIFE+17), %ecx
    #mov $(ANSWER_TO_LIFE*ANSWER_TO_LIFE), %ecx # error

Building:

as -o constants.o constants.s          # or gcc -c constants.s
as -o code.o code.s                    # etc
ld -o prog constants.o code.o code2.o  # or gcc

This is the approach mentioned by @fuz in comments. It treats the symbol ANSWER_TO_LIFE like a label that happens to be located at absolute address 0x42. The assembler treats it as it would any other label; it doesn't know its address at assembly time, so it leaves it as an unresolved reference in the object file code.o, which the linker will eventually resolve.

The only real benefit of this approach that I can see is that if we want to change the value of the constant, say to 0x43, we don't have to re-run the assembler on all our source files code.s code2.s ...; we only have to re-assemble constants.s and re-link. So we save a little bit of build time, but not much because assembling code is usually pretty fast anyway. (It might make a difference if we referenced the symbol from C or C++ code, for which compilation is slower, but see below.)

But there are some notable disadvantages:

Since the assembler doesn't know the value of the constant, it has to assume it could be of the largest size valid for each instruction in which it is used. In particular, in add $ANSWER_TO_LIFE, %ebx, it can't assume that the 8-bit 0x83 encoding will be usable, so it has to select the larger 32-bit encoding. So the instruction add $ANSWER_TO_LIFE, %ebx has to be assembled as 81 c3 00 00 00 00, where the 00 00 00 00 is replaced by the linker with the correct value 42 00 00 00. But we end up using 6 bytes on an instruction that ideally could have been encoded using 3 bytes.
On the flip side of this, immediate mov into a 64-bit register also has two encodings: one taking a sign-extended 32-bit immediate mov $imm32, %reg64 (opcode c7 with REX.W prefix), which is 7 bytes, and another taking a full 64-bit immediate mov $imm64, %reg64 (opcodes b8-bf with REX.W), which is 10 bytes. The assembler will by default select the 32-bit form, because the 64-bit one is really long and rarely needed. But if it turns out that your symbol has a value that doesn't fit in 32 bits, you'll get an error at link time ("relocation truncated to fit"), and you'll have to go back and force the 64-bit encoding by using the mnemonic movabs. If you had used approaches 1 or 2, the assembler would have known the value of your constant and would have selected the appropriate encoding in the first place.
If we want to do build-time arithmetic on the constant, we're limited to whatever arithmetic can be represented as relocations in the object file. Constant offsets work, so mov $(ANSWER_TO_LIFE+17), %ecx is okay; the object file tells the linker to fill in the relevant bytes with the value of the symbol ANSWER_TO_LIFE plus the constant 17. (For actual labels, you'd want this for something like accessing a member from a static struct.) But more general operations like multiplication aren't supported, because people wouldn't normally want to do those on addresses, so mov $(ANSWER_TO_LIFE*ANSWER_TO_LIFE), %edx results in an error from the assembler. If we need the square of the answer to life, we have to write a mul instruction to compute it at run time, which would be no fun if this is code that is called frequently and needs to be fast.
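To illustrate the mov versus movabs point, here is a sketch using a hypothetical external symbol BIG_CONSTANT (assumed to be defined with .global in some other object file). Because the symbol is undefined in this file, the assembler has to commit to an encoding without knowing the value:

```asm
    .text
    mov    $BIG_CONSTANT, %rax   # 7 bytes: 32-bit sign-extended field, link fails if the value doesn't fit
    movabs $BIG_CONSTANT, %rax   # 10 bytes: full 64-bit field, links for any value
```

Only the movabs form is safe for an arbitrary 64-bit value, at the cost of three extra bytes at every use.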

The constant can also be accessed from C code linked into our project, but it has to be treated like a label (address of a variable), which makes it look weird. We have to write something like

extern void *ANSWER_TO_LIFE;
printf("The answer is %lu\n", (unsigned long)&ANSWER_TO_LIFE);

If we try to write something more natural-looking like

extern unsigned long ANSWER_TO_LIFE;
printf("The answer is %lu\n", ANSWER_TO_LIFE);

the program will try to fetch the value from memory address 0x42, which will crash.

(Also, even in the first example, the assembly output by the compiler uses the mov mnemonic which again leads to the assembler selecting a 32-bit move. If ANSWER_TO_LIFE were larger than 2^32 then linking would fail, and this time it's not as easy to fix. AFAIK you'd need to give gcc an appropriate option to tell it to change its code model, which would cause every address load to use the less efficient 64-bit form, and you'd have to do this for your entire program.)

4. As a value stored in memory and fetched at runtime (inefficient)

constants.s:

    .section .rodata
    .global answer_to_life
answer_to_life:
    .int 0x42

code.s:

    mov answer_to_life, %eax
    add answer_to_life, %ebx

    # mov answer_to_life+17, %ecx # assembles, but loads from address answer_to_life+17; it does not add 17 to the value
    mov answer_to_life, %ecx
    add $17, %ecx   # needs two instructions

    # mov answer_to_life*answer_to_life, %edx # not valid
    mov answer_to_life, %eax
    mul %eax  # clobbers %edx

Building:

as -o constants.o constants.s
as -o code.o code.s
ld -o prog constants.o code.o code2.o

This approach is the equivalent of having const int answer_to_life = 0x42; in a C program (though C++ is different). The value 0x42 is stored in our program's memory and whenever we need to access it, we need an instruction that reads from memory; we can no longer encode it as an immediate within each instruction. This will typically be slower to execute. If we need to do any arithmetic on it, we have to write code to load it into a register and execute appropriate instructions at runtime, which takes cycles and code space.

I've changed the name here to lower case to match the convention for variables located in memory, as opposed to "compile time" constants which this no longer is. Also note the different syntax in the instructions; mov answer_to_life, %eax, without the $ sign, is a load from memory instead of an immediate move. $answer_to_life in this example gives you the address of the variable instead (which happens to be 0x402000 in my test program). If you want to be able to build a position-independent executable, which is the norm for modern Linux programs, you need to write answer_to_life(%rip) instead.
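As a sketch of the position-independent form, assuming answer_to_life is defined in constants.s as shown above and that the destination registers are arbitrary choices:

```asm
    .text
    mov answer_to_life(%rip), %eax   # load the 32-bit value, RIP-relative
    lea answer_to_life(%rip), %rdi   # address of the variable, instead of mov $answer_to_life, %rdi
```

Both instructions encode the location as an offset from the current instruction, so no absolute address is baked into the code.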

For the reasons noted above, this approach is not ideal for numerical constants that truly are known at compile time, but I include it for completeness and because you asked about it in comments.
