How to know whether assembly code is using RAM?

Question

I am very new to assembly, and this is a basic question.

I have just heard about the concept of using zero bytes of RAM.

I compiled a C++ program with

g++ -O3 main.cpp -S -o main3.s

main.cpp (source)

#include <iostream>
using namespace std;

int main()
{
    int low=10, high=100, i, flag;

    cout << "Prime numbers between " << low << " and " << high << " are: ";

    while (low < high)
    {
        flag = 0;

        for(i = 2; i <= low/2; ++i)
        {
            if(low % i == 0)
            {
                flag = 1;
                break;
            }
        }

        if (flag == 0)
            cout << low << " ";

        ++low;
    }

    return 0;
}

And here is the result:

main3.s

    .file   "main.cpp"
    .section    .rodata.str1.1,"aMS",@progbits,1
.LC0:
    .string "Prime numbers between "
.LC1:
    .string " and "
.LC2:
    .string " are: "
.LC3:
    .string " "
    .section    .text.startup,"ax",@progbits
    .p2align 4,,15
    .globl  main
    .type   main, @function
main:
.LFB1561:
    .cfi_startproc
    pushq   %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    movl    $22, %edx
    movl    $.LC0, %esi
    movl    $_ZSt4cout, %edi
    call    _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
    movl    $10, %esi
    movl    $_ZSt4cout, %edi
    call    _ZNSolsEi
    movl    $5, %edx
    movq    %rax, %rbx
    movl    $.LC1, %esi
    movq    %rax, %rdi
    call    _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
    movq    %rbx, %rdi
    movl    $100, %esi
    movl    $10, %ebx
    call    _ZNSolsEi
    movl    $.LC2, %esi
    movq    %rax, %rdi
    call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
    .p2align 4,,10
    .p2align 3
.L6:
    movl    %ebx, %esi
    sarl    %esi
    testb   $1, %bl
    je  .L2
    movl    $2, %ecx
    jmp .L3
    .p2align 4,,10
    .p2align 3
.L14:
    movl    %ebx, %eax
    cltd
    idivl   %ecx
    testl   %edx, %edx
    je  .L2
.L3:
    addl    $1, %ecx
    cmpl    %esi, %ecx
    jle .L14
    movl    %ebx, %esi
    movl    $_ZSt4cout, %edi
    call    _ZNSolsEi
    movl    $1, %edx
    movl    $.LC3, %esi
    movq    %rax, %rdi
    call    _ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
.L2:
    addl    $1, %ebx
    cmpl    $100, %ebx
    jne .L6
    xorl    %eax, %eax
    popq    %rbx
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
.LFE1561:
    .size   main, .-main
    .p2align 4,,15
    .type   _GLOBAL__sub_I_main, @function
_GLOBAL__sub_I_main:
.LFB2045:
    .cfi_startproc
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    movl    $_ZStL8__ioinit, %edi
    call    _ZNSt8ios_base4InitC1Ev
    movl    $__dso_handle, %edx
    movl    $_ZStL8__ioinit, %esi
    movl    $_ZNSt8ios_base4InitD1Ev, %edi
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    jmp __cxa_atexit
    .cfi_endproc
.LFE2045:
    .size   _GLOBAL__sub_I_main, .-_GLOBAL__sub_I_main
    .section    .init_array,"aw"
    .align 8
    .quad   _GLOBAL__sub_I_main
    .local  _ZStL8__ioinit
    .comm   _ZStL8__ioinit,1,1
    .hidden __dso_handle
    .ident  "GCC: (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0"
    .section    .note.GNU-stack,"",@progbits

This is a basic program that can keep all of its variables in CPU registers, so I guess it does not use RAM. What are the criteria for checking whether assembly code is using RAM?

Peter Cordes · Accepted Answer

In the clip you linked, Jason Turner just said that the C local variables all fit in registers, so the compiler doesn't ever have to spend extra instructions spilling/reloading them.

It's using RAM to store code and data, it's just not using any stack memory to store local variables. i.e. zero bytes of RAM for local variables, of course not zero bytes total. He even says the game compiles to 1005 bytes (of code + data).

You detect this when reading asm by noting a lack of loads/stores to the stack, e.g. with addressing modes using RSP (or RBP if used as a frame pointer), on x86-64.
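As a rough illustration of that check, here is a minimal sketch that counts lines with an explicit RSP- or RBP-based memory operand in AT&T-syntax output such as main3.s. The function name count_stack_accesses is hypothetical, not part of any tool, and the text matching is only a heuristic: it ignores the implicit stack accesses of push, pop, call and ret, and it does not recognize RSP/RBP used as an index register.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: counts lines of AT&T-syntax x86-64 assembly that
// contain a memory operand whose base register is %rsp or %rbp,
// e.g. "-20(%rbp)" or "8(%rsp)". Plain register uses such as
// "subq $8, %rsp" are not memory accesses and are not counted.
int count_stack_accesses(const std::string& asm_text) {
    assert(asm_text.find('\0') == std::string::npos);  // expects plain text
    std::istringstream in(asm_text);
    std::string line;
    int count = 0;
    while (std::getline(in, line)) {
        // GAS uses '#' to start a comment on x86; drop it before matching.
        std::size_t hash = line.find('#');
        if (hash != std::string::npos) {
            line.erase(hash);
        }
        if (line.find("(%rsp") != std::string::npos ||
            line.find("(%rbp") != std::string::npos) {
            ++count;
        }
    }
    return count;
}
```

Run over the main in the question, a check like this would find nothing: the only stack traffic there is the pushq %rbx / popq %rbx pair, which saves and restores a call-preserved register rather than storing a local variable.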

This is totally normal for functions that aren't huge. For larger programs, inlining function calls is key to making it happen, because compilers usually have to have memory "in sync" (reflecting the correct values of the C abstract machine) when calling a non-inline function.
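A small sketch of the "in sync" point, under stated assumptions: the names bump, kept_in_register and forced_to_memory are made up for illustration, and __attribute__((noinline)) is a GCC/Clang extension used here only to keep the call opaque. With optimization enabled, a compiler will typically keep tmp in a register in the first function and give it a stack slot in the second, but the exact output depends on the compiler and version.

```cpp
#include <cassert>

// Stand-in for a function the compiler cannot inline (hypothetical name).
__attribute__((noinline)) void bump(int* p) {
    assert(p != nullptr);
    *p += 1;
}

int kept_in_register(int x) {
    int tmp = x * x;  // address never taken: free to live in a register
    return tmp + 1;
}

int forced_to_memory(int x) {
    int tmp = x * x;
    bump(&tmp);       // address escapes to a real call: tmp must exist in
                      // memory so the callee can read and modify it
    return tmp;       // reloaded after the call
}
```

Both functions compute the same value; comparing their -O3 asm should show an RSP-relative store and load only around the call to bump.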

int foo(int num) {
    int tmp = num * num;
    return tmp;
}

This function gets num in a register and keeps tmp there. Jason's talk used Godbolt, so here is the same function as compiled on Godbolt by gcc7.3, with and without optimization:

foo:    # with optimization: all operands are registers
    imul    edi, edi
    mov     eax, edi
    ret

foo:    # without optimization:
    push    rbp
    mov     rbp, rsp                     # make a stack frame with RBP
    mov     DWORD PTR [rbp-20], edi      # spill num to the stack
      # start of code for first C statement
    mov     eax, DWORD PTR [rbp-20]      # reload it
    imul    eax, DWORD PTR [rbp-20]      # and use it from memory again
    mov     DWORD PTR [rbp-4], eax       # spill tmp to the stack
      # end of first C statement

    mov     eax, DWORD PTR [rbp-4]       # load tmp into the return-value register (eax)
    pop     rbp
    ret

The unoptimized version didn't have to reserve any stack space with sub rsp, 24, because foo is a leaf function and can use the red zone below RSP (part of the x86-64 System V ABI) for the locals it spills and reloads.

Obviously with optimization enabled, you won't get code this bad even when a compiler does run out of registers in a large complex function and has to spill something. -O0 is kind of an anti-optimization mode where each C statement gets a separate block of asm, so you can set breakpoints and modify variables and have the code still work. Or even jump to a different source line in gdb!

Re: How many registers does x86 have, as mentioned in the talk:

i386 has 8 architectural integer registers. It has some segment registers you could abuse to keep extra values, and if it has an FPU there are 8 x87 80-bit FP stack registers. Jason's guess of 16 sounds bogus, but he may be counting AL/AH, BL/BH as separate 8-bit registers, because you can use them independently. But not at the same time as EAX, because the narrow registers are subsets of full registers.

(And beware of partial-register penalties on various modern microarchitectures. On AMD, AL and AH aren't independent at all; using one has a false dependency on the other, i.e. on the whole EAX/RAX. On CPUs up to and including Pentium P5MMX, there were no partial-register penalties at all, because those CPUs had no out-of-order execution or register renaming.)
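To make the "subsets of full registers" point concrete, here is a purely illustrative model in C++ that treats EAX as a 32-bit integer, with AL as bits 0-7 and AH as bits 8-15. The helper names are hypothetical; this only models the architectural overlap, not the partial-register performance effects described above.

```cpp
#include <cassert>
#include <cstdint>

// Writing AL replaces bits 0-7 of EAX and leaves the rest untouched.
std::uint32_t write_al(std::uint32_t eax, std::uint8_t v) {
    return (eax & 0xFFFFFF00u) | v;
}

// Writing AH replaces bits 8-15 of EAX and leaves the rest untouched.
std::uint32_t write_ah(std::uint32_t eax, std::uint8_t v) {
    return (eax & 0xFFFF00FFu) | (static_cast<std::uint32_t>(v) << 8);
}

// Reading AL or AH just extracts the corresponding byte of EAX.
std::uint8_t read_al(std::uint32_t eax) {
    return static_cast<std::uint8_t>(eax & 0xFFu);
}

std::uint8_t read_ah(std::uint32_t eax) {
    return static_cast<std::uint8_t>((eax >> 8) & 0xFFu);
}

// AL and AH can hold two independent 8-bit values, but a later write to
// the full EAX replaces both, so they are not extra storage on top of EAX.
std::uint32_t demo_overlap() {
    std::uint32_t eax = 0x12345678u;
    eax = write_al(eax, 0xAA);
    eax = write_ah(eax, 0xBB);
    assert(read_al(eax) == 0xAA && read_ah(eax) == 0xBB);
    return eax;
}
```

In this model, storing a new 32-bit value in eax discards whatever was in the AL and AH bytes, which is why counting them as additional registers overstates what is available.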

His claim that modern x86-64 has hundreds of registers is also definitely bogus, unless you count all the control registers and model-specific registers. But stack memory is much faster than those registers, and you can't put arbitrary values in them anyway. With only 16 architectural integer registers (one of them being the stack pointer, so really 15 regs you can use in a big function), you still need extra instructions to spill or at least reload stuff when you need more variables "live" at once than that.

Register renaming onto a large pool of physical registers is great, and essential along with a large ReOrder Buffer for a large out-of-order execution window to find instruction-level parallelism. But you can only take advantage of these registers by reusing the same integer registers for different values. (i.e. register renaming avoids write-after-read and write-after-write hazards, making two uses of the same register actually independent.)

Haswell has a 168-entry physical register file for integer/GP registers, and also a 168-entry vector/FP register file for renaming FP / vector registers (see https://www.realworldtech.com/haswell-cpu/3/). But architecturally it only has 16 GP / 16 YMM registers when running in x86-64 mode, or 8 / 8 in IA-32 mode.
