Asperger

Reputation: 3222

Displaying all ascii characters in linux console (NASM assembly)

I read a tutorial on NASM and there is a code example which displays the entire ASCII character set. I understand pretty much everything except why are we pushing ecx and popping ecx, as I don't see how it relates to the rest of the code. ecx has the value of 256 since we want all chars, but I have no idea where and how it's used. What exactly is happening when we push and pop ecx? Why are we moving the address of achar to dx? I don't see us using dx for anything. I understand that we need to increment the address of achar, but I'm confused about how the increment relates to ecx and dx. I would appreciate some insight.

   section  .text
       global _start        ;must be declared for using gcc

    _start:                 ;tell linker entry point
       call    display
       mov  eax,1           ;system call number (sys_exit)
       int  0x80            ;call kernel

    display:
       mov    ecx, 256

    next:
       push    ecx
       mov     eax, 4
       mov     ebx, 1
       mov     ecx, achar
       mov     edx, 1
       int     80h

       pop     ecx  
       mov  dx, [achar]
       cmp  byte [achar], 0dh
       inc  byte [achar]
       loop    next
       ret

    section .data
    achar db '0'  

Upvotes: 3

Views: 2264

Answers (1)

Ped7g

Reputation: 16606

I understand pretty much everything

Well, then you are quite a bit ahead of me... (although judging from your further comments, you have become aware of some other nonsensical things in that code :) ).

why are we pushing ecx and popping ecx, as I don't see how it relates to the rest of the code. ecx has the value of 256 since we want all chars, but I have no idea where and how it's used.

It is used by the LOOP instruction (which is not a good idea: Why is the loop instruction slow?). LOOP decrements ecx and jumps to the label while the result is non-zero, i.e. it is a count-down loop mechanism.

As the int 0x80 service call needs ecx for the memory address of the buffer, the counter is saved and restored by push/pop around it. A more performant way would be to keep the counter in some spare register, for example esi, and do dec esi / jnz next. An even more performant way would be to reuse the character value itself: if the output started at the value zero (not at the digit zero), the zero flag after inc byte [achar] could be used as the loop condition, because the byte wraps to zero after 255.
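A minimal sketch of the first suggestion, assuming 32-bit Linux, where int 0x80 changes only eax and leaves esi intact. The start value '0' and the count 256 are taken unchanged from the question, so the output is the same; only the counter handling differs:

```nasm
; Sketch only: same loop as in the question, but the counter lives in esi,
; so ecx stays free for the sys_write buffer address and no push/pop is needed.
section .text
    global _start

_start:
    mov     esi, 256            ; loop counter (same count as the question)
next:
    mov     eax, 4              ; sys_write
    mov     ebx, 1              ; file descriptor STDOUT
    mov     ecx, achar          ; address of the byte to write
    mov     edx, 1              ; length: one byte
    int     0x80
    inc     byte [achar]        ; next character value
    dec     esi                 ; count down
    jnz     next                ; repeat until esi reaches zero

    mov     eax, 1              ; sys_exit
    xor     ebx, ebx            ; exit status 0
    int     0x80

section .data
achar:  db '0'
```

It can be built with the same nasm and ld commands as the other examples in this answer.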

achar db '0'

It's not clear to me why "display all ASCII characters" starts at the digit zero (value 48); that seems weird to me, I would start at zero. But that has another caveat: Linux console I/O encoding is set by the environment, and on any common Linux installation it is UTF-8 nowadays. So the only valid printable single-byte characters are the values 32-126 (which are identical to ordinary 7-bit ASCII, making this part of the example work well), while the values 0-31 and 127 are non-printable control characters, also identical to 7-bit ASCII. In UTF-8, the values 128-255 occur only as parts of multi-byte characters (example: ř is the two bytes 0xC5 0x99), and as single bytes they are invalid byte sequences, because the remaining bytes of the UTF-8 code point are missing.
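To illustrate the multi-byte point, here is a small sketch that writes ř as its two UTF-8 bytes followed by a new-line. It assumes a terminal configured for UTF-8; the label names msg and msgLen are made up for this example:

```nasm
; Sketch only: a non-ASCII character must be written as its complete
; UTF-8 byte sequence, not as a single byte in the range 128-255.
section .text
    global _start

_start:
    mov     eax, 4              ; sys_write
    mov     ebx, 1              ; file descriptor STDOUT
    mov     ecx, msg            ; address of the byte sequence
    mov     edx, msgLen         ; number of bytes (3)
    int     0x80

    mov     eax, 1              ; sys_exit
    xor     ebx, ebx            ; exit status 0
    int     0x80

section .data
msg:    db 0xC5, 0x99, 0x0A     ; "ř" (U+0159) in UTF-8, then new-line
msgLen: equ $ - msg
```

Writing only the 0xC5 byte would leave the sequence incomplete, which is exactly what the original example does for values above 127.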

In the age of DOS you could have written code that writes full 8-bit values from 0 to 255 directly into VGA text-mode video memory, and each value has a distinct graphical representation. You could load a custom font into the VGA or use a known code page for particular characters. This is also sometimes referred to as "extended ASCII", but the common DOS installation had a different one from the link in your comments, with many more box-drawing characters. This included the \r and \n control characters, which for VGA are just more font glyphs, not carriage-return and line-feed control characters (that meaning is created by the BIOS/DOS service call, which, instead of outputting the \n character, moves the internal cursor to the next line and discards the character from the output).
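A rough sketch of what such DOS code could look like, under several assumptions: a .COM program built with nasm -f bin, colour text mode with video memory at segment 0xB800, and DOS or an emulator to run it. The file name chars.asm is hypothetical. It also shows the zero-flag trick for ending the loop when the character value wraps around:

```nasm
; Sketch only: 16-bit DOS .COM program.
; Assumed build command: nasm -f bin chars.asm -o chars.com
    bits    16
    org     0x100

start:
    mov     ax, 0xB800
    mov     es, ax              ; es:di -> top-left character cell
    xor     di, di
    cld                         ; make stosw advance di
    mov     ax, 0x0700          ; ah = attribute (light grey on black), al = char 0
nextChar:
    stosw                       ; store character + attribute, di += 2
    inc     al                  ; next character value
    jnz     nextChar            ; al wraps to 0 after 255, which ends the loop
    ret                         ; .COM exit: returns to the int 0x20 in the PSP
```

Every value from 0 to 255 ends up as a glyph on screen, including 10 and 13, because nothing interprets them as control characters on this path.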

It's impossible to re-create this with Linux console I/O (unless the font contains all the weird DOS glyphs, and you output their correct UTF-8 encodings instead of single-byte values).

The conclusion is that the example starts with the value '0' (48), and up to the value 126 it outputs correct printable ASCII characters. After 126 it outputs "something" (and because the counter is 256, the byte wraps from 255 to 0 and the loop continues up to 47, through the control characters). As those bytes will sometimes form invalid UTF-8 encodings, I would technically call it "bogus" output with undefined behaviour; you will probably get different results for different Linux versions and console settings.

Also a NASM style note: put a colon after labels, i.e. achar: db '0'. That will save you when you accidentally use an instruction mnemonic as a label, like loop: or dec: db 'd'.

   mov  dx, [achar]

dx is not used any further, so this is a useless instruction (it also loads two bytes, one more than the single byte defined at achar).

   cmp  byte [achar], 0dh

Flags from this compare are not used any further either, so this is also useless.


So the adjusted example can look like this:

section  .text
    global _start       ;makes symbol "_start" global (visible for linker)

_start:                 ;tell linker entry point
    call    display
    mov     eax,1       ;system call number (sys_exit)
    xor     ebx,ebx     ;exit status 0 (ebx is still 1 from sys_write)
    int     0x80        ;call kernel

; displays all valid printable ASCII characters (32-126), and new-line after.
display:
    mov     byte [achar], ' '   ; first valid printable ASCII
next:
    mov     eax, 4
    mov     ebx, 1
    mov     ecx, achar
    mov     edx, 1
    int     0x80
    inc     byte [achar]
    cmp     byte [achar], 126
    jbe     next        ; repeat until all chars are printed
    ; that will output all 32..126 printable ASCII characters

    ; display one more character, new line (reuse of registers)
    mov     byte [achar], `\n`  ; NASM uses backticks for C-like meta chars
    mov     eax, 4      ; ebx, ecx and edx are already set from loop above
    int     0x80
    ret

section .bss
achar: resb 1           ; reserve one byte for character output

But it would make more sense to prepare the whole output in memory first and write it in one go, like this:

section  .text
    global _start       ;makes symbol "_start" global (visible for linker)

_start:                 ;linker's default entry point
    call    display
    mov     eax,1       ;system call number (sys_exit)
    xor     ebx,ebx     ;exit status 0 (ebx is still 1 from sys_write)
    int     0x80        ;call kernel

; displays all valid printable ASCII characters (32-126), and new-line after.
display:
    ; prepare in memory string with all ASCII chars and new-line
    mov     al,' '      ; first valid printable ASCII
    mov     edi, allAsciiChars
    mov     ecx, edi    ; this address will be used also for "write" int 0x80
nextChar:
    mov     [edi], al
    inc     edi
    inc     al
    cmp     al, 126
    jbe     nextChar
    ; add one more new line at end
    mov     byte [edi], `\n`
    ; display the prepared "string" in one "write" call
    mov     eax, 4      ; sys_write, ecx is already set
    mov     ebx, 1      ; file descriptor STDOUT
    lea     edx, [edi+1]; edx = edi+1 (memory address beyond last char)
    sub     edx, ecx    ; edx = length of generated string
    int     0x80
    ret

section .bss
allAsciiChars: resb 126-' '+1+1 ; reserve space for ASCII characters and \n

All examples were tried with nasm 2.11.08 on 64-bit Linux ("KDE neon" distro based on Ubuntu 16.04), and built with these commands:

nasm -f elf32 -F dwarf -g test.asm -l test.lst -w+all
ld -m elf_i386 -o test test.o

with output:

$ ./test
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Upvotes: 3
