evodevo

Reputation: 509

How does a .com disassembler know where the code ends and the data starts?

Consider an old .com executable file assembled from code like this:

.model tiny             ; com program
.code                   ; code segment
org 100h                ; code starts at offset 100h    

main proc near
   mov ah,09h           ; function to display a string
   mov dx,offset message    ; offset of the message string, terminated with $
   int 21h              ; DOS interrupt

   mov ah,4ch           ; function to terminate
   mov al,00            ; exit code 0
   int 21h              ; DOS interrupt
main endp
message db "Hello World $"      ; Message to be displayed terminating with a $
end main

In hex, it looks like this:

B4 09 BA 0D 01 CD 21 B4 4C B0 00 CD 21 48 65 6C 6C 6F 20 57 6F 72 6C 64 20 24

How does the disassembler know where the code ends and the string "Hello World $" starts?

Upvotes: 0

Views: 320

Answers (1)

nrz

Reputation: 10570

A disassembler does not know where the code ends and where the data starts in a .com file, because the .com format makes no such distinction. The whole file is loaded into a single segment, and since DOS runs in real mode without any kind of memory protection, nothing stops you from executing bytes that look like regular text. For example, you can write obfuscated code that reads as text and jump into it (this may crash DOS, I haven't tested it):

_start: jmp hello

hello:
db "Hello World!"

ret

So db "Hello World $" is perfectly valid 16-bit code (checked with udcli, the disassembler that comes with the udis86 disassembler library for x86 and x86-64, on Linux):

$ echo `echo 'Hello World $' | tr -d "\n" | od -An -t xC` | udcli -x -16

0000000000000000 48               dec ax            ; H
0000000000000001 656c             insb              ; el
0000000000000003 6c               insb              ; l
0000000000000004 6f               outsw             ; o
0000000000000005 20576f           and [bx+0x6f], dl ; <space>Wo
0000000000000008 726c             jb 0x76           ; rl
000000000000000a 642024           and [fs:si], ah   ; d<space>$

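However, the last three bytes, db 0x64 0x20 0x24, are not a complete instruction in 32-bit or 64-bit mode: with 32-bit or 64-bit addressing, the ModRM byte 0x24 requires a SIB byte to follow, and the input ends there.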

This is the 32-bit disassembly of db "Hello World $":

$ echo `echo 'Hello World $' | tr -d "\n" | od -An -t xC` | udcli -x -32

0000000000000000 48               dec eax            ; H
0000000000000001 656c             insb               ; el
0000000000000003 6c               insb               ; l
0000000000000004 6f               outsd              ; o
0000000000000005 20576f           and [edi+0x6f], dl ; <space>Wo
0000000000000008 726c             jb 0x76            ; rl
000000000000000a 642024           invalid            ; d<space>$
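
The difference between the two listings comes down to how the ModRM byte 0x24 is interpreted. Here is a minimal Python sketch of that decoding; the helper names `modrm_fields` and `needs_sib` and the `RM16` table are made up for illustration and are not part of udis86:

```python
def modrm_fields(byte):
    """Split a ModRM byte into its (mod, reg, rm) bit fields."""
    return byte >> 6, (byte >> 3) & 0b111, byte & 0b111


# 16-bit addressing forms for mod == 0 (rm == 6 is a bare 16-bit displacement).
RM16 = ["[bx+si]", "[bx+di]", "[bp+si]", "[bp+di]",
        "[si]", "[di]", "[disp16]", "[bx]"]


def needs_sib(byte):
    """With 32-bit addressing, rm == 4 and mod != 3 means a SIB byte follows."""
    mod, _, rm = modrm_fields(byte)
    return mod != 0b11 and rm == 0b100


mod, reg, rm = modrm_fields(0x24)
print(mod, reg, rm)      # mod=0, reg=4 (ah for an 8-bit operand), rm=4
print(RM16[rm])          # the 16-bit operand that udcli shows as [fs:si]
print(needs_sib(0x24))   # True: the 32-bit decoder wants one more byte
```

With 16-bit addressing, rm = 4 simply selects [si], so the three bytes decode cleanly. With 32-bit addressing the same rm value is the escape to a SIB byte, and since the string ends at `$`, the decoder runs out of input.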

What a disassembler can do is use heuristics and code tracing to decide which parts of the file to print as code and which as data. But it can never know for certain where code ends and where data begins, because in .com files that distinction exists only in the programmer's head, possibly in the source code and in the assembler's limitations, but not in the binary .com file format itself.
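
To give a rough idea of what code tracing means, here is a minimal Python sketch run on the bytes from the question. It is not how any particular disassembler works: the `OPCODES` table is a toy decoder covering only the four opcodes in this program, and both rules it applies (execution stops after int 21h with AH=4Ch, and the immediate loaded into DX is an address) are heuristics assumed for this example.

```python
# Bytes of the .com file from the question; DOS loads them at offset 0x100.
IMAGE = bytes.fromhex(
    "B4 09 BA 0D 01 CD 21 B4 4C B0 00 CD 21 "
    "48 65 6C 6C 6F 20 57 6F 72 6C 64 20 24"
)
BASE = 0x100

# Toy decoder (assumption): opcode -> (mnemonic, total instruction length).
OPCODES = {
    0xB0: ("mov al, imm8", 2),
    0xB4: ("mov ah, imm8", 2),
    0xBA: ("mov dx, imm16", 3),
    0xCD: ("int imm8", 2),
}


def trace(image, base=BASE):
    """Follow execution from the entry point; return (code_end, data_refs)."""
    pc = 0
    ah = None          # last immediate loaded into AH, if any
    data_refs = []     # file offsets that the code appears to point at
    while pc < len(image):
        opcode = image[pc]
        if opcode not in OPCODES:
            break      # unknown to the toy decoder: stop tracing
        _, length = OPCODES[opcode]
        operand = image[pc + 1:pc + length]
        if opcode == 0xB4:
            ah = operand[0]
        elif opcode == 0xBA:
            # Heuristic: treat the value loaded into DX as an address.
            data_refs.append(int.from_bytes(operand, "little") - base)
        pc += length
        if opcode == 0xCD and operand[0] == 0x21 and ah == 0x4C:
            break      # DOS terminate call: nothing after it is reached
    return pc, data_refs


code_end, data_refs = trace(IMAGE)
print(f"traced code: file offsets 0..{code_end - 1}")
print(f"data referenced at file offsets: {data_refs}")
print(IMAGE[code_end:].decode("ascii"))
```

On this input the trace stops after the second int 21h, and the only address loaded into DX points at the first byte after it, so both heuristics agree that the rest is the string. Code that jumps into its own "text", like the example above, defeats exactly this kind of guess.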

Upvotes: 1
