Why are disassembled data becoming instructions?

Question

I need some help understanding what happens when this fragment of code runs: "jmp Begin". I only understand that a .com file can be at most 64 KB, so you want to put everything in one segment, and that you need the jmp if you want to declare variables before the code. But when I search for it, many guides just say in a comment that jmp Begin is only there to skip the data and nothing else. So here is my question: what exactly happens at this moment:

It appears that it runs this

        mov     al, a
        mov     bl, b
        sub     al, bl

But I can't understand why it looks like this in Turbo Debugger. When I change the starting value of Result from ? to something greater than 0, it changes to something else, and when I change it to, for example, 90, it looks completely normal. I am completely new to assembly and I can't seem to grasp it at all. Here is my whole code:

            .MODEL TINY

Code        SEGMENT

            ORG    100h
            ASSUME CS:Code, DS:Code

Start:
                jmp     Begin
a               EQU     20
b               EQU     10
c               EQU     100
d               EQU     5
Result          DB      ?


Begin:

            mov     al, a
            mov     bl, b
            sub     al, bl
            mov     bl, c
            mul     bl
            mov     bl, d
            div     bl              
            mov     Result, al
            mov     ah, 4ch
            int     21h

Code        ENDS
            END             Start

pasztorpisti · Accepted Answer

I'll try to give you an explanation.

The problem is that in the old days (and this is partly still true today) the processors didn't differentiate code and data bytes in memory. This means that any byte in your .com file can be used as both code and data. The debugger has no clue which bytes will be executed as code and which bytes will be used as data. A byte can actually be used as both code and data in tricky cases... Your program can create data in memory that is valid as code and you can jump onto it to execute it.

In many (but not all) cases the debugger could actually find out what is code and what is data, but this kind of code analysis can get very complex, so most debuggers/disassemblers simply don't have such a code flow analyzer. Instead they pick an offset in your file/memory (usually the current instruction pointer) and, starting from this offset, decode consecutive bytes as assembly instructions one after another, without following any jmp instructions, until the debugger screen is filled with disassembled lines. Dumb disassemblers/debuggers don't care whether the disassembled bytes are actually used as instructions or data in your program; they treat them all as instructions.

If you are debugging your program and the debugger stops at a breakpoint then it takes the current instruction pointer and performs a dumb disassembly again starting from that offset with the primitive "fill the debugger screen" method.

This serial disassembly of consecutive bytes is a simple method that works most of the time. If you serially decode non-jmp instructions that follow each other, then you can be almost sure that the processor will execute them in this order. However, once you reach and decode a jmp instruction, you can't be sure that the following bytes are valid as code. You can still try to decode them as instructions, hoping that there is no data mixed into the middle of the code (and yes, in most cases there is no data after a jmp or similar control flow instruction, which is why debuggers give you a dumb disassembly as a "possibly useful prediction"). In fact, most code is full of conditional jumps, and disassembling the bytes after them as code is very useful help from the debugger. Having data in the middle of the code after a jump instruction is quite rare, so we can treat it as an edge case.
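The "fill the screen" method described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the names `linear_sweep`, `LENGTHS` and `NAMES` are made up for this sketch, and the length table covers just three opcodes, whereas a real 8086 decoder often needs more than the first byte to determine the instruction length.

```python
# Toy tables (assumption): only three opcodes are known to this sketch.
LENGTHS = {0xEB: 2, 0x90: 1, 0xCD: 2}
NAMES = {0xEB: "jmp short", 0x90: "nop", 0xCD: "int"}

def linear_sweep(code, start, max_lines):
    """Decode consecutive bytes as instructions, never following jumps."""
    lines = []
    offset = start
    while offset < len(code) and len(lines) < max_lines:
        opcode = code[offset]
        length = LENGTHS.get(opcode, 1)   # unknown byte: show it as a 1-byte "db"
        raw = code[offset:offset + length]
        lines.append((offset, raw.hex(" "), NAMES.get(opcode, "db")))
        offset += length                  # blindly continue with the next byte
    return lines

# Bytes eb 01 90 cd 20: a short jmp over one data byte, then int 20h.
program = bytes([0xEB, 0x01, 0x90, 0xCD, 0x20])
for offset, raw, name in linear_sweep(program, start=0, max_lines=25):
    print(f"{offset:04x}   {raw:<12} {name}")
```

Note that the loop never looks at where the jmp goes; it only adds the instruction length to the offset, which is exactly why data placed after a jmp gets decoded as if it were code.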

Let's assume that you have a simple .com program that just jumps over some data and then exits with an int 20h:

    jmp start
    db  90h
start:
    int 20h

The disassembler would probably tell you something like the following by disassembling starting from offset 0000:

--> 0000   eb 01        jmp short 0003
    0002   90           nop
    0003   cd 20        int 20h

Cool, this looks exactly like our asm source code... Now let's change the program a bit: let's change the data...

    jmp start
    db  0cdh
start:
    int 20h

Now the disassembler will show you this:

--> 0000   eb 01        jmp short 0003
    0002   cd cd        int cdh
    0004   20 ...... whatever...

The problem is that some instructions consist of more than 1 byte, and the debugger doesn't care whether the bytes represent code or data for you. In the above example, if the disassembler serially disassembles bytes from offset 0000 to the end of your program (including your data), then your 1-byte data disassembles into a 2-byte instruction ("stealing" the first byte of your actual code), so the next instruction the debugger tries to disassemble starts at offset 0004 instead of 0003, where your jmp actually jumps. In the first example we didn't have this problem because the data disassembled into a 1-byte instruction, so after the data the next instruction to disassemble happened to start at offset 0003, which is exactly the target of your jmp.
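To make the "stolen byte" effect concrete, here is a small Python sketch that compares the instruction boundaries a serial disassembly produces for both example programs with the real target of the jmp. The helper names `sweep_starts` and `short_jmp_target` and the three-entry length table are assumptions made for this illustration, not part of any real tool.

```python
LENGTHS = {0xEB: 2, 0x90: 1, 0xCD: 2}   # toy table (assumption): jmp short, nop, int

def sweep_starts(code, start=0):
    """Offsets at which a serial disassembly believes instructions begin."""
    starts = []
    offset = start
    while offset < len(code):
        starts.append(offset)
        offset += LENGTHS.get(code[offset], 1)   # unknown bytes count as 1 byte
    return starts

def short_jmp_target(code, offset):
    """eb rel8: target = offset of the next instruction + signed 8-bit displacement."""
    rel = int.from_bytes(code[offset + 1:offset + 2], "little", signed=True)
    return offset + 2 + rel

first = bytes([0xEB, 0x01, 0x90, 0xCD, 0x20])    # data byte 90h
second = bytes([0xEB, 0x01, 0xCD, 0xCD, 0x20])   # data byte cdh

for program in (first, second):
    starts = sweep_starts(program)
    target = short_jmp_target(program, 0)
    # If the target is not in the list, the listing on screen is out of step
    # with what the processor will actually execute after the jmp.
    print(starts, f"jmp target {target:04x}", "in sync" if target in starts else "out of sync")
```

In the second program the jmp target 0003 is not one of the decoded instruction starts, which is the situation the debugger screen shows until you step over the jmp and it disassembles again from the new instruction pointer.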

However what the debugger shows to you in this case is fortunately not what will happen when your program gets executed. By executing one instruction the program would actually jump to offset 0003 and the debugger would do a dumb disassembly again but this time starting from offset 0003 that is in the middle of an instruction in the previous incorrect disassembly...

Let's say you debug the second example program and you execute all the instructions in it one by one. When you start the program with instruction pointer == 0000 the debugger shows this:

--> 0000   eb 01        jmp short 0003
    0002   cd cd        int cdh
    0004   20 ...... whatever...

However when you trigger the "step" command to execute one instruction the instruction pointer (IP) changes to 0003 and the debugger performs a "dumb disassembling" again from offset 0003 till the debugger screen is filled up so you will see this:

--> 0003   cd 20      int 20h
    0005   ...... whatever...

Conclusion: If you have dumb disassemblers and you mix data into the middle of your code (with jmps around the data) then the dumb disassembler will treat your data as code and this may cause the "minor" issue you've encountered.

An advanced disassembler with flow analysis (like IDA Pro) would do the disassembling by following the jump instructions. After disassembling your jmp at offset 0000 it would find out that the next instruction to disassemble is the target of the jmp at 0003, and it would disassemble the int 20h as the next step. It would mark the db 0cdh byte at offset 0002 as data.
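A very reduced sketch of this flow-following idea in Python is shown below. It is not how IDA Pro is implemented; the function name `follow_flow` is hypothetical, only `jmp short` and `int` are decoded as 2-byte instructions, every other byte is treated as a 1-byte instruction, and `int 20h` is assumed to end the program.

```python
def follow_flow(code, entry=0):
    """Mark bytes reached by following control flow; everything else is assumed to be data."""
    is_code = [False] * len(code)
    worklist = [entry]                       # offsets that still need to be disassembled
    while worklist:
        offset = worklist.pop()
        # Simplification: stop when we reach bytes that are already marked as code.
        while 0 <= offset < len(code) and not is_code[offset]:
            opcode = code[offset]
            length = 2 if opcode in (0xEB, 0xCD) else 1
            if offset + length > len(code):
                break                        # truncated instruction, give up on this path
            for i in range(offset, offset + length):
                is_code[i] = True
            if opcode == 0xEB:               # jmp short rel8: continue at the target only
                rel = int.from_bytes(code[offset + 1:offset + 2], "little", signed=True)
                worklist.append(offset + 2 + rel)
                break
            if opcode == 0xCD and code[offset + 1] == 0x20:
                break                        # int 20h terminates a .com program
            offset += length
    return is_code

program = bytes([0xEB, 0x01, 0xCD, 0xCD, 0x20])   # jmp start / db 0cdh / int 20h
marks = follow_flow(program)
data_offsets = [i for i, used in enumerate(marks) if not used]
print("data at offsets:", data_offsets)
```

Because the bytes after the jmp are never decoded unless some path actually reaches them, the data byte at offset 0002 stays unmarked instead of swallowing the first byte of the int 20h.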

Additional explanation:

As you have already noticed, an instruction in the (quite outdated) 8086 instruction set can be anywhere from 1 to 6 bytes long, but a jmp or call can jump anywhere in memory with byte granularity. The length of the instruction can usually be determined from the first 1 or 2 bytes of the instruction. However, bytes "stick together" into an instruction only when the processor targets the first byte of the instruction with its special IP (instruction pointer) register and tries to execute the bytes at the given offset. Let's see a tricky example: you have the bytes eb ff 26 05 00 03 00 in memory at offset 0000 and you execute them step by step.

--> 0000   eb ff        jmp short 0001
    0002   26 05 00 03  es: add ax, 300h
    0006   00 ...... whatever...

The processor instruction pointer (IP) points to offset 0000 so it decodes an instruction and the bytes there "stick together into an instruction" for the time of execution. (The processor performs instruction decoding at 0000.) Since the first byte is eb it knows that the instruction length is 2 bytes. The debugger also knows this so it decodes the instruction for you and also generates some additional buggy disassembly based on the incorrect assumption that at some point the processor would execute an instruction at offset 0002, and then at offset 0006, etc... As you will see this isn't true, the processor will stick together bytes into instructions at quite different offsets.

As you see, my tricky byte code contains a jmp that jumps to offset 0001, which is in the middle of the executed jmp instruction itself! This however isn't a problem at all. The processor doesn't care about it and happily jumps to offset 0001, so as a next step it will try to decode an instruction (or "stick together bytes") there. Let's see what kind of instruction the processor will find at 0001:

--> 0001   ff 26 05 00  jmp word ptr [5]
    0005   03 00        add ax, word ptr [bx+si]

As you see we have our next instruction at 0001 and the debugger shows us some garbage disassembly at offset 0005 based on the false assumption that the processor will get to that offset at some point...

The instruction at 0001 tells the processor to pick up a word from offset 0005 and interpret it as an offset to jump there. As you see the value of word ptr [5] is 3 (as a little endian 16 bit value) so the processor puts 3 into its IP register (jumps to 0003). Let's see what it finds at offset 0003:

--> 0003   05 00 03     add ax, 300h

It would be difficult to show a disassembly for my tricky byte code eb ff 26 05 00 03 00 in the style of the debugger because the actual instructions executed by the processor are in overlapping memory areas. First the processor executed bytes 0000-0001, then 0001-0004, and finally 0003-0005.
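The three steps can be replayed with a tiny Python sketch that decodes just the three instruction forms involved and records which byte ranges get executed. The function name `step` is made up for this illustration, and the sketch assumes the bytes sit at offset 0000 of the segment that the indirect jmp reads its target from.

```python
code = bytes([0xEB, 0xFF, 0x26, 0x05, 0x00, 0x03, 0x00])

def step(code, ip):
    """Decode one instruction at ip and return (length, next_ip). Handles 3 forms only."""
    opcode = code[ip]
    if opcode == 0xEB:                             # jmp short rel8
        rel = int.from_bytes(code[ip + 1:ip + 2], "little", signed=True)
        return 2, (ip + 2 + rel) & 0xFFFF
    if opcode == 0xFF and code[ip + 1] == 0x26:    # jmp word ptr [disp16]
        addr = int.from_bytes(code[ip + 2:ip + 4], "little")
        # The jump target is the little-endian word stored at that address.
        return 4, int.from_bytes(code[addr:addr + 2], "little")
    if opcode == 0x05:                             # add ax, imm16
        return 3, ip + 3
    raise ValueError(f"opcode {opcode:02x} is not handled in this sketch")

ip = 0
executed = []                                      # (first byte, last byte) of each instruction
for _ in range(3):
    length, next_ip = step(code, ip)
    executed.append((ip, ip + length - 1))
    ip = next_ip

for first_byte, last_byte in executed:
    print(f"{first_byte:04x}-{last_byte:04x}")
```

The recorded ranges overlap each other, which is why no single top-to-bottom listing can show all three executed instructions at once.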

In some newer RISC architectures the length of instructions is fixed, they have to sit at aligned memory addresses, and it isn't possible to jump to arbitrary byte offsets, so the job of a debugger is much easier than in the case of x86.
