user1466594

Reputation: 203

disassembler showing different instructions

I was just reading about the different algorithms disassemblers use to identify the bytes of a binary as assembly instructions. I opened a program in several disassemblers; some showed specific portions of the program as code, while others showed the same portions as data. So my question is: if disassemblers get confused about whether bytes are instructions or data, how does the processor know exactly what to do with them?

I hope my question is clear. Thanks in advance.

Upvotes: 3

Views: 546

Answers (2)

Alexey Frunze

Reputation: 62068

The processor doesn't know if whatever it's asked to execute is code or data. It can be either or both at the same time. The CPU will attempt to execute whatever it's given.

If it fails to execute an instruction, it can raise an exception such as "invalid instruction encountered", "the memory referenced by the instruction isn't accessible", "division by zero" or "insufficient privileges", which the OS will (hopefully) handle. The OS will either fix the problem if it knows how (virtual memory is usually based on this mechanism), let the application handle the event, or terminate the application.

There are different kinds of disassemblers. Some are "dumb" in that they don't try to make much or any sense of the executable file format; they just try to disassemble whatever they are given. Others disassemble only the portions of the file that are marked as code, start disassembling from the entry point (every executable has a location where the OS/CPU should start executing it) and use various heuristics to produce a sensible disassembly.

However, disassembling can hardly ever be done perfectly. The main obstacle to correct disassembly is that disassemblers don't know what a piece of code will do and what it won't do.

For example, code can be written such that it calculates the address to jump to or call at run time. The disassembler won't be able to calculate such an address because, well, it doesn't execute, emulate or interpret the code. So the disassembler may be unable to figure out the next location to disassemble from.

There are also CPUs that have variable-length instructions. This makes it possible for code to jump into the middle of an instruction. How should the disassembler disassemble that kind of code?

Another aggravating practice is code manipulation. Code can change itself on the fly as it executes. Code can also generate more code. Code can also be stored as data. How do you disassemble all that?

It is therefore unsurprising that many disassemblers continue to be pretty much dumb. They just can't compete with the brainpower of the programmers who write programs with all sorts of twists.

EDIT:

Also, because of that same variable-length instruction issue, disassembling the same bytes starting at slightly different locations can produce different instructions.

Example:

Consider this byte sequence for the x86 processor in 32-bit mode: 66h,0B8h,90h,90h,90h,90h.

If you start disassembling it at the very first byte you will get:

mov ax,9090h
nop
nop

If you start disassembling at the next byte you will get:

mov eax,90909090h

If you skip yet another byte you will get:

nop
nop
nop
nop
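
The three listings above can be reproduced with a toy decoder. This is only a minimal sketch: it understands just the three encodings used in the example (the 66h operand-size prefix, B8h for mov into eax/ax, and 90h for nop), and the function name decode is made up for illustration. A real x86 decoder is far more involved.

```python
def decode(code, offset=0):
    """Decode a tiny x86 subset (32-bit mode) starting at `offset`.

    Hypothetical helper for illustration; handles only:
      66h      operand-size prefix (32-bit -> 16-bit operands)
      B8h imm  mov eax,imm32 (or mov ax,imm16 after a 66h prefix)
      90h      nop
    Anything else is emitted as a raw data byte.
    """
    out = []
    i = offset
    while i < len(code):
        opsize16 = False
        if code[i] == 0x66:
            opsize16 = True
            i += 1
            if i >= len(code):
                out.append("(truncated)")
                break
        op = code[i]
        if op == 0x90:
            out.append("nop")
            i += 1
        elif op == 0xB8:
            width = 2 if opsize16 else 4
            imm_bytes = code[i + 1:i + 1 + width]
            if len(imm_bytes) < width:
                out.append("(truncated)")
                break
            imm = int.from_bytes(imm_bytes, "little")
            reg = "ax" if opsize16 else "eax"
            out.append(f"mov {reg},{imm:X}h")
            i += 1 + width
        else:
            out.append(f"db {op:02X}h")
            i += 1
    return out


data = bytes([0x66, 0xB8, 0x90, 0x90, 0x90, 0x90])
for start in (0, 1, 2):
    print(f"offset {start}:", decode(data, start))
```

The same six bytes yield three different instruction streams, depending only on where decoding starts.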

Upvotes: 3

user257111

Was just reading about different algos disassemblers use to identify binary as assembly instructions.

By that I'm assuming you mean linear sweep versus recursive traversal - there's an interesting page on that here.
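
To make the difference concrete, here is a rough sketch of both strategies over a made-up byte sequence. It assumes a toy subset of x86 (nop, ret, jmp short and mov eax,imm32), and the names decode_one, linear_sweep and recursive_traversal are hypothetical, not taken from any real disassembler.

```python
def decode_one(code, i):
    """Return (text, length, kind, target) for the instruction at offset i.

    kind is "fall" (execution continues at the next instruction),
    "jump" (unconditional jump to target), "stop" (ret) or
    "bad" (byte not understood by this toy decoder).
    """
    op = code[i]
    if op == 0x90:
        return "nop", 1, "fall", None
    if op == 0xC3:
        return "ret", 1, "stop", None
    if op == 0xEB and i + 1 < len(code):
        rel = int.from_bytes(code[i + 1:i + 2], "little", signed=True)
        target = i + 2 + rel
        return f"jmp short {target:#x}", 2, "jump", target
    if op == 0xB8 and i + 4 < len(code):
        imm = int.from_bytes(code[i + 1:i + 5], "little")
        return f"mov eax,{imm:X}h", 5, "fall", None
    return f"db {op:02X}h", 1, "bad", None


def linear_sweep(code):
    """Decode byte after byte from offset 0, ignoring control flow."""
    out, i = [], 0
    while i < len(code):
        text, length, _, _ = decode_one(code, i)
        out.append((i, text))
        i += length
    return out


def recursive_traversal(code, entry=0):
    """Decode only the bytes reachable by following control flow."""
    seen, work = {}, [entry]
    while work:
        i = work.pop()
        while 0 <= i < len(code) and i not in seen:
            text, length, kind, target = decode_one(code, i)
            seen[i] = text
            if kind == "jump":
                work.append(target)  # continue at the jump target
                break
            if kind != "fall":
                break                # ret or unknown byte ends this path
            i += length
    return sorted(seen.items())


# jmp over one data byte (B8h), then three nops and a ret.
blob = bytes([0xEB, 0x01, 0xB8, 0x90, 0x90, 0x90, 0xC3])
print(linear_sweep(blob))
print(recursive_traversal(blob))
```

On this input the linear sweep treats the skipped byte B8h as the start of a mov and swallows the real code after it, while the recursive traversal follows the jump and never decodes that byte. A jump through a register would give the traversal nothing to follow, which is where heuristics come in.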

So my question is if disassemblers gets confused between opcodes being either instructions or data, how does the processor knows exactly what to do with that opcode?

So, the meat of the issue: it doesn't, and nor does it care. The CPU doesn't know anything about data versus instructions. This is why a buffer overflow can get input on the stack executed, by supplying a string that contains opcodes. This is defeated by marking pages no-execute, in which case, if the instruction pointer (EIP/RIP) ends up there, the processor just raises a fault (moans at the OS, basically).

The challenge with disassembly is that you're trying to work out the structure of the code by doing everything bar actually running it. The only way to know for certain which bytes get executed would be to run the program, for example in an x86 emulator.

Deciding this in general without running the code comes down to the halting problem.

Upvotes: 3
