Ambady Anand S

Reputation: 86

Obtaining source code from the binary

Shouldn't it be possible to obtain a program's source code from its binary? Since compilation is the process of converting a high-level language (source code) into a low-level language (machine code), can't we just reverse the process in order to obtain the source code? If not, why?

Upvotes: 1

Views: 230

Answers (1)

Suppose I give you the number 3, and tell you that I obtained it by summing two numbers. Can you tell me which two numbers I summed? You can't, because addition is a many-to-one function: many different pairs of inputs produce the same output, so the original arguments cannot be recovered from the output alone. I could have obtained it from -55 and 58, even though 1 and 2 works out to the same answer for you.
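To make the analogy concrete, here is a minimal sketch that lists pairs summing to a given target. The function name `pairs_summing_to` and the range bounds are made up for illustration; the point is only that every choice of the first number yields a valid pair, so the output alone cannot tell you which pair was used.

```cpp
#include <utility>
#include <vector>

// Returns one pair (a, b) with a + b == target for every a in [lo, hi].
// The bounds are arbitrary: widening the range only yields more valid pairs.
std::vector<std::pair<int, int>> pairs_summing_to(int target, int lo, int hi) {
    std::vector<std::pair<int, int>> result;
    for (int a = lo; a <= hi; ++a) {
        result.emplace_back(a, target - a);  // b is forced once a is chosen
    }
    return result;
}
```

Calling `pairs_summing_to(3, -55, 2)` returns both (-55, 58) and (2, 1) among many others, and nothing in the number 3 distinguishes one from another.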

Compilation is similar. There are infinitely many C++ programs that will generate any particular machine code output (more or less).
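As a small illustration, here are two differently written functions with the same behavior. The names `sum_a` and `sum_b` are hypothetical, and whether a given compiler emits byte-identical machine code for them depends on the compiler and its optimization settings, so treat this as a sketch of the idea rather than a guarantee.

```cpp
#include <cstddef>

// Version A: index-based for loop.
int sum_a(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Version B: pointer-walking while loop. Different source text,
// same observable behavior for every valid input.
int sum_b(const int* data, std::size_t n) {
    const int* end = data + n;
    int total = 0;
    while (data != end) {
        total += *data++;
    }
    return total;
}
```

Given only the compiled result, there is no way to tell which of these two (or which of countless other variants) the author actually wrote.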

You can certainly reverse the compilation process and produce C or C++ code that would compile, at least, to machine code with the same semantics (meaning), though likely not byte-for-byte identical. Tools that do this (decompilers) exist, with varying degrees of success.

So yes, it is possible, but because a lot of information from the source code necessarily has to be lost, the code you'll get back will not yield much insight into the design of the original source code. For any significantly-sized project, what you'd get back would be code that does the same thing, but is pretty much unreadable for a human. It'd be some very, very obfuscated C/C++.

Why is the information lost? Because the whole big deal with high-level languages is that they should be efficient for humans to deal with. The more high-level and human-comprehensible the source code is, the more it differs from the machine code that results when the compiler is done. As a software designer, your primary objective is to leverage the compiler and other code generating tools to transform high-level ideas and concepts into machine code. The larger the gap between the two, the more information about high-level design is lost.
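Here is a rough sketch of the kind of loss involved. The first function is written the way a human might write it; the second is a hand-written imitation of what decompiled code might look like, not the output of any real tool. The names `Order`, `total_cents`, `sub_401000`, `a1`, `a2` and `v3` are all made up for illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Human-written style: named types, named fields, standard algorithms.
struct Order {
    int quantity;
    int unit_price_cents;
};

long long total_cents(const std::vector<Order>& orders) {
    return std::accumulate(
        orders.begin(), orders.end(), 0LL,
        [](long long sum, const Order& o) {
            return sum + static_cast<long long>(o.quantity) * o.unit_price_cents;
        });
}

// Hypothetical decompiler-style equivalent: the struct has dissolved into a
// flat run of ints, and every meaningful name is gone.
long long sub_401000(const int* a1, std::size_t a2) {
    long long v3 = 0;
    for (std::size_t i = 0; i != a2; ++i) {
        v3 += static_cast<long long>(a1[2 * i]) * a1[2 * i + 1];
    }
    return v3;
}
```

Both compute the same total for the same data, but only the first tells you that the numbers are quantities and prices. That intent lives in names and types, which the machine code has no reason to keep.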

Remember that the only thing that a compiler has to preserve is the semantics (meaning) of your code. As long as it appears that the code is doing what you meant it to do, everything is fine. Modern compilers can, for example, pre-execute parts of your code and only store the results of the operations, when it makes "sense" to do so according to some metric. Say that your entire program reads as follows:

#include <iostream>
#include <cmath>
int main() {
  std::cout << std::sin(1.0) + std::cos(2.0) << std::endl;
  return 0;
}

Assuming a Unix system, the compiler is perfectly within its rights to produce machine code that executes just two syscalls: a write to stdout with a constant buffer, and an exit. Upon decompiling such code, you could get:

#include <unistd.h>
#include <cstdlib>
int main() {
  write(1, "0.425324\n", 9);  // file descriptor 1 is stdout (0 is stdin)
  exit(0);
}

Upvotes: 2
