Safety of MD5 for obfuscating identifiers in program code

Question

I want to irreversibly turn small text strings into fixed-size binary strings.

The input strings are identifiers in an arbitrary computer program which is being transformed in a number of unrelated ways. The goal of this particular part of the operation is to make the program more difficult to reverse engineer by removing meaningful identifiers.

In some cases (serialization for instance) the execution system will still need to form a stable and unique text string corresponding to any given program element, which it will get by transforming the binary string into text in some manner.

Schemes which obfuscate identifiers by reassigning them with sequential identifiers (a..z,aa..az,...) suffer a weakness here in that the assignment of identifiers is not inherently stable. For many programs this would not matter, but in this case, various programs and various program versions may need to inter-operate. Therefore, in such a scheme we would require a persistent database of the previous assignments.

In this application I've decided it's acceptable to trade the astronomically small probability of failure due to collision for greatly increased convenience by instead selecting the output identifiers (where x is the original identifier) according to:

f(x) = ~~hash( x )~~

This would satisfy the functionality requirements but would not be secure no matter what hash function was used; the original identifiers could easily be recovered thanks to their low entropy.

I figure this could be rectified by instead using:

f(x) = hash( secret + x )

"secret" would be a fixed, arbitrarily long random number which we can assume an attacker does not have a priori.

I would like to use MD5 for the hash, since f(x) would then be 16 bytes, which is a fairly convenient size. Other more secure hashes will produce larger identifiers, which I'd like to avoid if they're overkill.

I would have figured even MD5 here would be fine given the large secret, but I've read in numerous places that MD5 is broken.

Would any known attacks on MD5 make it artificially easy^[1] for an attacker to guess x knowing only the value of f(x) and that x is probably a small identifier formed from a fairly small dictionary? (I'm guessing no, but I'm not a security expert.)

I realize that if an attacker does reverse engineer one program, she could much more easily understand other programs using the same "secret" even though she doesn't have "secret"; this is acceptable. The user will decide when to change "secret" based on interoperability vs. security concerns.

^{[1]: I know very well that any program can be reverse engineered even without any programmatic identifiers, because I've done it. However I don't want to make it easier to obtain the identifiers mechanically than it is to manually reverse engineer the program.}

Nick Johnson · Accepted Answer

The scheme you describe will work, for a given value of 'work'.

First off, the known weaknesses in MD5 won't affect your application. That said, however, it's generally a bad idea to be using known weak cryptographic primitives, so it would likely make more sense to use a more robust hash. The output length is irrelevant: you can truncate it to whatever length you wish.

I trust you know, however, that this is obfuscation at best: a determined attacker will be able to reverse engineer the code even without useful identifiers, and that goes not just for the code your tool generates, but for the tool itself. If you're relying on a secret embedded in your tool's code to prevent dictionary attacks, you have to accept that an attacker will be able to extract that secret - at which point they can build a dictionary to produce the original identifiers, and your obfuscation becomes entirely useless.

You say you're concerned about stability of identifiers because of interoperability, but the usual solution here is simple: don't obfuscate identifiers that are part of the public interface of your code, only obfuscate internal details.

Safety of MD5 for obfuscating identifiers in program code

Answers (1)

Related Questions