d33tah

Reputation: 11561

Do x86 obfuscation techniques make instruction timing more predictable?

I heard that determining the exact time it will take to execute an instruction is impossible on x86 because of things like pipelining greatly complicating the process. Is there a way to make those mechanisms less effective in order to be able to predict the instruction run time? Would obfuscation tools like movfuscator help here?

Upvotes: 0

Views: 260

Answers (2)

Peter Cordes

Reputation: 364180

If much slower execution is acceptable, you could try to keep the CPU in self-modifying-code handling mode. I'm not sure how much more predictable that makes timing; it's so slow that nobody bothers to measure its performance characteristics. (The relevant performance counter is MACHINE_CLEARS.SMC, called MACHINE_NUKES.SMC on some older microarchitectures; it should give you an idea of what this does to the out-of-order pipeline.)

Mix in the occasional or dword [rip+32], 0 or something similar. x86 guarantees that self-modifying code is detected after a jump, so doing a no-op OR with zero on the jump target right before jumping might be a good way to make sure you're doing read-modify-write ops on code that is about to run. A minimal sketch of that idea follows.
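Here's a hypothetical sketch (NASM x86-64; the labels and iteration count are made up, and it assumes the code page has been made writable, e.g. with mprotect(PROT_READ|PROT_WRITE|PROT_EXEC)):

; hypothetical: keep triggering SMC machine clears in a loop
spin_smc:
    mov  ecx, 1000000
.top:
    or   dword [rel .target], 0   ; RMW the code bytes at the jump target
                                  ; (OR with 0 leaves them unchanged)
    jmp  .target                  ; detection is guaranteed after the jump
.target:
    dec  ecx
    jnz  .top
    ret

If the machine clears are really firing, a counter like perf stat -e machine_clears.smc (on CPUs that expose that event) should show counts on the order of the iteration count.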


The M/o/Vfuscator could make execution a bit more predictable. You would never have branch mispredictions, because everything is done with mov instructions, using tricks like

mov byte [rdi], 0   ; store 0 to the address in rdi
mov byte [rsi], 1   ; store 1 to the address in rsi
mov al, [rdi]       ; al = 1 if rdi == rsi (the second store
                    ; overwrote the first location), else 0

However, cache misses and front-end bottlenecks will still make execution quite variable.
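To show how such a comparison can replace a branch entirely, here is a hypothetical mov-only select in the same style (the table, values, and labels are made up for illustration; non-PIE absolute addressing for simplicity):

section .data
table:  dd 111, 222           ; [0] = not-equal result, [1] = equal result

section .text
mov  eax, 0                   ; writing eax also zeroes the top of rax
mov  byte [rdi], 0
mov  byte [rsi], 1
mov  al,  [rdi]               ; rax = 1 if rdi == rsi, else 0
mov  eax, [table + rax*4]     ; data-dependent load instead of a branch

Control flow becomes data flow, so timing no longer depends on a branch predictor, only on where the loads hit in the cache hierarchy.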

Other techniques, like jumping into the middle of an instruction (with bytes carefully chosen so they also decode as a different, valid instruction), are a completely different kind of obfuscation. They shouldn't have much impact on performance or on the variability of instruction timing. So the question isn't well-posed: it makes no sense to lump different obfuscation techniques together when asking about this.


It's not that hard to work out theoretical throughput/latency numbers for modern out-of-order machines, but in practice there are always extra factors that slow things down. For example, Intel Skylake can in theory do two loads and one store per clock, but Agner Fog reports that only 40%-60% of that is usually achieved in real code. Even an artificial test that loads/stores the same addresses all the time wouldn't achieve 100% of the theoretical speed, because there are always microarchitectural stumbling blocks.
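As a concrete sketch of that kind of back-of-the-envelope math (a hypothetical loop; rsi and rdi are assumed to point at dword scratch space): two loads, one micro-fused store, and a macro-fused dec/jnz make four fused-domain uops per iteration, so on Skylake the theoretical best case is one iteration per clock.

; 2 loads + 1 store per iteration; ecx = iteration count
.loop:
    mov  eax, [rsi]      ; load 1
    mov  edx, [rsi]      ; load 2 (same address, so it always hits in L1d)
    mov  [rdi], eax      ; store (store-address + store-data micro-fuse)
    dec  ecx
    jnz  .loop           ; dec/jnz macro-fuse into one uop

Even a kernel this simple will typically measure at least slightly above one cycle per iteration in practice.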

See Significant FMA performance anomaly experienced in the Intel Broadwell processor for another example.

However, in some simple loops you can see very consistent performance: in Micro fusion and addressing modes, my test loops gave highly repeatable counts across many executions. I wouldn't bet on repeatability for the first execution of some code, though, even if you could measure it accurately. Things are more predictable once the caches are hot, especially when running a small loop from the uop cache.
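For completeness, here is a hypothetical harness of the kind used for such measurements (NASM x86-64; note that rdtsc counts reference cycles, not core clock cycles, so the result scales with frequency):

; time a hot loop with rdtsc (illustrative only)
    mov    ecx, 1000000
    lfence                 ; keep rdtsc from reordering into the loop
    rdtsc
    mov    r8d, eax        ; save starting timestamp (low 32 bits)
.loop:
    dec    ecx
    jnz    .loop
    lfence
    rdtsc
    sub    eax, r8d        ; eax = elapsed reference cycles

For core-cycle counts you'd read a hardware performance counter (e.g. via perf stat) instead.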

Upvotes: 1

Martin Konecny

Reputation: 59601

"make those mechanisms less effective"

Perhaps I misunderstand you, but pipelining is done for performance reasons, not to make execution times harder to predict.

There are also other factors, such as the cache (does the data being read need to be fetched from memory, or is it already in the L1/L2/L3 cache?).

Regarding being able to predict execution time, I don't believe it's possible for individual instructions, but a real-time OS may interest you; it places an upper bound on execution time at a less granular level:

https://en.wikipedia.org/wiki/Real-time_operating_system

Upvotes: 0
