Reputation: 65006
Dynamically generating code is pretty well-known technique, for example to speed up interpreted languages, domain-specific languages and so on. Whether you want to work low-level (close to 1:1 with assembly), or high-level you can find libraries you help you out.
Note the distinction between self-modifying code and dynamically-generated code. The former means that some code that has executed will be modified in part and then executed again. The latter means that some code, that doesn't exist statically in the process binary on disk, is written to memory and then executed (but will not necessarily ever be modified). The distinction might be important below or simply because people treat self-modifying code as a smell, but dynamically generated code as a great performance trick.
The usual use-case is that the generated code will be executed many times. This means the focus is usually on the efficiency of the generated code, and to a lesser extent the compilation time, and least of all the mechanics of actually writing the code, making it executable and starting execution.
Imagine however, that your use case was generating code that will execute exactly once and that this is straight-line code without loops. The "compilation" process that generates the code is very fast (close to memcpy
speed). In this case, the actual mechanics of writing to the code to memory and executing it once become important for performance.
For example, the total amount of code executed may be 10s of GBs or more. Clearly you don't want to just write all out to a giant buffer without any re-use: this would imply writing 10GB to memory and perhaps also reading 10GB (depending on how generation and execution was interleaved). Instead you'd probably want to use some reasonably sized buffer (say to fit in the L1 or L2 cache): write out a buffer's worth of code, execute it, then overwrite the buffer with the next chunk of code and so on.
The problem is that this seems to raise the spectre of self-modifying code. Although the "overwrite" is complete, you are still overwriting memory that was at one point already executed as instructions. The newly written code has to somehow make its way from the L1D to the L1I, and the associated performance hit is not clear. In particular, there have been reports that simply writing to the code area that has already been executed may suffer penalties of 100s of cycles and that the number of writes may be important.
What's the best way of generating a large about of dynamically generated straight-line code on x86 and executing it?
Upvotes: 1
Views: 100
Reputation: 315
I'm not sure you'll stand to gain much performance by using a gigantic amount of straight-line code instead of much smaller code with loops, since there's significant overhead in continually thrashing the instruction cache for so long, and the overhead of conditional jumps has gotten much better over the past several years. I was dubious when Intel made claims along those lines, and some of their statements were rather hyperbolic, but it has improved a lot in common cases. You can still always avoid call instructions if you need to for simplicity, even for tree recursive functions, by effectively simulating "the stack" with "a stack" (possibly itself on "the stack"), in the worst case.
That leaves two reasons I can think of that you'd want to stick with straight-line code that's only executed once on a modern computer: 1) it's too complicated to figure out how to express what needs to be computed with less code using jumps, or 2) it's an extremely heterogeneous problem being solved that actually needs so much code. #2 is quite uncommon in practice, though possible in a computer theoretical sense; I've just never encountered such a problem. If it's #1 and the issue is just how to efficiently encode the jumps as either short or near jumps, there are ways. (I've also just recently gotten back into x86-64 machine code generation in a side project, after years of not touching my assembler/linker, but it's not ready for use yet.)
Anyway, it's a bit hard to know what the stumbling block is, but I suspect that you'll get much better performance if you can figure out a way to avoid generating gigabytes of code, even if it may seem suboptimal on paper. Either way, it's usually best to try several options and see what works best experimentally if it's unclear. I've sometimes found surprising results that way. Best of luck!
Upvotes: 0
Reputation: 12455
I think you're worried unnecessarily. Your case is more like when a process exits and its pages are reused for another process (with different code loaded into them), which shouldn't cause self-modifying code penalties. It's not the same as when a process writes into its own code pages.
The self-modifying code penalties are significant when the overwritten instructions have been prefetched or decoded to the trace cache. I think it is highly unlikely that any of the generated code will still be in the prefetch queue or trace cache by the time the code generator starts overwriting it with the next bit (unless the code generator is trivial).
Here's my suggestion: Allocate pages up to some fraction of L2 (as suggested by Peter), fill them with code, and execute them. Then map the same pages at the next higher virtual address and fill them with the next part of the code. You'll get the benefit of cache hits for the reads and the writes but I don't think you'll get any self-modifying code penalty. You'll use 10s of GB of virtual address space, but keep using the same physical pages.
Use a serializing operation such as CPUID before each time you start executing the modified instructions, as described in sections 8.1.3 and 11.6 of the Intel SDM.
Upvotes: 1