As far as my (limited) knowledge of optimization goes, I always thought that compiler optimizations don't matter that much. I mean, we can surely gain a fair amount of time (maybe 0...100% as a very rough first approximation) by using registers instead of memory and by unrolling loops, but the fundamental factor that constrains the performance of code is really the choice of algorithms.
I've recently started a pet project: a small interpreted scripting language. It compiles to bytecode and executes it using a virtual machine. For the sake of easy debugging, profiling and testing, I first compiled the code with gcc's (and clang's) -O0 -g flags, then with -O2. I then used time to measure a small program running on the VM, which basically does this (pseudo-code; I'm not showing you the actual syntax until the project goes public):
i = 1
sum = 0
while (i < 10000000) {
    sum = (sum + i * i) mod 1000000
    i++
}
print sum
This translates roughly to the following pseudo-assembly:
load r0, 0          # sum = 0
load r1, 1          # i = 1
load r2, 1000000    # modulus
load r3, 10000000   # loop bound
loop:
    mul r4, r1, r1  # r4 = i * i
    add r0, r0, r4  # sum += i * i
    mod r0, r0, r2  # sum %= 1000000
    inc r1          # i++
    gt  r5, r1, r3  # r5 = (i > 10000000)
    jz  r5, loop    # if r5 == 0, i.e. i <= 10000000, goto loop
ret r0
So basically, this is a tight loop with 10000000 iterations. time reports that it runs in 0.47...0.52 seconds when compiled with -O2, in 1.51...1.74 seconds when compiled with -O0 -g, and in 3.16...3.47 seconds when profiling (-pg) is enabled as well. As you can see, there's roughly a 7-fold difference between the fastest and the slowest execution time.
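To put those numbers in perspective: each iteration executes about 6 VM instructions, so the whole run amounts to something like 60 million instruction dispatches. That works out to roughly 8 ns per dispatched instruction with -O2, around 25...29 ns with -O0 -g, and over 50 ns with profiling enabled (back-of-the-envelope figures, of course).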
This in itself is not that surprising, because I knew that additional debugging information and the lack of small optimizations do indeed make code run slower, but now comes the interesting part. In order to get a more realistic picture of what actually happens, I played the same game with Lua 5.2. Fiddling around with its Makefile, I found that the very same Lua program:
local sum = 0
for i = 1, 10000000 do
    sum = (sum + i * i) % 1000000
end
print(sum)
runs in about 0.8...0.87 seconds when Lua is compiled with -O0 -g -pg, and in 0.39...0.43 seconds when only -O2 is enabled.
So my code seems to get a 7x speed boost when the optimizer performs its tricky fix-ups on it, whereas the Lua reference implementation seems to benefit from the same improvements much less.
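In per-iteration terms: optimised Lua spends roughly 40 ns per iteration versus roughly 50 ns for my optimised VM, yet the unoptimised-and-profiled Lua build still only takes about 85 ns per iteration, while mine balloons to well over 300 ns.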
Now my questions are:
Any idea why this happens? I suspect that the primary reason is that the creators of Lua are smarter than me and know better what the compiler and their VM are doing.
I've also caught myself thinking "well, this must be premature optimization; let me just pass it to the optimizer, it will do the job anyway" several times. Examples include the one-line static functions called in the implementation of almost every VM instruction (I figured they would be inlined whenever necessary anyway), and the use of various complex expressions (sometimes with not-so-easy-to-predict side effects) that could be simplified. Does this attitude of mine count too? (And is it the right attitude, by the way?)
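To make the first of those concrete, here is a simplified sketch of the kind of thing I mean (this is not my actual code; the names and the opcode encoding are made up for illustration): a one-line static helper called from the main dispatch switch, which I've simply been assuming the optimizer will inline for me.
#include <stdint.h>

/* One-line static helper, like the ones I use in the implementation of
 * almost every instruction. At -O2 I expect it to be inlined into the
 * dispatch loop; at -O0 it remains a real call on every instruction. */
static uint64_t vm_mod(uint64_t a, uint64_t b) { return a % b; }

/* Grossly simplified dispatch loop (decoding, register file setup etc. omitted). */
uint64_t vm_run(const uint8_t *code, uint64_t *reg)
{
    const uint8_t *ip = code;
    for (;;) {
        switch (*ip++) {
        case 0x01:                               /* mod rA, rB, rC */
            reg[ip[0]] = vm_mod(reg[ip[1]], reg[ip[2]]);
            ip += 3;
            break;
        case 0x00:                               /* ret rA */
            return reg[ip[0]];
        /* ... the other opcodes ... */
        }
    }
}
At -O0 each such helper presumably costs a real call and return on every one of those ~60 million dispatches, which I suspect accounts for a good part of my 7x gap.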
And anyway, should I worry about this phenomenon at all? My code runs 1.5 times slower than Lua, after all - and that's pretty good (at least it's good enough for me). Should I try to improve the performance of the debug build just because not doing so suggests that I lack intimate knowledge of my own code? Or can I just forget about it completely, as long as the release (optimized) build is fast enough?
(If this question would be a better fit for Prog.SE or CodeReview, let me know and I'll migrate it.)
Upvotes: 5
Views: 626
Reputation: 57892
Firstly, your assertion that the algorithm is more important than the optimizations is generally true, but the same algorithm can be coded to make better or worse use of the platform you are executing on, so optimizations should always be considered... just avoid optimising prematurely, rather than avoiding optimising at all.
Next, remember that debug builds add a lot of overhead - much more than merely disabling optimisation. To see what the optimiser is doing, compare against a release build with optimisations disabled.
The differences between Lua and your language will come down to the efficiency of your bytecode interpreter. A tiny inefficiency there has a massive effect on the execution speed of such a large loop. You might also be able to add optimisations to the interpreter itself.
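For example, assuming your interpreter is written in C and dispatches instructions with a big switch (I'm only guessing from the gcc/clang flags), one common trick is computed-goto (threaded) dispatch, a gcc/clang extension that removes a range check and an extra branch from every dispatch - exactly the kind of tiny per-instruction cost that gets multiplied by tens of millions of executions in your loop. A rough sketch:
/* Computed-goto dispatch: opcodes index a table of label addresses
 * (a gcc/clang extension), so each handler jumps straight to the next. */
#define DISPATCH() goto *handlers[code[ip++]]

long vm_run(const unsigned char *code, long *reg)
{
    static void *handlers[] = { &&op_ret, &&op_inc /* , ... */ };
    long ip = 0;

    DISPATCH();

op_inc:                          /* inc rA: reg[A] += 1 */
    reg[code[ip]] += 1;
    ip += 1;
    DISPATCH();

op_ret:                          /* ret rA: return reg[A] */
    return reg[code[ip]];
}
Whether this actually helps depends on your compiler and CPU, so measure it against an optimised release build rather than assuming.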
Lastly, don't worry about the efficiency of your debug code. When you have a working interpreter, you can profile a release build to look for areas you can improve. Doing this prematurely will not help, as there is no point optimising partially complete code only to find that it needs to be changed to support a new feature. And it's only when you have a working system that you can start writing typical scripts that exercise your interpreter in a realistic manner - you may find that optimising a loop like the example above yields no benefit in day-to-day scripts.
Upvotes: 1