Reputation: 31
Sorting algorithms get faster (in Java)?!
I have implemented some sorting algorithms and a method getNanoTime that returns the time, in nanoseconds, that a sorting algorithm takes.
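In essence, getNanoTime just fills an array with random values and measures System.nanoTime around a single sort, roughly like this (simplified):

import java.util.Arrays;
import java.util.Random;

// Simplified sketch of getNanoTime: fill an array of the given length with
// random values below 'bereich', then time one sort with System.nanoTime.
static long getNanoTime(int length, int bereich) {
    int[] data = new Random().ints(length, 0, bereich).toArray();
    long start = System.nanoTime();
    Arrays.sort(data); // stand-in here for the sorting algorithm under test
    return System.nanoTime() - start;
}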
I wanted to calculate an average value, and I noticed that the average times differ from the time I get when testing the algorithm a single time.
I thought I did something wrong.
But then I found it.
When doing:
int length = 5000;
int bereich = 1000;
long time;
time = Bubblesort.getNanoTime(length, bereich);
System.out.println("BUBBLESORT: " + (1.0 * time / 1_000_000) + " ms");
time = Insertionsort.getNanoTime(length, bereich);
System.out.println("INSERTIONSORT: " + (1.0 * time / 1_000_000) + " ms");
time = Mergesort.getNanoTime(length, bereich);
System.out.println("MERGESORT: " + (1.0 * time / 1_000_000) + " ms");
time = Quicksort.getNanoTime(length, bereich);
System.out.println("QUICKSORT: " + (1.0 * time / 1_000_000) + " ms");
time = Selectionsort.getNanoTime(length, bereich);
System.out.println("SELECTIONSORT: " + (1.0 * time / 1_000_000) + " ms");
I got:
BUBBLESORT: 75.7835 ms
INSERTIONSORT: 27.250875 ms
MERGESORT: 17.450083 ms
QUICKSORT: 7.092709 ms
SELECTIONSORT: 967.638792 ms
But when doing, for example:
for (int i = 0; i < 20; i++) {
System.out.println(1.0 * Bubblesort.getNanoTime(5000, 1000) / 1_000_000);
}
I got:
85.473625 ms
62.681959 ms
68.866542 ms
48.737333 ms
47.402708 ms
47.368708 ms
47.567792 ms
47.018042 ms
45.1795 ms
47.871416 ms
49.570208 ms
50.285875 ms
56.37975 ms
50.342917 ms
50.262833 ms
50.036959 ms
50.286542 ms
51.752708 ms
50.342458 ms
51.511541 ms
The first time is always high (here, the first run takes 85 ms) and the following times are lower. So I think the machine learns and becomes faster.
Could that be? Do you know more about this?
Upvotes: 2
Views: 369
Reputation: 102812
I think the machine learns and becomes faster
Yup.
Look up Just-in-time compilation, and while you're at it, spend a few weeks becoming a rocket scientist so that you can fully understand how CPU caches work.
Or, if you don't want to spend the next 10 weeks studying but you do want a much better idea of how any of this works, read the rest of this answer, then go look at this talk by Douglas Hawkins about JVM performance puzzlers. I bet after watching those 40 minutes you will feel fully enlightened about this conundrum.
Two things are going on here (a JIT warmup effect and a CPU cache effect), possibly more:
The JIT is 'warming up': The way Java works is that it runs your class file code in the dumbest, slowest, stupidest way possible, and on top of that it wastes even more time maintaining a boatload of bookkeeping on the code, such as 'how often is this if block entered vs. how often is it skipped?', for seemingly no good reason. Everything runs slow as molasses. Intentionally so, really.
But then... because of all that bookkeeping the JVM at some point goes: Huh. Literally (and I'm not overstating the case here, this is very common!) 99% of the CPU's time is spent on this 0.1% of the entire code base.
It then takes some time to analyse the daylights out of that 0.1%, creating a very finely tuned machine code version that is exactly right for the actual CPU you're running on. This takes a lot of time, uses all that bookkeeping (which is, after all, not so pointless!) to do things such as reordering code so that the most often taken 'branch' of an if/else block is the one that runs without a jump (jumps are slow due to pipeline resets), and even turns currently-observed-truths-which-may-not-hold-later into assumptions. That is, the code is 'compiled' into machine code that will straight up not work if these assumptions (which, according to all that bookkeeping, have so far always held) turn out to be false; hooks are then added throughout the VM so that if any code ever breaks such an assumption, the finely crafted machine code is flagged as invalid and is no longer used.
For example, if you have a class that is not final, it can be extended. And Java is always dynamic dispatch: if you invoke foo.hello(), you get the hello() implementation of the actual type of the object the foo variable points at, not of the type of the foo expression itself. Class loading in Java is inherently dynamic (classes can be loaded at any time; a JVM never knows that it is 'done loading classes'), so a lookup table must be involved. A pricey annoyance! But the HotSpot optimizer gets around it and eliminates the table: if it figures out that a non-final class is nevertheless not currently extended, or that no subclass overrides the implementation in question, it can omit the lookup table and link the method call straight to the implementation. It also adds a hook to the classloader so that if any class is ever loaded that extends the targeted class (and changes the implementation of the relevant method), the machine code that jumps straight to that implementation is invalidated. That method's performance takes a nosedive again, as the JVM goes back to the slow-as-molasses way. If it's still being run a lot, no worries: HotSpot will do another pass, this time taking into account that there are multiple implementations.
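To make that concrete, here is a tiny sketch (class and method names made up for illustration) of the kind of call site the optimizer can devirtualize for as long as only one hello() implementation has ever been loaded:

// Hypothetical example: Greeter is not final, so foo.hello() is a
// dynamic-dispatch call in the bytecode. As long as no subclass of Greeter
// that overrides hello() has been loaded, HotSpot may compile greetALot()
// so that the call jumps straight to Greeter.hello() (or inlines it).
// Loading such a subclass later invalidates that machine code.
class Greeter {
    String hello() {
        return "hello";
    }
}

class Demo {
    static int greetALot(Greeter foo) {
        int total = 0;
        for (int i = 0; i < 1_000_000; i++) {
            total += foo.hello().length(); // dynamic dispatch, but devirtualizable
        }
        return total;
    }
}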
Once this machine code is available, all calls to this method are redirected to that finely tuned machine code, which is incredibly fast; usually faster than -O3-compiled C code, in fact, because the JVM gets the benefit of taking runtime behaviour into account, which a C compiler can never do.
Nevertheless, usually only about 1% of all the code in a VM actually runs in this mode. The simple truth is that almost all code of any app is irrelevant performance-wise. It doesn't do anything complicated, isn't running during 'hot' moments, and it just. doesn't. matter. The smartypants analysis is left solely for the 1% or so that is actually going to run lots.
And that's what likely explains a large chunk of that difference: A bunch of your sort algorithm's loops are running in dogslow (not-hotspotted) mode, whereas once the hotspotting is done during your first sort run, the next sort runs get the benefit of the hotspotted code from the start.
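If you just want to see the warmup effect with your current code (a rough illustration only, not a proper benchmark; see the end of this answer for the real tool), you can throw away a bunch of dry runs before looking at a timing:

// Rough illustration only: discard a number of dry runs so that HotSpot has
// (most likely) compiled the hot loops before the timing we actually look at.
for (int i = 0; i < 50; i++) {
    Bubblesort.getNanoTime(5000, 1000); // warmup, result thrown away
}
long time = Bubblesort.getNanoTime(5000, 1000);
System.out.println("after warmup: " + (1.0 * time / 1_000_000) + " ms");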
Secondly, data needs to be in a CPU cache for the CPU to work on it really quickly. Repeating a calculation often means the first run pays the penalty of the CPU having to pull a bunch of cache lines in, whereas all later runs don't pay that price, because the relevant parts of memory are already in the caches.
The conclusion is simple: micro-benchmarking like this is incredibly complicated. You can't just time it with System.nanoTime; JVMs are incredibly complicated, CPUs are incredibly complicated, and even the seasoned engineers who spend their days writing the JVM itself are on record as stating that they are far too stupid to guess performance like this. So you stand absolutely no chance whatsoever.
The solution is, fortunately, also very simple. Those same JVM engineers want to know how fast stuff runs, so they wrote an entire framework that lets you micro-benchmark: it actively waits out HotSpot warmup, does a bunch of dry runs, and makes sure the optimizer doesn't optimize your entire algorithm away. (That can happen: if you sort a list and then toss the result in the garbage, the optimizer might figure out that the whole sort is best optimized by skipping it entirely, because, hey, if nobody actually cares about the result of the sort, why sort? You need a 'sink' to ensure the optimizer can't conclude it can junk the whole thing because the data is discarded.) That framework is called JMH. Rewrite your benchmark in JMH and let it rip. You'll find that it times consistently, and that those times are generally meaningful (versus what you wrote, which means mostly nothing).
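A minimal sketch of what such a JMH benchmark could look like (names are made up; Bubblesort.sort(int[]) stands in for whatever your actual sorting API is):

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)       // dry runs, so HotSpot gets to compile the hot loops first
@Measurement(iterations = 10) // only these runs are reported
@Fork(1)
@State(Scope.Benchmark)
public class SortBenchmark {

    int[] data;

    @Setup(Level.Invocation)
    public void setup() {
        // fresh random input for every invocation, so we never sort an already-sorted array
        data = new Random().ints(5000, 0, 1000).toArray();
    }

    @Benchmark
    public void bubblesort(Blackhole sink) {
        Bubblesort.sort(data);  // hypothetical static sort method; adjust to your own API
        sink.consume(data);     // the 'sink': stops the JIT from eliding the whole sort
    }
}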
Upvotes: 8