Reputation: 319
I have this piece of code doing Range Minimum Query. With t = 100000, and i and j varying on every input line, its execution time on Java 8u60 is about 12 seconds.
for (int a0 = 0; a0 < t; a0++) {
    String line = reader.readLine();
    String[] ls = line.split(" ");
    int i = Integer.parseInt(ls[0]);
    int j = Integer.parseInt(ls[1]);

    int min = width[i];
    for (int k = i + 1; k <= j; k++) {
        if (min > width[k]) {
            min = width[k];
        }
    }

    writer.write(min + "");
    writer.newLine();
}
When I extract a new method to find the minimum value, the code runs about 4 times faster (about 2.5 seconds).
for (int a0 = 0; a0 < t; a0++) {
    String line = reader.readLine();
    String[] ls = line.split(" ");
    int i = Integer.parseInt(ls[0]);
    int j = Integer.parseInt(ls[1]);

    int min = getMin(i, j);
    writer.write(min + "");
    writer.newLine();
}

private int getMin(int i, int j) {
    int min = width[i];
    for (int k = i + 1; k <= j; k++) {
        if (min > width[k]) {
            min = width[k];
        }
    }
    return min;
}
I always thought that method calls were slow, but this example shows the opposite. Java 6 also demonstrates this, though the execution times are much slower in both cases (17 seconds and 10 seconds). Can someone provide some insight into this?
Upvotes: 12
Views: 3898
Reputation: 3618
The question does not provide a reproducible test case. So I built one that focuses solely on computing the ranged minima:
git clone git@github.com:lemire/microbenchmarks.git
cd microbenchmarks
mvn clean install
java -cp target/microbenchmarks-0.0.1-jar-with-dependencies.jar me.lemire.microbenchmarks.rangequery.RangeMinimum
My results are (on a server configured for testing, Java 8):
Benchmark                         Mode  Cnt  Score   Error  Units
m.l.m.r.RangeMinimum.embeddedmin  avgt    5  0.053 ± 0.009  ms/op
m.l.m.r.RangeMinimum.fncmin       avgt    5  0.052 ± 0.003  ms/op
So in my test case there is no significant performance difference between having one large loop with a subloop and having one loop containing a function call. Note that the benchmark calls the functions multiple times so that the JIT compiler can do its work.
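For reference, the two variants look roughly like this as a JMH benchmark. This is a sketch, not the actual source of the RangeMinimum class above; the class name and the fixed blocks of 100 elements are my own simplification:

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class RangeMinimumSketch {
    int[] width = new Random(0).ints(100_000).toArray();

    // One large loop with the minimum computed in a subloop.
    @Benchmark
    public long embeddedmin() {
        long sum = 0;
        for (int i = 0; i + 100 <= width.length; i += 100) {
            int min = width[i];
            for (int k = i + 1; k < i + 100; k++) {
                if (min > width[k]) {
                    min = width[k];
                }
            }
            sum += min;  // consume the result so it is not dead code
        }
        return sum;
    }

    // The same work, but the subloop sits behind a method call.
    @Benchmark
    public long fncmin() {
        long sum = 0;
        for (int i = 0; i + 100 <= width.length; i += 100) {
            sum += getMin(i, i + 100);
        }
        return sum;
    }

    private int getMin(int from, int to) {
        int min = width[from];
        for (int k = from + 1; k < to; k++) {
            if (min > width[k]) {
                min = width[k];
            }
        }
        return min;
    }
}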
Upvotes: 1
Reputation: 98304
TL;DR: The JIT compiler has more opportunities to optimize the inner loop in the second case, because on-stack replacement happens at a different point.
I've managed to reproduce the problem with the reduced test case.
No I/O or string operations involved, just two nested loops with array access.
public class NestedLoop {
    private static final int ARRAY_SIZE = 5000;
    private static final int ITERATIONS = 1000000;

    private int[] width = new java.util.Random(0).ints(ARRAY_SIZE).toArray();

    public long inline() {
        long sum = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            int min = width[0];
            for (int k = 1; k < ARRAY_SIZE; k++) {
                if (min > width[k]) {
                    min = width[k];
                }
            }
            sum += min;
        }
        return sum;
    }

    public long methodCall() {
        long sum = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            int min = getMin();
            sum += min;
        }
        return sum;
    }

    private int getMin() {
        int min = width[0];
        for (int k = 1; k < ARRAY_SIZE; k++) {
            if (min > width[k]) {
                min = width[k];
            }
        }
        return min;
    }

    public static void main(String[] args) {
        long startTime = System.nanoTime();
        long sum = new NestedLoop().inline();  // or .methodCall();
        long endTime = System.nanoTime();
        long ms = (endTime - startTime) / 1000000;
        System.out.println("sum = " + sum + ", time = " + ms + " ms");
    }
}
The inline variant indeed runs 3-4 times slower than methodCall.
I've used the following JVM options to confirm that both benchmarks are compiled at the highest tier and that OSR (on-stack replacement) successfully occurs in both cases:
-XX:-TieredCompilation
-XX:CompileOnly=NestedLoop
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintCompilation
-XX:+TraceNMethodInstalls
251 46 % NestedLoop::inline @ 21 (70 bytes)
Installing osr method (4) NestedLoop.inline()J @ 21
271 46 NestedLoop::getMin (41 bytes)
Installing method (4) NestedLoop.getMin()I
274 47 % NestedLoop::getMin @ 9 (41 bytes)
Installing osr method (4) NestedLoop.getMin()I @ 9
314 48 % NestedLoop::methodCall @ 4 (30 bytes)
Installing osr method (4) NestedLoop.methodCall()J @ 4
This means the JIT does its job, but the generated code must be different. Let's analyze it with -XX:+PrintAssembly. Here is the inner loop of the inline variant:
0x0000000002df4dd0: inc %ebp ; OopMap{r11=Derived_oop_rbx rbx=Oop off=114}
;*goto
; - NestedLoop::inline@53 (line 12)
0x0000000002df4dd2: test %eax,-0x1d64dd8(%rip) # 0x0000000001090000
;*iload
; - NestedLoop::inline@21 (line 12)
; {poll}
0x0000000002df4dd8: cmp $0x1388,%ebp
0x0000000002df4dde: jge 0x0000000002df4dfd ;*if_icmpge
; - NestedLoop::inline@26 (line 12)
0x0000000002df4de0: test %rbx,%rbx
0x0000000002df4de3: je 0x0000000002df4e4c
0x0000000002df4de5: mov (%r11),%r10d ;*getfield width
; - NestedLoop::inline@32 (line 13)
0x0000000002df4de8: mov 0xc(%r10),%r9d ; implicit exception
0x0000000002df4dec: cmp %r9d,%ebp
0x0000000002df4def: jae 0x0000000002df4e59
0x0000000002df4df1: mov 0x10(%r10,%rbp,4),%r8d ;*iaload
; - NestedLoop::inline@37 (line 13)
0x0000000002df4df6: cmp %r8d,%r13d
0x0000000002df4df9: jg 0x0000000002df4dc6 ;*if_icmple
; - NestedLoop::inline@38 (line 13)
0x0000000002df4dfb: jmp 0x0000000002df4dd0
And here is the inner loop of the methodCall variant (note that it is unrolled 8 times):
0x0000000002da2af0: add $0x8,%edx ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2af3: cmp $0x1381,%edx
0x0000000002da2af9: jge 0x0000000002da2b70 ;*iload_1
; - NestedLoop::getMin@16 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2afb: mov 0x10(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b00: cmp %r11d,%ecx
0x0000000002da2b03: jg 0x0000000002da2b6b ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b05: mov 0x14(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b0a: cmp %r11d,%ecx
0x0000000002da2b0d: jg 0x0000000002da2b5c ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b0f: mov 0x18(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b14: cmp %r11d,%ecx
0x0000000002da2b17: jg 0x0000000002da2b4d ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b19: mov 0x1c(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b1e: cmp %r11d,%ecx
0x0000000002da2b21: jg 0x0000000002da2b66 ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b23: mov 0x20(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b28: cmp %r11d,%ecx
0x0000000002da2b2b: jg 0x0000000002da2b61 ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b2d: mov 0x24(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b32: cmp %r11d,%ecx
0x0000000002da2b35: jg 0x0000000002da2b52 ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b37: mov 0x28(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b3c: cmp %r11d,%ecx
0x0000000002da2b3f: jg 0x0000000002da2b57 ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b41: mov 0x2c(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b46: cmp %r11d,%ecx
0x0000000002da2b49: jg 0x0000000002da2ae6 ;*if_icmple
; - NestedLoop::getMin@23 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b4b: jmp 0x0000000002da2af0
The compiled code is completely different; methodCall is optimized much better:
- the width field is cached in a register; in contrast, the inline variant re-reads the width array from memory on every iteration;
- OSR-compiled methods are not always optimized very well, because they have to maintain the state of the interpreted stack frame at the transition point. Here is another example of the same problem.
On-stack replacement usually occurs on backward branches, i.e. at the bottom of a loop. The inline method has two nested loops, and OSR happens inside the inner loop, while methodCall has just one outer loop. An OSR transition in the outer loop is more favourable, because the JIT compiler has more freedom to optimize the inner loop. And this is exactly what happens in your case.
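As a side note, one of the differences above can be reproduced by hand: hoisting the width field into a local variable removes the per-iteration field reload from the inline variant. A sketch, assuming the NestedLoop class from above; this addresses only the field reload, not the OSR placement, so it may not close the whole gap:

public long inlineLocal() {
    long sum = 0;
    int[] w = width;  // hoist the field into a local: the inner loop
                      // now loads only array elements, not the field
    for (int i = 0; i < ITERATIONS; i++) {
        int min = w[0];
        for (int k = 1; k < ARRAY_SIZE; k++) {
            if (min > w[k]) {
                min = w[k];
            }
        }
        sum += min;
    }
    return sum;
}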
Upvotes: 16
Reputation: 73
I believe Java is doing some optimization/memoization. It could cache the results of functions if the functions/methods are pure. I believe your time has decreased, but your space/memory use will increase (due to memoization), and vice versa.
Upvotes: -2
Reputation: 171
One advantage of Java over ahead-of-time-compiled languages such as C++ is that the JIT (just-in-time) compiler can perform optimizations on the bytecode at the time the code is executed. In addition, the Java compiler itself is prepared to do several optimizations already in the build phases. These techniques allow, for example, a method call to be turned into inline code within a loop, thus avoiding the repeated method-lookup overhead of polymorphic calls. Running a method call inline means that the method's code runs as if it had been written directly at the place where the method is called: there is no lookup overhead for the method to be executed, no memory allocation, no new context variables. Basically, within your for loop, processing is lost when you allocate new variables in memory (int k, for example); when you move this loop into a method, you end up decreasing the overhead because the variables will already be allocated for that execution.
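To make the inlining idea concrete, here is a minimal, self-contained sketch; the class and method names are illustrative, and the "after inlining" form in the comment is conceptual, not actual JIT output:

public class InlineSketch {
    static int square(int x) {
        return x * x;
    }

    static int caller(int x) {
        // After JIT inlining, this behaves as if it were written
        // as: return x * x;  -- no call overhead, no new frame.
        return square(x);
    }

    public static void main(String[] args) {
        System.out.println(caller(7));  // prints 49
    }
}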
Upvotes: 1
Reputation: 5496
Without doing an actual analysis, getMin is most likely being JIT-compiled because you extracted it into a method that is called many times. If you are using the HotSpot JVM, this happens by default after 10,000 method invocations.
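A quick way to watch this happen is a toy program like the one below, run with the -XX:+PrintCompilation flag mentioned earlier in this thread; the class and method names here are illustrative. Once the invocation count passes the threshold, an entry for getMin appears in the compilation log:

public class Warmup {
    private static int[] width = new java.util.Random(0).ints(1000).toArray();

    private static int getMin(int i, int j) {
        int min = width[i];
        for (int k = i + 1; k <= j; k++) {
            if (min > width[k]) {
                min = width[k];
            }
        }
        return min;
    }

    public static void main(String[] args) {
        long sum = 0;
        // Call the method well past the default 10,000-invocation
        // threshold so that its compilation shows up in the log.
        for (int n = 0; n < 20_000; n++) {
            sum += getMin(0, width.length - 1);
        }
        System.out.println(sum);  // keep the result live
    }
}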
You can always inspect the final code used by your application by using the right flags and JVM builds. Check out the question/answer How to see JIT-compiled code in JVM for an example of how.
Upvotes: 4