Reputation:
int steps = 256 * 1024 * 1024;
int[] a = new int[2];
// Loop 1
for (int i=0; i<steps; i++) { a[0]++; a[0]++; }
// Loop 2
for (int i=0; i<steps; i++) { a[0]++; a[1]++; }
Can someone explain why the second loop is 20x times slower than the first (19 ms vs 232 ms)?
That is how I'm timing it:
long start_time = System.currentTimeMillis();
// Loop
long end_time = System.currentTimeMillis();
System.out.println(end_time - start_time);
Upvotes: 5
Views: 274
Reputation: 33519
The JIT compiler is converting the first loop into a multiply, but not optimizing the second loop very much.
The bytecode for both loops is basically the same (you can view this with javap -c test.class
).
In Java, the bytecode is converted into x86 instructions by a JIT compiler which has the ability to perform additional optimizations.
You can actually view the assembly produced by the JIT via java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly ...
if you have the hsdis plugin.
I changed the value you add to each element to 0xbad to make it easier to spot the relevant code and changed the loop counter to long
.
The first loop produces:
mov r11d,dword ptr [r13+10h] Load from memory a[0]
...
add r11d,175ah Add 2 * 0xbad to the value
mov dword ptr [r13+10h],r11d Store to memory a[0]
The second loop produces:
mov ebx,dword ptr [rax+10h] Load from memory a[0]
add ebx,0badh Add 0xbad
...
mov dword ptr [rax+10h],ebx Store to memory
...
mov ebx,dword ptr [rax+14h] Load from memory a[1]
add ebx,0badh Add 0xbad
...
mov dword ptr [rax+14h],ebx Store to memory a[1]
so you can see that the compiler is already able to optimize the first loop into fewer instructions.
In particular, it has spotted that the two additions to the same array element can be coalesced into a single addition of twice the value.
When I changed the loop counter back to int
I noticed that the compiler manages to do even better with your first loop:
mov r10d,dword ptr [r14+10h]
imul ecx,r13d,175ah This line converts lots of adds of 0xbad into a single multiply
mov r11d,r10d
sub r11d,ecx
add r10d,175ah
mov dword ptr [r14+10h],r10d
In this case it has spotted that it can actually implement several iterations of your loop in a single pass by using a multiplication! This explains how the first loop can be an order of magnitude faster than the second.
Upvotes: 10