Reputation: 289
I am testing the throughput of the divs10 function from the Hacker's Delight book, coded in Java, on a 64-bit JDK 1.7 (update 21) and an Intel Core i7 box (GenuineIntel, family 6, model 26: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz).
I am wondering why the default Java / operator is faster than the divs10 function from the book: to my surprise, the results show divs10 is 3 times slower than the / operator.
Can anybody tell me whether there is some fancy intrinsic the JVM could be using?
Source code below.
public class div10 {

    // Branch-free signed divide-by-10 from Hacker's Delight.
    public static final int divs10(int n) {
        int q, r;
        n = n + (n >> 31 & 9);          // add 9 when n is negative so the shifts truncate toward zero like /
        q = (n >> 1) + (n >> 2);        // q ~= n * 0.75
        q += q >> 4;
        q += q >> 8;
        q += q >> 16;                   // q ~= n * 0.8, refined step by step
        q = q >> 3;                     // q ~= n / 10, possibly one too small
        r = n - ((q << 3) + (q << 1));  // r = n - q * 10
        return q + ((r + 6) >> 4);      // bump q by one when r >= 10
    }

    public static void main(String[] args) {
        /*
        // Exhaustive correctness check over the whole int range (disabled):
        long count = 0;
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
            if ((i / 10) != divs10(i)) {
                System.err.println("error dividing :" + i);
            }
            if ((i & 0xFFFFFFF) == 0) {
                System.out.println("Finished:" + Long.toHexString(count) + ":" + count + ":" + i);
            }
            count++;
        }
        System.out.println("Success:" + count);
        */

        // Warm up both code paths so the JIT compiles them.
        long start = System.nanoTime();
        long count = 0L;
        int iter = 100_000;
        for (int j = 0; j < 10; j++) {
            for (int i = -iter; i < iter; i++) {
                count += (i / 10);
            }
        }
        for (int j = 0; j < 10; j++) {
            for (int i = -iter; i < iter; i++) {
                count += divs10(i);
            }
        }
        System.out.println(count + " warm up done ");

        // Benchmark the built-in / operator over the whole int range.
        start = System.nanoTime();
        count = 0L;
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
            count += i / 10;
        }
        System.out.println(count + ", took:" + (System.nanoTime() - start) / 1_000_000L + " ms, "
                + (System.nanoTime() - start) / ((long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE) + " ns per op");

        // Benchmark divs10 over the same range.
        start = System.nanoTime();
        count = 0L;
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
            count += divs10(i);
        }
        System.out.println(count + ", took:" + (System.nanoTime() - start) / 1_000_000L + " ms, "
                + (System.nanoTime() - start) / ((long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE) + " ns per op");
    }
}
Upvotes: 7
Views: 1487
Reputation: 43043
When you write:
count += (i/10);
The Java JIT is able to optimize division by a constant using some nice tricks, like "reciprocal multiplication" - see this article for the mathematical background - or this one.
So it can replace the division with a single multiplication plus a shift, which is in all cases much faster than the convoluted divs10() function. That function may have been fast on older CPUs, but it is not on modern ones, where an integer multiplication takes only 1 to 1.5 cycles! The "Hacker's Delight" trick played nicely on a plain 386, but not on modern CPUs.
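For illustration, here is a sketch of that transformation for a signed divide by 10 (this is not the JIT's literal output - the method name and the exact constant are just one valid choice - but it is the same multiply-and-shift idea):

// Reciprocal multiplication: n / 10 as a widening 32-to-64-bit multiply plus shifts.
// 0x66666667 = ceil(2^34 / 10); a compiler may pick different constants.
static int div10ByMultiply(int n) {
    int q = (int) (((long) n * 0x66666667L) >> 34); // keep the high bits of the product
    return q - (n >> 31);                           // fix up rounding for negative n
}

A multiply, a shift, and a subtraction replace the long chain of operations in divs10().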
Furthermore, the JIT may be able to unroll the loop for even faster processing, since a reduction like count += ... is easily parallelized.
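Hand-written, such an unrolling might look like the sketch below (illustrative only - the JIT decides this on its own; four independent accumulators keep the additions from serializing):

// 4-way unroll with independent accumulators; tail iterations omitted for brevity.
static long sumOfTenths() {
    long c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 3; i += 4) {
        c0 += (i)     / 10;
        c1 += (i + 1) / 10;   // the four chains have no data dependencies
        c2 += (i + 2) / 10;   // between them, so the CPU can overlap them
        c3 += (i + 3) / 10;
    }
    return c0 + c1 + c2 + c3;
}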
Conclusion: when you work with a high-level language like Java, running on a VM with a JIT, do not expect to predict how the code will be compiled. Even a modern C compiler is able to optimize count += i/10 by using the reciprocal-multiplication trick, or to unroll the loop (or even make it multi-threaded).
Let your (JIT) compiler do its work, and if the performance is not enough for you, optimize your data structures and algorithms rather than trying to "tweak" the compiler. If you want documentation and source code for low-level performance tricks at the CPU level, take a look at this reference material. But be aware that you won't be able to use those tricks from Java (adding asm requires a C/C++ or Delphi compiler). Last but not least, remember that premature optimization is the root of all evil (Knuth).
Upvotes: 1
Reputation: 68887
Update: When looking at the newer Ivy Bridge table (p. 174), I saw that all the latencies were 1. This means that my previous explanation was not correct.
A rough count of the instructions executed in the divs10 method comes to 27 (without the overhead of the function call). Most of the operations require the previous one to be completed before the next one can start, so you have to consider the latency of the instructions. According to the Ivy Bridge instruction table, all of the instructions involved have a latency of 1 clock cycle, which gives a total of 27 clock cycles.
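To make that dependency chain concrete, here is the divs10 body again, annotated with what each step waits on (the annotations are approximate; the exact instruction count depends on register allocation):

// divs10 annotated: latencies add up because each step consumes
// the result of the step before it.
static int divs10(int n) {
    n = n + (n >> 31 & 9);              // depends only on the input
    int q = (n >> 1) + (n >> 2);        // waits for the adjusted n
    q += q >> 4;                        // waits for the previous q
    q += q >> 8;                        // waits for the previous q
    q += q >> 16;                       // waits for the previous q
    q = q >> 3;                         // waits for the previous q
    int r = n - ((q << 3) + (q << 1));  // waits for the final q
    return q + ((r + 6) >> 4);          // waits for r
}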
Compare this with a single IDIV (8-bit) instruction: in the table, I can find that it takes about 20 clock cycles of latency.
A raw estimate would give 27 cycles / 20 cycles = 1.35 times slower, which does not agree with the results you measured. I'm not an expert at this, but I think it is due to the fact that divisions with the IDIV instruction can run in parallel, because they are independent. The IDIV instruction has a throughput of 8 clock cycles, which allows the CPU to overlap independent divisions and complete about 4 of them per 52 cycles (an estimation: the 20-cycle latency of the first division plus 8 throughput cycles for each of the four, 20 + 4 × 8 = 52).
So, to perform 4 divisions with the bit-shifting algorithm, you would need 4 × 27 = 108 cycles, whereas the IDIV approach would need approximately 52 clock cycles. This gives 108 / 52 ≈ 2.1 times slower.
This gets close to the ratio you measured. I guess the remaining extra time goes to the overhead of the function call; maybe CPUs also do greater optimizations than my estimation accounts for.
Upvotes: 5