Rinke

Reputation: 6332

(How) does the Java JIT compiler optimize my code?

I'm writing fairly low-level code that must be highly optimized for speed. Every CPU cycle counts. Since the code is in Java I can't write as low-level as I could in C, for example, but I want to get everything out of the VM that I can.

I'm processing an array of bytes. There are two parts of my code that I'm primarily interested in at the moment. The first one is:

int key =  (data[i]     & 0xff)
        | ((data[i + 1] & 0xff) <<  8)
        | ((data[i + 2] & 0xff) << 16)
        | ((data[i + 3] & 0xff) << 24);

and the second one is:

key = (key << 15) | (key >>> 17);

Judging from the performance I'm guessing that these statements aren't optimized the way I expect. The second statement is basically a ROTL 15, key. The first statement loads 4 bytes into an int. The 0xff masks are there only to compensate for the added sign bits resulting from the implicit cast to int if the accessed byte happens to be negative. This should be easy to translate to efficient machine code, but to my surprise performance goes up if I remove the masks. (Which of course breaks my code, but I was interested to see what happens.)
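For comparison, here is the same thing written with plain JDK calls (ByteBuffer and Integer.rotateLeft are standard APIs; I'm not claiming this is faster, just another way to express the same operations):

// import java.nio.ByteBuffer; import java.nio.ByteOrder;
ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
int key = buf.getInt(i);           // reads data[i..i+3] as a little-endian int
key = Integer.rotateLeft(key, 15); // same as (key << 15) | (key >>> 17)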

What's going on here? Do the most common Java VMs optimize this code during JIT in the way one would expect a good C++ compiler to optimize the equivalent C++ code? Can I influence this process? Setting -XX:+AggressiveOpts seems to make no difference.

(CPU: x64, Platform: Linux/HotSpot)

Upvotes: 17

Views: 3529

Answers (4)

solendil

Reputation: 8468

I've done a lot of performance code in Java; I've even coded directly in bytecode, enough to be sure of a couple of things: the JIT is a black box with obscure behaviours, the JIT and compilers are incredibly efficient, and the simplest code usually yields the best performance.

This is normal when you think about the GOAL of the JIT: to extract the best possible performance from any Java code. Add to that the fact that Java is quite a simple and plain language, and it follows that plain, simple code is what gets optimized, while any further trickery will generally do no good.

Of course, there are some common pitfalls and gotchas you ought to know about, but I see none in your code samples. Were I to optimize your code, I would go straight to the higher level: the algorithm. What is the complexity of your code? Can some data be cached? Which APIs are used? Etc. There's a seemingly endless pit of performance to be extracted from algorithmic tricks alone.

And if even that is not sufficient, if the language is not fast enough, if your machine is not fast enough, if your algorithm cannot be made any faster, the answer won't lie in clock cycles: you might squeeze out 20% more efficiency, but 20% will never be enough when your data grows. To make sure you never hit a performance wall again, the ultimate answer lies in scalability: make your algorithm and your data endlessly distributable so you can simply throw more workers at the task.

Upvotes: 5

Ingo

Reputation: 36329

You don't need the (& 0xff) before shifting 24 bits to the left: the sign-extension bits end up above bit 31 and are discarded anyway, so the result is the same with or without the mask.
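A quick sketch to see why, using a negative byte where the implicit sign extension kicks in:

byte b = (byte) 0x80;             // negative byte, sign bit set
int masked   = (b & 0xff) << 24;  // 0x80000000
int unmasked = b << 24;           // also 0x80000000: the sign-extension bits are shifted out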

Upvotes: 0

jmg

Reputation: 7404

I do agree with solendil, but if you want to dig deeper into the low-level details, try getting the code produced by the JIT as described here.
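For HotSpot that typically means running with the diagnostic flags below (MyBenchmark stands in for whatever class you're actually running; -XX:+PrintAssembly needs the hsdis disassembler library installed, while -XX:+PrintCompilation works out of the box):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyBenchmark
java -XX:+PrintCompilation MyBenchmark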

Upvotes: 3
