Reputation: 189
I've been running some tests to see how inlining function code (explicitly writing the function's algorithm in the calling code itself) affects performance. I wrote a simple byte-array-to-integer conversion, then wrapped it in a function, called it statically from another class, and called it statically from the same class. The code is as follows:
public class FunctionCallSpeed {
    public static final int numIter = 50000000;

    public static void main(String[] args) {
        byte[] n = new byte[4];
        long start;

        System.out.println("Function from Static Class =================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            StaticClass.toInt(n);
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");

        System.out.println("Function from Class ========================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            toInt(n);
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");

        int actual = 0;
        int len = n.length;

        System.out.println("Inline Function ============================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            for (int j = 0; j < len; j++) {
                actual += n[len - 1 - j] << 8 * j;
            }
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
    }

    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}
The results are as follows:
Function from Static Class =================
Elapsed time: 0.096559931s
Function from Class ========================
Elapsed time: 0.015741711s
Inline Function ============================
Elapsed time: 0.837626286s
Is there something weird going on with the bytecode? I've looked at the bytecode myself, but I'm not very familiar with it and can't make heads or tails of it.
EDIT
I added assert statements to verify the outputs and then randomized the bytes being read, and the benchmark now behaves the way I expected it to. Thanks to Tomasz Nurkiewicz, who pointed me to the microbenchmark article. The resulting code is thus:
public class FunctionCallSpeed {
    public static final int numIter = 50000000;

    public static void main(String[] args) {
        byte[] n;
        long start, end;
        int checker, calc;
        end = 0;

        System.out.println("Function from Object =================");
        for (int i = 0; i < numIter; i++) {
            checker = (int)(Math.random() * 65535);
            n = toByte(checker);
            start = System.nanoTime();
            calc = StaticClass.toInt(n);
            end += System.nanoTime() - start;
            assert calc == checker;
        }
        System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");

        end = 0;
        System.out.println("Function from Class ==================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            checker = (int)(Math.random() * 65535);
            n = toByte(checker);
            start = System.nanoTime();
            calc = toInt(n);
            end += System.nanoTime() - start;
            assert calc == checker;
        }
        System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");

        int len = 4;
        end = 0;
        System.out.println("Inline Function ======================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            calc = 0;
            checker = (int)(Math.random() * 65535);
            n = toByte(checker);
            start = System.nanoTime();
            for (int j = 0; j < len; j++) {
                calc += n[len - 1 - j] << 8 * j;
            }
            end += System.nanoTime() - start;
            assert calc == checker;
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
    }

    public static byte[] toByte(int val) {
        byte[] n = new byte[4];
        for (int i = 0; i < 4; i++) {
            n[i] = (byte)((val >> 8 * i) & 0xFF);
        }
        return n;
    }

    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}
Results:
Function from Static Class =================
Elapsed time: 9.276437031s
Function from Class ========================
Elapsed time: 9.225660708s
Inline Function ============================
Elapsed time: 5.9512E-5s
Upvotes: 2
Views: 1555
Reputation: 340903
I ported your test case to caliper:
import com.google.caliper.SimpleBenchmark;

public class ToInt extends SimpleBenchmark {

    private byte[] n;
    private int total;

    @Override
    protected void setUp() throws Exception {
        n = new byte[4];
    }

    public int timeStaticClass(int reps) {
        for (int i = 0; i < reps; i++) {
            total += StaticClass.toInt(n);
        }
        return total;
    }

    public int timeFromClass(int reps) {
        for (int i = 0; i < reps; i++) {
            total += toInt(n);
        }
        return total;
    }

    public int timeInline(int reps) {
        for (int i = 0; i < reps; i++) {
            int actual = 0;
            int len = n.length;
            for (int i1 = 0; i1 < len; i1++) {
                actual += n[len - 1 - i1] << 8 * i1;
            }
            total += actual;
        }
        return total;
    }

    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}

class StaticClass {
    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}
And indeed it seems like the inlined version is the slowest, while the two static versions are almost the same (as expected).
The reasons are hard to pin down. I can think of two factors:
- The JVM is better at performing micro-optimizations when code blocks are as small and as simple to reason about as possible. When the function is inlined, the whole piece of code becomes more complex and the JVM gives up; with the smaller toInt() function the JIT is more clever (one way to peek at what HotSpot actually decides is sketched below).
- Cache locality - somehow the JVM performs better with two small chunks of code (the loop and the method) rather than one bigger one.
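Not part of the test above, but one way to peek at what HotSpot actually compiles and inlines is to run the question's benchmark with the JVM's diagnostic output switched on (PrintCompilation is a regular HotSpot flag; PrintInlining additionally needs the diagnostic options unlocked):
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining FunctionCallSpeed
If StaticClass.toInt() and FunctionCallSpeed.toInt() show up as inlined in that output, all three variants should end up as essentially the same compiled loop, and any remaining difference comes from how the surrounding code gets optimised.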
Upvotes: 3
Reputation: 63104
Your test is flawed. The second test has the benefit of the first test having already been run. You need to run each test case in its own JVM invocation.
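For example (my own sketch, not code from this answer), you could pick the variant with a command-line argument so that each JVM run measures only one case; StaticClass and FunctionCallSpeed are the classes from the question:
// Hypothetical one-test-per-JVM launcher; run each variant separately, e.g.
//   java FunctionCallSpeedOne static
//   java FunctionCallSpeedOne plain
//   java FunctionCallSpeedOne inline
public class FunctionCallSpeedOne {
    static int sink; // keep the results so the loops cannot be discarded as dead code

    public static void main(String[] args) {
        byte[] n = new byte[4];
        long start = System.nanoTime();
        if (args[0].equals("static")) {
            for (int i = 0; i < 50000000; i++) sink += StaticClass.toInt(n);
        } else if (args[0].equals("plain")) {
            for (int i = 0; i < 50000000; i++) sink += FunctionCallSpeed.toInt(n);
        } else { // "inline"
            for (int i = 0; i < 50000000; i++)
                for (int j = 0; j < 4; j++) sink += n[3 - j] << 8 * j;
        }
        System.out.println(args[0] + ": "
                + (System.nanoTime() - start) / 1000000000.0 + "s (sink=" + sink + ")");
    }
}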
Upvotes: 0
Reputation: 533710
You have several problems, but the main one is that you are timing a single run of each piece of code, so what you measure depends on how far the JIT has optimised it. That is sure to give you mixed results. I suggest running the test for 2 seconds, ignoring the first 10,000 iterations or so.
If the result of a loop is not kept, the entire loop can be discarded after some random interval.
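Before the refactored test below, here is roughly what that advice looks like on its own (a sketch of mine, not the answer's code, meant to be dropped into the benchmark class; the warm-up count and the two-second budget are simply the numbers suggested above):
// Hypothetical harness: warm up first so the JIT has compiled toInt(),
// then measure in batches for about two seconds and keep the result.
static int sink;

static void measureStaticToInt(byte[] n) {
    for (int i = 0; i < 10000; i++) {                  // warm-up, not timed
        sink += StaticClass.toInt(n);
    }
    long start = System.nanoTime();
    long calls = 0;
    while (System.nanoTime() - start < 2000000000L) {  // run for roughly 2 seconds
        for (int i = 0; i < 1000000; i++) {            // batch the calls so nanoTime() stays out of the hot loop
            sink += StaticClass.toInt(n);
        }
        calls += 1000000;
    }
    long elapsed = System.nanoTime() - start;
    System.out.println(elapsed / calls + " ns/call (sink=" + sink + ")");
}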
Breaking each test into a separate method, so each loop can be optimised on its own:
public class FunctionCallSpeed {
    public static final int numIter = 50000000;
    private static int dontOptimiseAway;

    public static void main(String[] args) {
        byte[] n = new byte[4];
        for (int i = 0; i < 10; i++) {
            test1(n);
            test2(n);
            test3(n);
            System.out.println();
        }
    }

    private static void test1(byte[] n) {
        System.out.print("from Static Class: ");
        long start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            dontOptimiseAway = FunctionCallSpeed.toInt(n);
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    private static void test2(byte[] n) {
        long start;
        System.out.print("from Class: ");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            dontOptimiseAway = toInt(n);
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    private static void test3(byte[] n) {
        long start;
        int actual = 0;
        int len = n.length;
        System.out.print("Inlined: ");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            for (int j = 0; j < len; j++) {
                actual += n[len - 1 - j] << 8 * j;
            }
            dontOptimiseAway = actual;
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}
prints
from Class: 7ns Inlined: 11ns from Static Class: 9ns
from Class: 6ns Inlined: 8ns from Static Class: 8ns
from Class: 6ns Inlined: 9ns from Static Class: 6ns
This suggests that when the inner loop is optimised separately, it is slightly more efficient.
However, if I use an optimised conversion of bytes to int
public static int toInt(byte[] num) {
    return num[0] + (num[1] << 8) + (num[2] << 16) + (num[3] << 24);
}
all the tests report
from Static Class: 0ns from Class: 0ns Inlined: 0ns
from Static Class: 0ns from Class: 0ns Inlined: 0ns
from Static Class: 0ns from Class: 0ns Inlined: 0ns
as it's realised the test doesn't do anything useful. ;)
Upvotes: 3
Reputation: 82579
It's always hard to guarantee what the JIT is doing, but if I had to guess, it noticed that the return value of the function was never being used and optimized a lot of it out.
If you actually use the return value of your function, I bet the speed changes.
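For example (my tweak, not something from this answer), the question's first timing loop could keep and print what it computes, which makes the call impossible to throw away as dead code:
// Hypothetical change to the question's first loop: accumulate and print
// the return value so the JIT cannot eliminate the call as unused.
int sink = 0;
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
    sink += StaticClass.toInt(n);
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000
        + "s (sink=" + sink + ")");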
Upvotes: 5