Reputation: 418
While testing the read performance of a direct java.nio.ByteBuffer, I noticed that absolute reads are on average about 2x faster than relative reads. When I compare the source code of the relative and absolute reads, the code is pretty much the same, except that the relative read maintains an internal counter. Why do I see such a considerable difference in speed?
Below is the source code of my JMH benchmark:
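import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class DirectByteBufferReadBenchmark {

    // One record is a long (8 bytes), an int (4 bytes) and a byte (1 byte).
    private static final int OBJ_SIZE = 8 + 4 + 1;
    private static final int NUM_ELEM = 10_000_000;

    @State(Scope.Benchmark)
    public static class Data {
        private ByteBuffer directByteBuffer;

        @Setup
        public void setup() {
            directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * NUM_ELEM);
            for (int i = 0; i < NUM_ELEM; i++) {
                directByteBuffer.putLong(i);
                directByteBuffer.putInt(i);
                directByteBuffer.put((byte) (i & 1));
            }
        }
    }

    // Absolute reads: every call gets an explicit index, the buffer position is untouched.
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadAbsolute(Data d) throws InterruptedException {
        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            int index = OBJ_SIZE * i;
            val += d.directByteBuffer.getLong(index);
            d.directByteBuffer.getInt(index + 8);
            d.directByteBuffer.get(index + 12);
        }
        return val;
    }

    // Relative reads: every call advances the buffer's internal position.
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadRelative(Data d) throws InterruptedException {
        d.directByteBuffer.rewind();
        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            val += d.directByteBuffer.getLong();
            d.directByteBuffer.getInt();
            d.directByteBuffer.get();
        }
        return val;
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(DirectByteBufferReadBenchmark.class.getSimpleName())
                .warmupIterations(5)
                .measurementIterations(5)
                .forks(3)
                .threads(1)
                .build();
        new Runner(opt).run();
    }
}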
And these are the results of my benchmark run:
Benchmark                                        Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   15  88.605 ± 9.276  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   15  42.904 ± 3.018  ops/s
The test was run on a MacBook Pro (2.2 GHz Intel Core i7, 16 GB DDR3) with JDK 1.8.0_73.
UPDATE
I ran the same test with JDK 9-ea b134. Both tests show a ~10% speed increase, but the speed difference between the two remains similar.
# JMH 1.13 (released 45 days ago)
# VM version: JDK 9-ea, VM 9-ea+134
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/bin/java
# VM options: <none>
Benchmark                                        Mode  Cnt    Score    Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   15  102.170 ± 10.199  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   15   45.988 ±  3.896  ops/s
Upvotes: 14
Views: 2240
Reputation: 98334
JDK 8 indeed generates worse code for the loop with relative ByteBuffer access.
JMH has a built-in perfasm profiler that prints the generated assembly code for the hottest regions. I've used it to compare the compiled testReadAbsolute vs. testReadRelative, and here are the main differences:
Relative getLong / getInt / get update the position field of the ByteBuffer. The VM does not optimize these updates away: there are 3 memory writes on each loop iteration.
The position range check is not eliminated: conditional branches remain in the compiled code on each loop iteration.
Since redundant field updates and range checks make the loop body longer, the VM unrolls only 2 iterations of the loop. The compiled version of the loop with absolute access has 16 iterations unrolled.
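To make the first two differences concrete, here is a minimal sketch of what distinguishes a relative read from an absolute one. TinyBuffer is a hypothetical, heavily simplified stand-in written for illustration only; it is not the JDK's DirectByteBuffer code, and it says nothing about what the JIT actually emits.

```java
import java.nio.BufferUnderflowException;

public class RelativeVsAbsoluteSketch {

    // Hypothetical, simplified stand-in for a buffer; NOT the JDK source.
    static final class TinyBuffer {
        private final byte[] data;
        private int position; // state that only relative reads touch

        TinyBuffer(byte[] data) {
            this.data = data;
        }

        // Absolute read: checks the caller-supplied index, changes no state.
        byte get(int index) {
            if (index < 0 || index >= data.length) {
                throw new IndexOutOfBoundsException();
            }
            return data[index];
        }

        // Relative read: checks the position field, then writes it back.
        byte get() {
            if (position >= data.length) {
                throw new BufferUnderflowException();
            }
            return data[position++];
        }

        int position() {
            return position;
        }
    }

    // Loop over absolute reads: the index is a plain loop variable.
    static int sumAbsolute(TinyBuffer buf, int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum += buf.get(i);
        }
        return sum;
    }

    // Loop over relative reads: each call mutates buf.position.
    static int sumRelative(TinyBuffer buf, int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum += buf.get();
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, 3, 4};
        TinyBuffer a = new TinyBuffer(bytes);
        TinyBuffer r = new TinyBuffer(bytes);
        System.out.println(sumAbsolute(a, 4) + " position=" + a.position());
        System.out.println(sumRelative(r, 4) + " position=" + r.position());
    }
}
```

Both loops compute the same sum, but only the relative one leaves the buffer in a different state afterwards. That per-call write to a field of a heap object is the kind of side effect the compiler has to prove safe before it can drop it.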
testReadAbsolute is compiled very well: the main loop just reads 16 longs, sums them up and jumps to the next iteration if index < 10_000_000 - 16. The state of directByteBuffer is not updated. However, the JVM is not that smart for testReadRelative: it seems that it cannot optimize field access of an object from outside.
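For readers unfamiliar with loop unrolling, the following sketch writes out by hand roughly the shape of loop the compiler produces for the absolute case. The assumptions are mine: it unrolls by 4 rather than 16 to stay short, it stores longs back to back (stride 8) rather than using the benchmark's 13-byte records, and the class and method names are hypothetical. The JIT does this transformation on machine code, not on Java source.

```java
import java.nio.ByteBuffer;

public class UnrollSketch {

    // Straightforward loop: one absolute read per iteration.
    static long sumSimple(ByteBuffer buf, int count) {
        long sum = 0L;
        for (int i = 0; i < count; i++) {
            sum += buf.getLong(i * Long.BYTES);
        }
        return sum;
    }

    // Hand-unrolled by 4: fewer loop-condition checks per element read.
    static long sumUnrolledBy4(ByteBuffer buf, int count) {
        long sum = 0L;
        int i = 0;
        // Main loop runs while at least 4 elements remain.
        for (; i < count - 3; i += 4) {
            int base = i * Long.BYTES;
            sum += buf.getLong(base)
                 + buf.getLong(base + 8)
                 + buf.getLong(base + 16)
                 + buf.getLong(base + 24);
        }
        // Tail loop handles the leftover 0..3 elements.
        for (; i < count; i++) {
            sum += buf.getLong(i * Long.BYTES);
        }
        return sum;
    }

    // Fills a direct buffer with the values 0, 1, ..., count - 1.
    static ByteBuffer filled(int count) {
        ByteBuffer buf = ByteBuffer.allocateDirect(count * Long.BYTES);
        for (int i = 0; i < count; i++) {
            buf.putLong(i * Long.BYTES, i);
        }
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = filled(10);
        System.out.println(sumSimple(buf, 10));
        System.out.println(sumUnrolledBy4(buf, 10));
    }
}
```

Unrolling like this is only attractive when the loop body is small and free of side effects, which is why the extra position writes and range checks in the relative version limit how far the compiler goes.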
There was much work in JDK 9 to optimize ByteBuffer. I've run the same test on JDK 9-ea b134 and verified that testReadRelative does not have redundant memory writes and range checks. Now it runs almost as fast as testReadAbsolute.
// JDK 1.8.0_92, VM 25.92-b14
Benchmark                                        Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   10  99,727 ± 0,542  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   10  47,126 ± 0,289  ops/s
// JDK 9-ea, VM 9-ea+134
Benchmark                                        Mode  Cnt    Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   10  109,369 ± 0,403  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   10   97,140 ± 0,572  ops/s
UPDATE
In order to help the JIT compiler with optimization, I've introduced a local variable ByteBuffer directByteBuffer = d.directByteBuffer in both benchmarks. Otherwise the level of indirection does not allow the compiler to eliminate the ByteBuffer.position field updates.
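As a sketch of that change, the relative read loop with the buffer hoisted into a local variable could look like this. It is written as a plain class without JMH so it runs on its own; the Data constructor, the numElem parameter and the class name are assumptions made for this example, while the field name and record layout follow the benchmark in the question.

```java
import java.nio.ByteBuffer;

public class LocalBufferSketch {

    static final int OBJ_SIZE = 8 + 4 + 1;

    // Plays the role of the benchmark's Data state object (no JMH here).
    static final class Data {
        final ByteBuffer directByteBuffer;

        Data(int numElem) {
            directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * numElem);
            for (int i = 0; i < numElem; i++) {
                directByteBuffer.putLong(i);
                directByteBuffer.putInt(i);
                directByteBuffer.put((byte) (i & 1));
            }
        }
    }

    static long readRelative(Data d, int numElem) {
        // Load the field once; the loop then works on a local reference
        // instead of re-reading d.directByteBuffer on every call.
        ByteBuffer directByteBuffer = d.directByteBuffer;
        directByteBuffer.rewind();
        long val = 0L;
        for (int i = 0; i < numElem; i++) {
            val += directByteBuffer.getLong();
            directByteBuffer.getInt();
            directByteBuffer.get();
        }
        return val;
    }

    public static void main(String[] args) {
        Data d = new Data(1000);
        System.out.println(readRelative(d, 1000));
    }
}
```

The result is identical to the original loop; the only difference is that the buffer reference is read from the holder object once, before the loop starts.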
Upvotes: 19