Reputation: 4010
I am trying to optimize a simple decompression routine and came across a weird performance quirk that I can't seem to find much information on: manually implemented trivial byte buffers are 10%-20% faster than the built-in byte buffers (heap & mapped) for trivial operations (read one byte, read n bytes, check end of stream).
I tested 3 APIs: a plain byte[], ByteBuffer.wrap(byte[]), and this trivial wrapper:
class TestBuf {
    private final byte[] ary;
    private int pos = 0;

    public TestBuf(ByteBuffer buffer) { // ctor #1: copies the buffer's contents out
        ary = new byte[buffer.remaining()];
        buffer.get(ary);
    }

    public TestBuf(byte[] inAry) { // ctor #2: wraps the array directly, no copy
        ary = inAry;
    }

    public int readUByte() { return ary[pos++] & 0xFF; }

    public boolean hasRemaining() { return pos < ary.length; }

    public void get(byte[] out, int offset, int length) {
        System.arraycopy(ary, pos, out, offset, length);
        pos += length;
    }
}
The stripped-down core of my main loop is roughly this pattern:
while (buffer.hasRemaining()) {
    int op = buffer.readUByte();
    if (op == 1) {
        int size = buffer.readUByte();
        buffer.get(outputArray, outputPos, size);
        outputPos += size;
    } // ...
}
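For comparison, the ByteBuffer-accepting variant (the native-buffer combo described below) is essentially the same loop against the standard ByteBuffer API. A sketch, assuming the same opcode layout:

// Sketch of the same loop against a plain ByteBuffer; relative get() calls
// advance the position, just like readUByte() above.
while (buffer.hasRemaining()) {
    int op = buffer.get() & 0xFF;                 // unsigned byte read
    if (op == 1) {
        int size = buffer.get() & 0xFF;
        buffer.get(outputArray, outputPos, size); // bulk read into the output
        outputPos += size;
    } // ...
}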
I tested the following combos:
- native-array: passing byte[] to a byte[]-accepting method (no copies)
- native-testbuf: passing byte[] to a method that wrapped it in a TestBuf (no copies, ctor #2)
- native-buffer: passing ByteBuffer.wrap(byte[]) to a ByteBuffer-accepting method (no copies)
- buffer-array: passing ByteBuffer.wrap(byte[]) to a method that extracted the ByteBuffer's contents to an array
- buffer-testbuf: passing ByteBuffer.wrap(byte[]) to a method that extracted the ByteBuffer's contents to an array inside a TestBuf (ctor #1)
I used JMH (blackholing each outputArray) and tested Java 17 on OpenJDK and GraalVM, with a decompression corpus of ~5GiB preloaded into RAM, containing ~150,000 items ranging in size from 2KiB to 15MiB. Each corpus took ~10sec to decompress, and the JMH runs had proper warmup and iterations. I did strip the tests down to the minimal necessary non-array code, but even benchmarking the original code this came from, the difference is nearly the same percentage (i.e. I don't think there is much else beyond the buffer/array accesses controlling the performance of my original code).
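For a sense of the harness shape, here is a minimal sketch (not the actual gist code; the corpus loading and decompress bodies are placeholders):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class SOBenchSketch {
    private byte[][] corpus;

    @Setup
    public void setup() {
        corpus = new byte[0][]; // placeholder: really ~150,000 items preloaded into RAM
    }

    @Benchmark
    public void nativeArray(Blackhole bh) {
        for (byte[] item : corpus) {
            bh.consume(decompress(item)); // blackhole each output so it isn't optimized away
        }
    }

    private byte[] decompress(byte[] in) {
        return in; // placeholder for the real byte[]-accepting decoder
    }
}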
Across several computers the results were a bit jittery, but relatively consistent:
- native-array and native-testbuf were the fastest options, tying within the margin of error (under 0.5%) thanks to the optimizer (9.3s/corpus)
- native-buffer was always the slowest option, 17-22% slower than the fastest native-array/native-testbuf (11.4s/corpus)
- buffer-array and buffer-testbuf were in the middle of the pack, within about 1% of each other but about 4-7% slower than native-array; however, despite the additional array copy they incurred, they were always significantly faster than native-buffer, by about 15-17% (9.7s/corpus)
Two of these results surprised me the most:
- that a plain ByteBuffer (native-buffer) is so slow compared to a custom simple ByteBuffer-like wrapper (native-testbuf)
- that copying the ByteBuffer's contents out to an array first (buffer-*) is still so much faster than using the ByteBuffer.wrap object itself (native-buffer)
I've tried looking around for information on what I might be doing wrong, but most of the performance questions are about native memory and MappedByteBuffers, whereas I am using HeapByteBuffers, as far as I can tell. Why are HeapByteBuffers so slow compared to my re-implementation for trivial read access? Is there some way I can use HeapByteBuffers more efficiently? Does that also apply to MappedByteBuffer?
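(One commonly suggested pattern for the last question, sketched here with only the standard accessors hasArray(), array(), and arrayOffset(): detect an array-backed buffer, run the hot loop over its backing array, and commit the position once at the end. sumUBytes is just a stand-in hot loop, not code from the question.)

// If the buffer is heap-backed (true for ByteBuffer.wrap), work on the backing
// array directly, which is essentially what TestBuf ctor #1 does minus the copy.
static int sumUBytes(java.nio.ByteBuffer buffer) {
    if (buffer.hasArray()) {
        byte[] ary = buffer.array();
        int pos = buffer.arrayOffset() + buffer.position();
        int end = buffer.arrayOffset() + buffer.limit();
        int sum = 0;
        while (pos < end) {
            sum += ary[pos++] & 0xFF;    // plain array access in the hot loop
        }
        buffer.position(buffer.limit()); // commit the consumed bytes
        return sum;
    }
    int sum = 0;                         // fallback for direct/mapped buffers
    while (buffer.hasRemaining()) {
        sum += buffer.get() & 0xFF;
    }
    return sum;
}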
Update: I've posted the full benchmark, corpus generator, and algorithms at https://gist.github.com/byteit101/84a3ab8f292de404e122562c7008c133. Note that while trying to get the corpus generator to work, I discovered that my 24-bit number was causing a performance penalty, so I added a buffer-buffer target, where copying the buffer to a new buffer and using the new buffer is faster than using the original buffer after the 24-bit number.
One run on one of my machines with the generated corpus:
Benchmark Mode Cnt Score Error Units
SOBench.t1_native_array ss 60 0.891 ± 0.018 s/op
SOBench.t2_buffer_testbuf ss 60 0.899 ± 0.024 s/op
SOBench.t3_buffer_buffer ss 60 0.935 ± 0.024 s/op
SOBench.t4_native_buffer ss 60 1.099 ± 0.024 s/op
Some more recent observations: deleting unused code (see comments in the gist) makes ByteBuffer as fast as a native array, as do slight tweaks (changing bitmask conditionals to logical comparisons), so my current theory is that it's some inlining or cache-miss issue, with something offset-related involved too.
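As a hypothetical illustration of that tweak (not the gist's actual opcode layout): for an unsigned byte value, these two conditions are equivalent, but they can compile differently.

static boolean longFormMask(int op)    { return (op & 0x80) != 0; } // bitmask conditional
static boolean longFormCompare(int op) { return op >= 0x80; }       // equivalent logical comparison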
Upvotes: 5
Views: 983
Reputation: 21
I think there is a regression with Java 17. I'm using a lib which processes a String; several times, new copies are created through String.split or String.getBytes. So I tried out an alternative implementation with ByteBuffer.
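As a hypothetical illustration of the two approaches being compared (the field counting and comma delimiter are made up, not the lib's actual logic):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FieldCount {
    // String version: split() allocates an array plus one substring per field.
    static int countFieldsSplit(String line) {
        return line.split(",").length;
    }

    // ByteBuffer version: scan the bytes and count delimiters, no per-field allocation.
    static int countFieldsBuffer(String line) {
        ByteBuffer buf = ByteBuffer.wrap(line.getBytes(StandardCharsets.US_ASCII));
        int fields = 1;
        while (buf.hasRemaining()) {
            if (buf.get() == ',') fields++;
        }
        return fields;
    }
}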
With Java 11 this solution is roughly 30% faster than the original String-based version:
time: 129 vs 180 ns/op
gc.alloc.rate: 2931 vs 3861 MB/sec
gc.count: 300 vs 323
gc.time: 172 vs 178 ms
With Java 17 this changed: the ByteBuffer version deteriorated, while the String version improved.
time: 143 vs 146 ns/op
gc.alloc.rate: 2889 vs 4781 MB/sec
gc.count: 426 vs 586
gc.time: 240 vs 305 ms
Even gc.count and gc.time increased.
Upvotes: 2