Reputation: 4010
I am trying to optimize a simple decompression routine and came across a weird performance quirk that I can't seem to find much information on: manually implemented trivial byte buffers are 10%-20% faster than the built-in byte buffers (heap & mapped) for trivial operations (read one byte, read n bytes, check end of stream).
I tested 3 APIs: a plain byte[], ByteBuffer.wrap(byte[]), and this trivial wrapper:
class TestBuf {
    private final byte[] ary;
    private int pos = 0;

    public TestBuf(ByteBuffer buffer) { // ctor #1: copies the buffer's contents out
        ary = new byte[buffer.remaining()];
        buffer.get(ary);
    }

    public TestBuf(byte[] inAry) { // ctor #2: wraps the array directly, no copy
        ary = inAry;
    }

    public int readUByte() { return ary[pos++] & 0xFF; }

    public boolean hasRemaining() { return pos < ary.length; }

    public void get(byte[] out, int offset, int length) {
        System.arraycopy(ary, pos, out, offset, length);
        pos += length;
    }
}
The stripped-down core of my main loop is roughly this pattern:
while (buffer.hasRemaining()) {
    int op = buffer.readUByte();
    if (op == 1) {
        int size = buffer.readUByte();
        buffer.get(outputArray, outputPos, size);
        outputPos += size;
    } // ...
}
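For comparison, the ByteBuffer-accepting variant (the native-buffer combo described below) is essentially the same loop against the standard ByteBuffer API. A sketch, assuming the same opcode layout:

// Sketch of the same loop against a plain ByteBuffer; relative get() calls
// advance the position, just like readUByte() above.
while (buffer.hasRemaining()) {
    int op = buffer.get() & 0xFF;                 // unsigned byte read
    if (op == 1) {
        int size = buffer.get() & 0xFF;
        buffer.get(outputArray, outputPos, size); // bulk read into the output
        outputPos += size;
    } // ...
}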
I tested the following combos:
- native-array: passing byte[] to a byte[]-accepting method (no copies)
- native-testbuf: passing byte[] to a method that wrapped it in a TestBuf (no copies, ctor #2)
- native-buffer: passing ByteBuffer.wrap(byte[]) to a ByteBuffer-accepting method (no copies)
- buffer-array: passing ByteBuffer.wrap(byte[]) to a method that extracted the ByteBuffer's contents to an array
- buffer-testbuf: passing ByteBuffer.wrap(byte[]) to a method that extracted the ByteBuffer's contents to an array inside a TestBuf (ctor #1)
I used JMH (blackholing each outputArray) and tested Java 17 on OpenJDK and GraalVM, with a decompression corpus of ~5GiB preloaded into RAM, containing ~150,000 items ranging in size from 2KiB to 15MiB. Each corpus took ~10sec to decompress, and the JMH runs had proper warmup and iterations. I did strip the tests down to the minimal necessary non-array code, but even benchmarking the original code this came from, the difference is nearly the same percentage (i.e. I don't think there is much else beyond the buffer/array accesses controlling the performance of my original code).
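For a sense of the harness shape, here is a minimal sketch (not the actual gist code; the corpus loading and decompress bodies are placeholders):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class SOBenchSketch {
    private byte[][] corpus;

    @Setup
    public void setup() {
        corpus = new byte[0][]; // placeholder: really ~150,000 items preloaded into RAM
    }

    @Benchmark
    public void nativeArray(Blackhole bh) {
        for (byte[] item : corpus) {
            bh.consume(decompress(item)); // blackhole each output so it isn't optimized away
        }
    }

    private byte[] decompress(byte[] in) {
        return in; // placeholder for the real byte[]-accepting decoder
    }
}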
Across several computers the results were a bit jittery, but relatively consistent:
- native-array and native-testbuf were the fastest options, tying within the margin of error (under 0.5%) thanks to the optimizer (9.3s/corpus)
- native-buffer was always the slowest option, 17-22% slower than the fastest native-array/native-testbuf (11.4s/corpus)
- buffer-array and buffer-testbuf were in the middle of the pack, within about 1% of each other but about 4-7% slower than native-array; however, despite the additional array copy they incurred, they were always significantly faster than native-buffer, by about 15-17% (9.7s/corpus)
Two of these results surprised me the most:
- that a plain ByteBuffer (native-buffer) is so slow compared to a custom simple ByteBuffer-like wrapper (native-testbuf)
- that copying the ByteBuffer's contents out to an array first (buffer-*) is still so much faster than using the ByteBuffer.wrap object itself (native-buffer)
I've tried looking around for information on what I might be doing wrong, but most of the performance questions are about native memory and MappedByteBuffers, whereas I am using HeapByteBuffers, as far as I can tell. Why are HeapByteBuffers so slow compared to my re-implementation for trivial read access? Is there some way I can use HeapByteBuffers more efficiently? Does that also apply to MappedByteBuffer?
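(One commonly suggested pattern for the last question, sketched here with only the standard accessors hasArray(), array(), and arrayOffset(): detect an array-backed buffer, run the hot loop over its backing array, and commit the position once at the end. sumUBytes is just a stand-in hot loop, not code from the question.)

// If the buffer is heap-backed (true for ByteBuffer.wrap), work on the backing
// array directly, which is essentially what TestBuf ctor #1 does minus the copy.
static int sumUBytes(java.nio.ByteBuffer buffer) {
    if (buffer.hasArray()) {
        byte[] ary = buffer.array();
        int pos = buffer.arrayOffset() + buffer.position();
        int end = buffer.arrayOffset() + buffer.limit();
        int sum = 0;
        while (pos < end) {
            sum += ary[pos++] & 0xFF;    // plain array access in the hot loop
        }
        buffer.position(buffer.limit()); // commit the consumed bytes
        return sum;
    }
    int sum = 0;                         // fallback for direct/mapped buffers
    while (buffer.hasRemaining()) {
        sum += buffer.get() & 0xFF;
    }
    return sum;
}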
Update: I've posted the full benchmark, corpus generator, and algorithms at https://gist.github.com/byteit101/84a3ab8f292de404e122562c7008c133. Note that while trying to get the corpus generator to work, I discovered that my 24-bit number was causing a performance penalty, so I added a buffer-buffer target, where copying the buffer to a new buffer and using the new buffer is faster than using the original buffer after the 24-bit number.
One run on one of my machines with the generated corpus:
Benchmark Mode Cnt Score Error Units
SOBench.t1_native_array ss 60 0.891 ± 0.018 s/op
SOBench.t2_buffer_testbuf ss 60 0.899 ± 0.024 s/op
SOBench.t3_buffer_buffer ss 60 0.935 ± 0.024 s/op
SOBench.t4_native_buffer ss 60 1.099 ± 0.024 s/op
Some more recent observations: deleting unused code (see comments in the gist) makes ByteBuffer as fast as a native array, as do slight tweaks (changing bitmask conditionals to logical comparisons), so my current theory is that it's some inlining or cache-miss issue, with something offset-related involved too.
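As a hypothetical illustration of that tweak (not the gist's actual opcode layout): for an unsigned byte value, these two conditions are equivalent, but they can compile differently.

static boolean longFormMask(int op)    { return (op & 0x80) != 0; } // bitmask conditional
static boolean longFormCompare(int op) { return op >= 0x80; }       // equivalent logical comparison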
Upvotes: 5
Views: 983
Reputation: 21
I think there is a regression with Java 17. I'm using a lib which processes a String; several times, new copies are created through String.split or String.getBytes. So I tried out an alternative implementation with ByteBuffer.
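As a hypothetical illustration of the two approaches being compared (the field counting and comma delimiter are made up, not the lib's actual logic):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FieldCount {
    // String version: split() allocates an array plus one substring per field.
    static int countFieldsSplit(String line) {
        return line.split(",").length;
    }

    // ByteBuffer version: scan the bytes and count delimiters, no per-field allocation.
    static int countFieldsBuffer(String line) {
        ByteBuffer buf = ByteBuffer.wrap(line.getBytes(StandardCharsets.US_ASCII));
        int fields = 1;
        while (buf.hasRemaining()) {
            if (buf.get() == ',') fields++;
        }
        return fields;
    }
}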
With Java 11 this solution is roughly 30% faster than the original String-based version:
time: 129 vs 180 ns/op
gc.alloc.rate: 2931 vs 3861 MB/sec
gc.count: 300 vs 323
gc.time: 172 vs 178 ms
With Java 17 this changed: the ByteBuffer version deteriorated, while the String version improved.
time: 143 vs 146 ns/op
gc.alloc.rate: 2889 vs 4781 MB/sec
gc.count: 426 vs 586
gc.time: 240 vs 305 ms
Even gc.count and gc.time increased.
Upvotes: 2