byteit101

Reputation: 4010

How to make ByteBuffer as efficient as direct byte[] access after JIT has warmed up?

I am trying to optimize a simple decompression routine, and came across a weird performance quirk that I can't find much information on: a manually implemented trivial byte buffer is 10%-20% faster than the built-in byte buffers (heap & mapped) for trivial operations (read one byte, read n bytes, test for end of stream).

I tested 3 APIs:

The trivial wrapper:

class TestBuf {
    private final byte[] ary;
    private int pos = 0;

    public TestBuf(ByteBuffer buffer) {  // ctor #1
        ary = new byte[buffer.remaining()];
        buffer.get(ary);
    }
    
    public TestBuf(byte[] inAry) { // ctor #2
        ary = inAry;
    }

    public int readUByte() { return ary[pos++] & 0xFF; }

    public boolean hasRemaining() { return pos < ary.length; }

    public void get(byte[] out, int offset, int length) {
        System.arraycopy(ary, pos, out, offset, length);
        pos += length;
    }
}

The stripped-down core of my main loop is roughly a pattern of:

while (buffer.hasRemaining()) {
    int op = buffer.readUByte();
    if (op == 1) {
        int size = buffer.readUByte();
        buffer.get(outputArray, outputPos, size);
        outputPos += size;
    } // ...
}
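For comparison, here is a sketch of the same loop expressed directly against the ByteBuffer API (the `decompress` method name and the harness around it are mine, not from the original code; the body is operation-for-operation equivalent to the loop above):

```java
import java.nio.ByteBuffer;

public class BufferLoop {
    // Same shape as the TestBuf loop, but using ByteBuffer's own
    // position/limit bookkeeping instead of a manual pos field.
    static int decompress(ByteBuffer buffer, byte[] outputArray) {
        int outputPos = 0;
        while (buffer.hasRemaining()) {
            int op = buffer.get() & 0xFF;              // readUByte equivalent
            if (op == 1) {
                int size = buffer.get() & 0xFF;
                buffer.get(outputArray, outputPos, size); // bulk copy, advances position
                outputPos += size;
            } // ...
        }
        return outputPos;
    }

    public static void main(String[] args) {
        // Tiny made-up stream: opcode 1, length 3, payload "abc"
        ByteBuffer in = ByteBuffer.wrap(new byte[]{1, 3, 'a', 'b', 'c'});
        byte[] out = new byte[16];
        int n = decompress(in, out);
        System.out.println(n + " " + new String(out, 0, n));
    }
}
```

Every call here (`get()`, `get(byte[], int, int)`, `hasRemaining()`) does a bounds/position check internally, which is the bookkeeping the TestBuf wrapper avoids.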

I tested the following combos:

I used JMH (blackholing each outputArray) and tested Java 17 on OpenJDK and GraalVM with a decompression corpus of ~5 GiB preloaded into RAM, containing ~150,000 items ranging in size from 2 KiB to 15 MiB. Each corpus took ~10 sec to decompress, and the JMH runs had proper warmup and iterations. I did strip the tests down to the minimal necessary non-array code, but even when benchmarking the original code this came from, the difference is nearly the same percentage (i.e. I don't think there is much beyond the buffer/array accesses controlling the performance of my original code).

Across several computers the results were a bit jittery, but relatively consistent:

Two of these results surprised me the most:

I've tried looking around for information on what I might be doing wrong, but most of the performance questions are about native memory and MappedByteBuffers, whereas I am using HeapByteBuffers, as far as I can tell. Why are HeapByteBuffers so slow compared to my re-implementation for trivial read access? Is there some way I can use HeapByteBuffers more efficiently? Does that also apply for MappedByteBuffer?
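One commonly suggested technique (an assumption on my part, not something the question tried) is to drop down to the backing array of a heap buffer via `array()`/`arrayOffset()` and do the hot-loop reads as plain array accesses. This only works when `hasArray()` returns true, i.e. for non-direct, non-read-only buffers; a MappedByteBuffer has no accessible backing array, so it does not apply there:

```java
import java.nio.ByteBuffer;

public class ArrayView {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.wrap(new byte[]{10, 20, 30});
        buffer.get(); // consume one byte; position is now 1

        if (buffer.hasArray()) { // true for heap buffers that aren't read-only
            byte[] ary = buffer.array();
            // arrayOffset() matters for buffers created via slice()/wrap(ary, off, len)
            int start = buffer.arrayOffset() + buffer.position();
            int end = buffer.arrayOffset() + buffer.limit();
            int sum = 0;
            for (int i = start; i < end; i++) {
                sum += ary[i] & 0xFF; // plain array access, no per-read bookkeeping
            }
            System.out.println(sum); // 20 + 30
        }
    }
}
```

This is essentially what ctor #2 of TestBuf gives you for free, minus the defensive copy in ctor #1.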

Update: I've posted the full benchmark, corpus generator, and algorithms at https://gist.github.com/byteit101/84a3ab8f292de404e122562c7008c133 Note that while trying to get the corpus generator to work, I discovered that my 24-bit number was causing a performance penalty, so I added a buffer-buffer target, where copying a buffer to a new buffer and using the new buffer is faster than using the original buffer after the 24-bit number.

One run on one of my machines with the generated corpus:

Benchmark                  Mode  Cnt  Score   Error  Units
SOBench.t1_native_array      ss   60  0.891 ± 0.018   s/op
SOBench.t2_buffer_testbuf    ss   60  0.899 ± 0.024   s/op
SOBench.t3_buffer_buffer     ss   60  0.935 ± 0.024   s/op
SOBench.t4_native_buffer     ss   60  1.099 ± 0.024   s/op

Some more recent observations: deleting unused code (see comments in the gist) makes ByteBuffer as fast as a native array, as do slight tweaks (changing bitmask conditionals to logical comparisons), so my current theory is that it's some inlining cache miss, with something offset-related involved too.
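If the inlining theory is right, HotSpot's diagnostic flags can help confirm it. A sketch of the invocation (`benchmarks.jar` is a placeholder for the actual JMH uberjar name):

```shell
# Print JIT inlining decisions (diagnostic output; format varies by JDK build).
# "benchmarks.jar" stands in for the real JMH uberjar.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -jar benchmarks.jar
# Lines containing reasons like "too large" or "not inlinable" point at
# ByteBuffer methods that fell out of the inlining budget.
```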

Upvotes: 5

Views: 983

Answers (1)

Stef

Reputation: 21

I think there is a regression with Java 17. I'm using a lib which processes a String. Several times new copies are created via String#split or String#getBytes. So I tried out an alternative implementation with ByteBuffer.

With Java 11 this solution is roughly 30% faster than the original String-based version.

time: 129 vs 180 ns/op
gc.alloc.rate: 2931 vs 3861 MB/sec
gc.count: 300 vs 323
gc.time: 172 vs 178 ms

With Java 17 this changed: the ByteBuffer version deteriorated, while the String version improved.

time: 143 vs 146 ns/op
gc.alloc.rate: 2889 vs 4781 MB/sec
gc.count: 426 vs 586
gc.time: 240 vs 305 ms

Even gc.count and gc.time increased.

Upvotes: 2
