Comparative performance of Hadoop Zlib and JDK Gzip

Question

I am benchmarking single-threaded compression codecs, and the throughput I see for Zlib is significantly higher than I would expect from a single thread. I used org.apache.hadoop.io.compress.zlib.ZlibCompressor for the Zlib implementation and java.util.zip.Deflater for the Gzip implementation to compare against.

Is the ZlibCompressor wrapper provided in Hadoop multi-threaded in some way, perhaps through its JNI interface?

Zlib:

import org.apache.hadoop.io.compress.zlib.*;
protected final ZlibCompressor zlibCompressor = new ZlibCompressor(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION, ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY, ZlibCompressor.CompressionHeader.DEFAULT_HEADER, DEFAULT_BUFFER_SIZE);
protected final ZlibDecompressor zlibDecompressor = new ZlibDecompressor(ZlibDecompressor.CompressionHeader.DEFAULT_HEADER, DEFAULT_BUFFER_SIZE);

//compress
zlibCompressor.setInput(uncompressed, 0, uncompressed.length);
zlibCompressor.finish();
int n = zlibCompressor.compress(compressBuffer, 0, compressBuffer.length);

//decompress
zlibCompressor.reset();
zlibDecompressor.setInput(compressed, 0, compressed.length);
int n = zlibDecompressor.decompress(uncompressBuffer, 0, uncompressBuffer.length);

Gzip:

import java.util.zip.*;
protected final Deflater deflater = new Deflater(COMPRESSION_LEVEL, NO_WRAP);
protected final Inflater inflater = new Inflater(NO_WRAP);

//compress
int n = compressBlockUsingStream(uncompressed, compressBuffer);

//decompress
inflater.reset();
int n = uncompressBlockUsingStream(new InflaterInputStream(new ByteArrayInputStream(compressed), inflater), uncompressBuffer);

Helper functions for Gzip:

protected int compressBlockUsingStream(byte[] uncompressed, byte[] compressBuffer) throws IOException
{
    ByteArrayOutputStream out = new ByteArrayOutputStream(compressBuffer);
    compressToStream(uncompressed, out);
    return out.length();
}

protected int uncompressBlockUsingStream(InputStream in, byte[] uncompressBuffer) throws IOException
{
    ByteArrayOutputStream out = new ByteArrayOutputStream(uncompressBuffer);
    byte[] buffer = new byte[4096];
    int count;
    while ((count = in.read(buffer)) >= 0) {
        out.write(buffer, 0, count);
    }
    in.close();
    out.close();
    return out.length();
}
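
The Zlib snippet above drives the codec through block calls (setInput / finish / compress), while the JDK path goes through stream wrappers and a 4096-byte copy loop, so the two measurements may not be strictly like for like. As a minimal sketch of what a block-style JDK counterpart could look like, assuming raw DEFLATE (nowrap) and the default compression level: the class name JdkBlockCodec and its two methods are hypothetical and not part of the benchmark code in the question.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration; not part of the benchmark in the question.
class JdkBlockCodec {
    // nowrap = true produces raw DEFLATE data without a zlib or gzip header
    private static final boolean NO_WRAP = true;

    private final Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, NO_WRAP);
    private final Inflater inflater = new Inflater(NO_WRAP);

    /** Compresses one block into compressBuffer and returns the compressed length. */
    int compressBlock(byte[] uncompressed, byte[] compressBuffer) {
        deflater.reset();
        deflater.setInput(uncompressed, 0, uncompressed.length);
        deflater.finish();
        int n = 0;
        while (!deflater.finished() && n < compressBuffer.length) {
            n += deflater.deflate(compressBuffer, n, compressBuffer.length - n);
        }
        if (!deflater.finished()) {
            throw new IllegalStateException("compressBuffer too small");
        }
        return n;
    }

    /** Decompresses the first length bytes of compressed and returns the output length. */
    int uncompressBlock(byte[] compressed, int length, byte[] uncompressBuffer) {
        inflater.reset();
        inflater.setInput(compressed, 0, length);
        int n = 0;
        try {
            while (n < uncompressBuffer.length) {
                int got = inflater.inflate(uncompressBuffer, n, uncompressBuffer.length - n);
                if (got == 0) {
                    break; // stream finished, or no more input to consume
                }
                n += got;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed block", e);
        }
        return n;
    }

    public static void main(String[] args) {
        // Synthetic, highly compressible input; a real benchmark would use real data.
        byte[] input = new byte[64 * 1024];
        for (int i = 0; i < input.length; i++) {
            input[i] = (byte) (i % 31);
        }
        JdkBlockCodec codec = new JdkBlockCodec();
        byte[] compressed = new byte[input.length + 1024];
        int clen = codec.compressBlock(input, compressed);
        byte[] restored = new byte[input.length];
        int ulen = codec.uncompressBlock(compressed, clen, restored);
        System.out.println(input.length + " -> " + clen + " -> " + ulen
                + ", round trip ok: " + Arrays.equals(input, restored));
    }
}
```

Timing this block path next to the stream path would show how much of the gap comes from the stream plumbing rather than from the codec itself.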

Throughput:

Zlib/block -- 143.902 MBps

Gzip/JDK/stream -- 22.573 MBps

Does anyone have an idea why Zlib is so much faster (is it using all cores natively)? The code is expected to run single-threaded. Is anyone able to replicate a similar result?
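
One way to test the "all cores" suspicion from inside the benchmark is to compare wall-clock time with the CPU time of the calling thread and of the whole process over the same interval. The sketch below assumes a HotSpot/OpenJDK runtime, where com.sun.management.OperatingSystemMXBean is available, and uses a plain java.util.zip.Deflater loop as a stand-in workload. The class CpuVsWallCheck and its methods are hypothetical, and mbPerSecond assumes 1 MB = 1024 * 1024 bytes, which may differ from how the figures above were computed.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.zip.Deflater;

// Hypothetical helper for illustration; not part of the benchmark in the question.
class CpuVsWallCheck {
    /** Throughput in MB/s, assuming 1 MB = 1024 * 1024 bytes. */
    static double mbPerSecond(long bytes, long nanos) {
        return (bytes / (1024.0 * 1024.0)) / (nanos / 1e9);
    }

    /** Runs work once; returns {wall, calling-thread CPU, process CPU} in nanoseconds. */
    static long[] measure(Runnable work) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Assumption: HotSpot/OpenJDK, where this platform MXBean interface exists.
        com.sun.management.OperatingSystemMXBean os =
                ManagementFactory.getPlatformMXBean(com.sun.management.OperatingSystemMXBean.class);
        long wall0 = System.nanoTime();
        long thread0 = threads.getCurrentThreadCpuTime(); // -1 if CPU timing is disabled
        long process0 = os.getProcessCpuTime();           // -1 if unsupported
        work.run();
        long process1 = os.getProcessCpuTime();
        long thread1 = threads.getCurrentThreadCpuTime();
        long wall1 = System.nanoTime();
        return new long[] { wall1 - wall0, thread1 - thread0, process1 - process0 };
    }

    public static void main(String[] args) {
        byte[] block = new byte[1 << 20];
        for (int i = 0; i < block.length; i++) {
            block[i] = (byte) (i % 31);
        }
        byte[] out = new byte[block.length + 1024];
        Deflater deflater = new Deflater();
        int rounds = 50;
        // Stand-in workload; swap in the codec under test here.
        Runnable work = () -> {
            for (int r = 0; r < rounds; r++) {
                deflater.reset();
                deflater.setInput(block, 0, block.length);
                deflater.finish();
                while (!deflater.finished()) {
                    deflater.deflate(out, 0, out.length);
                }
            }
        };
        measure(work);            // warm-up so JIT compilation does not dominate
        long[] t = measure(work); // measured run
        deflater.end();
        System.out.printf("throughput: %.1f MB/s%n", mbPerSecond((long) rounds * block.length, t[0]));
        System.out.printf("thread CPU / wall:  %.2f%n", (double) t[1] / t[0]);
        System.out.printf("process CPU / wall: %.2f%n", (double) t[2] / t[0]);
    }
}
```

If the process CPU to wall ratio stays close to 1 and the calling thread accounts for nearly all of it, the work is effectively single-threaded. A ratio well above 1 would suggest other threads are busy, although JIT and GC threads also contribute, so the warm-up run matters.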
