Reputation: 539
I'm experiencing a strange issue when unzipping a file, for which I'm using the UTF-8 charset. I'm using the Guava library.
public static byte[] gzip(final CharSequence cs, final Charset charset) throws IOException {
    final ByteArrayOutputStream os = new ByteArrayOutputStream(cs.length());
    final GZIPOutputStream gzipOs = new GZIPOutputStream(os);
    gzipOs.write(charset.encode(CharBuffer.wrap(cs)).array());
    Closeables.closeQuietly(gzipOs);
    return os.toByteArray();
}
public static boolean gzipToFile(final CharSequence from, final File to, final Charset charset) {
    try {
        Files.write(StreamUtils.gzip(from, charset), to);
        return true;
    } catch (final IOException e) {
        // ignore
    }
    return false;
}
public static String gunzipFromFile(final File from, final Charset charset) {
    String str = null;
    try {
        str = charset.decode(ByteBuffer.wrap(gunzip(Files.toByteArray(from)))).toString();
    } catch (final IOException e) {
        // ignore
    }
    return str;
}
public static byte[] gunzip(final byte[] b) throws IOException {
    GZIPInputStream gzipIs = null;
    final byte[] bytes;
    try {
        gzipIs = new GZIPInputStream(new ByteArrayInputStream(b));
        bytes = ByteStreams.toByteArray(gzipIs);
    } finally {
        Closeables.closeQuietly(gzipIs);
    }
    return bytes;
}
And here is a small JUnit test. For testing I'm using a lorem ipsum in different languages like English, German, Russian, ... I'm compressing the original text to a file first, then uncompressing the file and comparing it with the original text:
@Test
public void gzip() throws IOException {
    final String originalText = Files.toString(ORIGINAL_IPSUM_LOREM, Charsets.UTF_8);
    // create temporary file
    final File tmpFile = this.tmpFolder.newFile("loremIpsum.txt.gz");
    // check if gzip write is OK
    final boolean status = StreamUtils.gzipToFile(originalText, tmpFile, Charsets.UTF_8);
    Assertions.assertThat(status).isTrue();
    Assertions.assertThat(Files.toByteArray(tmpFile)).isEqualTo(Files.toByteArray(GZIPPED_IPSUM_LOREM));
    // unzip it again
    final String uncompressedString = StreamUtils.gunzipFromFile(tmpFile, Charsets.UTF_8);
    Assertions.assertThat(uncompressedString).isEqualTo(originalText);
}
And the JUnit test fails with the following.
The debugger shows a difference between uncompressedString and originalText:
[-17, -69, -65, 76, 111, ... (omitted) ... , -117, 32, -48, -66, -48, -76, -47, -128, 32, -48, -78, -48, -75, -47, -127, 46, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... (omitted) ... , 0, 0, 0, 0]
... and the originalText is without the trailing zeroes:
[-17, -69, -65, 76, 111, ... (omitted) ... , -117, 32, -48, -66, -48, -76, -47, -128, 32, -48, -78, -48, -75, -47, -127, 46]
Any idea what might be wrong? Thank you :-)
Upvotes: 3
Views: 590
Reputation: 718798
I think that the problem is here:
charset.encode(CharBuffer.wrap(cs)).array()
The javadoc for array() says that it returns the backing array for the ByteBuffer. But the backing array can be bigger than the valid content of the buffer ... and I suspect that in this case it is.
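If you wanted to keep your current approach, one way to copy only the valid bytes out of the encoded buffer would be something along these lines (an untested sketch; the local variable names are just for illustration):
final ByteBuffer buf = charset.encode(CharBuffer.wrap(cs));
final byte[] encoded = new byte[buf.remaining()]; // only position..limit holds encoded data
buf.get(encoded);                                 // copy just the valid bytes
gzipOs.write(encoded);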
FWIW ... I doubt that explicit use of Buffer objects and ByteArray stream objects is helping performance much.
I suspect that you would be better off doing just this:
public static boolean gzipToFile(CharSequence from, File to, Charset charset) {
    try (FileOutputStream fos = new FileOutputStream(to);
         BufferedOutputStream bos = new BufferedOutputStream(fos);
         GZIPOutputStream gzos = new GZIPOutputStream(bos);
         OutputStreamWriter w = new OutputStreamWriter(gzos, charset)) {
        w.append(from);
        w.close();
        return true;
    } catch (final IOException e) {
        // ignore
    }
    return false;
}
(And the equivalent for reading, sketched below.)
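For example, the read side could look something like this (an untested sketch along the same lines, not what your current gunzipFromFile does):
public static String gunzipFromFile(File from, Charset charset) {
    StringBuilder sb = new StringBuilder();
    try (FileInputStream fis = new FileInputStream(from);
         GZIPInputStream gzis = new GZIPInputStream(fis);
         Reader r = new InputStreamReader(gzis, charset)) {
        char[] buf = new char[8192];
        int n;
        while ((n = r.read(buf)) != -1) {   // read decoded characters straight off the stream
            sb.append(buf, 0, n);
        }
        return sb.toString();
    } catch (IOException e) {
        // ignore
        return null;
    }
}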
Why? I suspect that the extra copy to the intermediate ByteArray streams is most likely negating the potential speed-up you gain by using a Buffer.
And besides, my gut feeling is that the compression / decompression steps are going to dominate anything else.
Upvotes: 5