Sriman S

Reputation: 87

GZIPInputStream.read(...) reads fewer bytes than the supplied length parameter

I have a .gz file which I'm trying to read using GZIPInputStream, a fixed number of bytes at a time. I am using GZIPInputStream.read(byte[] buf, int off, int len) (ref: doc) to do this. But the method reads only 397 bytes (the method's return value) when the supplied length is 490.

It didn't throw any exception.

I'm wondering in which cases the return value of the method will be less than the supplied length parameter.

What I understood from this question is that some of the bytes we want to read might be in the next chunk (which is not decompressed yet), and we might need to read again (though I'm not sure this interpretation is correct). But the documentation of GZIPInputStream.read(...) doesn't mention any such chunking.

I uncompressed the .gz file manually and tried reading the uncompressed file using RandomAccessFile.readFully(byte[] b) (ref: doc), which reads all 490 bytes properly.

I'm expecting the GZIPInputStream.read(...) method also to read all 490 bytes properly.

Upvotes: 0

Views: 319

Answers (1)

Stephen C

Reputation: 719386

I'm going to answer this question in the reverse order that you asked it.

I'm expecting the GZIPInputStream.read(...) method also to read all 490 bytes properly.

While you may expect this, the javadoc for read(buf, off, len) does not state that that will happen. What it actually says is this:

"Reads uncompressed data into an array of bytes. If len is not zero, the method will block until some input can be decompressed; otherwise, no bytes are read and 0 is returned.

Returns: the actual number of bytes read, or -1 if the end of the compressed input stream is reached."

It does NOT state ANYWHERE that read will return as many (available) bytes as will fit.

So, basically, it is your expectation that is wrong. You shouldn't write code that assumes that read will "properly" read all 490 bytes in one call.
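Instead, the standard idiom is to call read in a loop until you have the number of bytes you want or the stream ends. A minimal self-contained sketch (the class name and helper are mine, and it builds a small .gz payload in memory rather than reading your actual file):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipReadLoop {

    // Keeps calling read until 'len' bytes have been read or EOF is hit.
    // Returns the number of bytes actually read (may be < len only at EOF).
    static int readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n == -1) {
                break; // end of stream before the buffer was filled
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Build a 490-byte payload and gzip it in memory so the example
        // is self-contained.
        byte[] original = new byte[490];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) (i % 251);
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(original);
        }

        byte[] buf = new byte[490];
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            int n = readFully(in, buf, 0, buf.length);
            System.out.println("read " + n + " bytes"); // read 490 bytes
        }
    }
}
```

A single read call on the same stream might legitimately return 397; the loop simply asks again for the remaining 93.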


I'm wondering in which cases will the return value of the method be less than the supplied length parameter.

The code is complicated. It will be reading from the underlying stream in blocks. Then, when it inflates a stream, it may turn a small number of compressed bytes into a large number of uncompressed bytes. So, the InflaterInputStream layer has to deal with cases where advancing the input by one byte results in ... more bytes than will fit in the remainder of the user's buffer.

So (to my mind) it is unsurprising that they would take the simple (and efficient) approach of not entirely filling up the buffer, leaving the unconsumed (compressed) bytes for the next read call1.

And then there are mysterious "GZIP trailer members" which are dealt with at the GZIPInputStream layer.

Like I said ... it is complicated.

In short, there could be a number of cases where you may not get a number of bytes equal to len. But it won't help you to know all of the details, especially since they could depend on what version of Java you use!

What you need to know is that it is incorrect for your code to assume that it will get all of the bytes available in a single read call, even if the byte buffer that you provide is big enough. You already have evidence that that assumption is incorrect.

1 - Though it does have to deal with that scenario in the pathological case where you call read with len == 1.


Sounds like I'm using this stream for the wrong use case here.

No. The use-case is fine. The problem is that your code is using the GZIPInputStream.read method incorrectly.

For what it is worth, this read method is behaving roughly the same way as a read on a SocketInputStream would behave if the "other end" of the socket was writing sporadically. It gives you what is available now.

Upvotes: 4
