Reputation: 620

GZIPInputStream: Read first n bytes from decompressed file

I have a set of thousands of GZIP files which I'm accessing through HTTP. Each file may be up to a few hundred MB in size. I need to read the first few kilobytes (the header) of the file inside each of these compressed files.

This is my current approach:

URL url = new URL("http://example.com/file123.gz");
DataInputStream ds = new DataInputStream(new GZIPInputStream(url.openStream()));
byte[] header = new byte[5760];
ds.readFully(header);

What I need is to download the first 5760 bytes of the file inside this GZIP file, but I do not want Java to download the whole file (which is usually more than a few MB).

My question is - does Java first download the whole GZIP file and then decompress it, or does it download just the necessary amount of data to fill the byte[5760] buffer? How can I find how much data was actually downloaded from the HTTP server?

Upvotes: 1

Answers (3)

Stephen C

Reputation: 718698

Does Java first download the whole GZIP file and then decompress it, or does it download just the necessary amount of data to fill the byte[5760] buffer?

It is closer to the latter. Java does not read the entire file first. Instead, url.openStream() gives you a "socket stream" that reads data directly from the socket.

There is likely to be some data buffered in the kernel-side socket data structures, and possibly more in the GZIPInputStream, but it is definitely a bounded amount. So it is likely that the server will send more data than your application actually consumes, but it is unlikely that it will send entire (megabyte-sized) files.

How can I find how much data was actually downloaded from the HTTP server?

It is difficult to measure, and indeed even difficult to define. Based on the context, it seems that you are really interested in how much the server sends. The only practical way to measure that is on the server side, and even that is difficult. (If you don't really need to find this out, I recommend that you don't bother trying ...)
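
What you can measure on the client side is how many compressed bytes your own code pulled out of the stream. A minimal sketch, assuming a hypothetical CountingInputStream wrapper placed under the GZIPInputStream; it uses in-memory random data in place of the HTTP stream so that it runs offline. Note that this counts only what the application consumed, not what the server sent or what sits in socket buffers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressedBytesConsumed {

    /** Hypothetical helper: counts the bytes that readers above it pull from the wrapped stream. */
    static class CountingInputStream extends FilterInputStream {
        private long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }

        long getCount() {
            return count;
        }
    }

    /** Test data only: pseudo-random bytes, which are practically incompressible. */
    static byte[] randomBytes(int size) {
        byte[] raw = new byte[size];
        new Random(42).nextBytes(raw);
        return raw;
    }

    /** Compresses a byte array into GZIP format in memory. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    /** Returns how many compressed bytes were consumed to decompress headerSize bytes. */
    static long compressedBytesNeeded(byte[] gzipData, int headerSize) throws IOException {
        CountingInputStream counter =
                new CountingInputStream(new ByteArrayInputStream(gzipData));
        try (DataInputStream ds = new DataInputStream(new GZIPInputStream(counter))) {
            byte[] header = new byte[headerSize];
            ds.readFully(header);
        }
        return counter.getCount();
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = gzip(randomBytes(1 << 20)); // stands in for the remote .gz file
        long consumed = compressedBytesNeeded(gz, 5760);
        System.out.println("Compressed size: " + gz.length + " bytes");
        System.out.println("Consumed for a 5760-byte header: " + consumed + " bytes");
    }
}
```

With a real URL you would wrap url.openStream() in the counter instead of the ByteArrayInputStream. The HTTP client and the kernel may still have received more than the counter reports.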

Upvotes: 2

Ian Roberts

Reputation: 122364

If the web server supports byte-range requests, then you may be able to tell it to send just the first (say) 10 kB of compressed data, which should be enough to yield at least 5760 bytes after decompression:

URL url = new URL("http://example.com/file123.gz");
URLConnection connection = url.openConnection();
connection.setRequestProperty("Range", "bytes=0-9999");
DataInputStream ds = new DataInputStream(
                         new GZIPInputStream(connection.getInputStream()));
byte[] header = new byte[5760];
ds.readFully(header);

You may need to catch any exceptions thrown in this process and retry without the range header (though a server that doesn't understand it ought to just send the whole file anyway).
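
To see what GZIPInputStream does with such a partial response, here is a minimal offline sketch. It assumes that a honoured range request simply delivers the first N bytes of the .gz file, and simulates that by truncating an in-memory GZIP array. The class and method names are made up for illustration, and the random test data is a worst case because it barely compresses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class TruncatedGzipPrefix {

    /** Decompresses exactly headerSize bytes from a (possibly truncated) GZIP stream. */
    static byte[] readHeader(InputStream compressed, int headerSize) throws IOException {
        DataInputStream ds = new DataInputStream(new GZIPInputStream(compressed));
        byte[] header = new byte[headerSize];
        ds.readFully(header); // throws EOFException if the prefix is too short
        return header;
    }

    /** Compresses a byte array into GZIP format in memory. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Pseudo-random test data: nearly incompressible, so this is a worst case.
        byte[] raw = new byte[1 << 20];
        new Random(42).nextBytes(raw);
        byte[] gz = gzip(raw);

        // Simulates the body of a response to "Range: bytes=0-9999".
        byte[] prefix = Arrays.copyOf(gz, 10_000);
        byte[] header = readHeader(new ByteArrayInputStream(prefix), 5760);
        System.out.println("Header matches original: "
                + Arrays.equals(header, Arrays.copyOf(raw, 5760)));

        // A prefix that is too short to contain 5760 decompressed bytes.
        try {
            readHeader(new ByteArrayInputStream(Arrays.copyOf(gz, 1_000)), 5760);
        } catch (EOFException e) {
            System.out.println("Prefix too short: " + e.getMessage());
        }
    }
}
```

Against a real server, HttpURLConnection.getResponseCode() returning 206 (HttpURLConnection.HTTP_PARTIAL) rather than 200 indicates that the range was honoured.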

Upvotes: 0

f1sh

Reputation: 11934

You can't specify how much data will actually be downloaded.

The web server that serves your request will open the requested file and send the whole content (preceded by the HTTP response headers) through the TCP connection.

That means that the whole file will be sent to you, and you can't do anything about it except close the underlying connection at just the right time, which is not easy to do and will not work reliably. In other words: you read the 5760 bytes from the input stream (which, at this point, already contains more than those 5760 bytes!) and then close the stream and the connection, but that doesn't mean a whole lot more data wasn't received in the meantime.

To find out how much you actually received, you have to read your input stream completely and check its length.

Upvotes: 0
