Netto

Reputation: 284

Get GZIPped file attributes (like 'gzip -l', basically compression ratio)

I have a very large directory tree of gzipped files whose uncompressed size I need to calculate. Since this is more than 600GB (compressed), I believe that decompressing each file just to check its size isn't the right approach.

In a Unix shell, I can do this easily with the command gzip -l, which lists each file in a folder with its compression ratio and its compressed and uncompressed sizes.

However, the Java libraries I found for GZIP only provide streams for compression and decompression.

If the gzip command can retrieve this information without decompressing the file, I assume the data must be stored in some sort of header in the file. How can I access this information without decompressing the file?

Upvotes: 2

Views: 1803

Answers (2)

dkatzel

Reputation: 31658

According to the GZIP spec (RFC 1952), the last 4 bytes of a GZIP block (a "member" in the RFC's terms) hold the uncompressed size of the data, modulo 2^32. This value is stored in little endian. Most gzipped files consist of only 1 block, so these are the last 4 bytes of the file.

For example, I just gzipped a file whose uncompressed size was 29963246 bytes. The last 4 bytes in the gzip file are

EE 33 C9 01

which, when read as little endian (right to left), gives 0x01C933EE = 29963246.
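
To make the byte-order arithmetic concrete, here is a minimal sketch that decodes those same four bytes. The class and method names (IsizeDecodeExample, decodeIsize) are made up for illustration; only ByteBuffer, ByteOrder and Integer.toUnsignedLong come from the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class, for illustration only.
final class IsizeDecodeExample {

    // Interprets 4 bytes as an unsigned little-endian 32-bit value.
    static long decodeIsize(byte[] lastFour) {
        int raw = ByteBuffer.wrap(lastFour)
                            .order(ByteOrder.LITTLE_ENDIAN)
                            .getInt();
        // Widen to long so values of 2GB and above do not come out negative.
        return Integer.toUnsignedLong(raw);
    }

    public static void main(String[] args) {
        // The trailer bytes from the example above: EE 33 C9 01
        byte[] trailer = {(byte) 0xEE, 0x33, (byte) 0xC9, 0x01};
        System.out.println(decodeIsize(trailer)); // prints 29963246
    }
}
```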

Here's a quick and dirty way to get the size of the uncompressed file by only reading the last 4 bytes in little endian:

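import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;

File f = ...
try (RandomAccessFile ra = new RandomAccessFile(f, "r");
     FileChannel channel = ra.getChannel()) {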

        MappedByteBuffer fileBuffer = channel.map(MapMode.READ_ONLY, f.length()-4, 4);
        fileBuffer.load();
        
        ByteBuffer buf = ByteBuffer.allocate(4);
        buf.order(ByteOrder.LITTLE_ENDIAN);
        
        
        buf.put(fileBuffer);
        buf.flip();
        // Prints the uncompressed size.
        // getInt() reads the 4 bytes as a signed int, so an uncompressed
        // size between 2GB and 4GB would come back negative;
        // Integer.toUnsignedLong() converts it to the unsigned value.
        System.out.println(Integer.toUnsignedLong(buf.getInt()));
    }

EDIT

Note that this only works for a gzipped file containing a single gzip block whose uncompressed size is under 4GB. If a file has multiple gzip blocks, this returns only the size of the last block. The spec allots only 4 bytes for the size and defines the field as the uncompressed size modulo 2^32, so for data of 4GB or more the stored value wraps around.
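
Within those limits, the trailer read can be wrapped in a small helper and applied to a whole directory tree, as the question asks. This is only a sketch: the class name GzipTrailerSizes, the method names, and the assumption that the files end in ".gz" are mine, and the total is wrong for any file that has several blocks or holds 4GB or more of data.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper class, for illustration only.
final class GzipTrailerSizes {

    // Reads the last 4 bytes of the file as an unsigned little-endian value.
    static long readIsize(Path gz) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(gz.toFile(), "r")) {
            // A gzip file has at least a 10-byte header and an 8-byte trailer.
            if (raf.length() < 18) {
                throw new IOException("Too short to be a gzip file: " + gz);
            }
            raf.seek(raf.length() - 4);
            // readInt() is big endian, so reverse the bytes to get little endian.
            int littleEndian = Integer.reverseBytes(raf.readInt());
            return Integer.toUnsignedLong(littleEndian);
        }
    }

    // Sums the trailer value over every regular file ending in ".gz" under root
    // (the ".gz" suffix is an assumed naming convention).
    static long sumTree(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".gz"))
                        .mapToLong(p -> {
                            try {
                                return readIsize(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sum();
        }
    }
}
```

Each file costs one seek and a 4-byte read, so scanning 600GB of archives this way touches very little data.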

A more robust version would sum the uncompressed size of every gzip block in the file. The GZIP header does not record the length of the compressed data, though, so you can't simply seek to the end of each block. You would have to parse each block header, run the deflate data through an inflater to find where the block ends, read the 4-byte uncompressed size that follows, and then keep parsing any additional gzip blocks until you reach EOF.
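
If multi-block files or sizes of 4GB and above need to be handled, one fallback that avoids parsing block boundaries by hand is to stream the file through java.util.zip.GZIPInputStream and simply count the bytes. This is a different technique from the trailer read above: it does inflate the data (so it is much slower), but nothing is written to disk and only a small scratch buffer is held in memory. The class and method names below are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, for illustration only.
final class GzipStreamCount {

    // Inflates the stream, discards the output, and returns the byte count.
    // GZIPInputStream continues across concatenated gzip blocks, and the
    // long counter is not limited to 32 bits.
    static long countUncompressed(InputStream raw) throws IOException {
        long total = 0;
        byte[] scratch = new byte[64 * 1024];
        try (InputStream in = new GZIPInputStream(new BufferedInputStream(raw))) {
            int n;
            while ((n = in.read(scratch)) != -1) {
                total += n;
            }
        }
        return total;
    }
}
```

A caller could pass Files.newInputStream(path) for each file, perhaps only for files where the trailer value looks suspicious (for example, smaller than the compressed size).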

Upvotes: 3

MJSG

Reputation: 1025

Look at Apache Commons Compress, which has support for gzip. It also has a class, 'org.apache.commons.compress.compressors.gzip.GzipParameters', that might be of help.

Upvotes: 0
