Caharpuka

Reputation: 153

gzip -l returning incorrect values for uncompressed file size

I am trying to quickly estimate the number of lines in gzipped files. I do this by checking the uncompressed size of the file, sampling lines from the beginning of the file with zcat filename | head -n 100 (for instance), and dividing the uncompressed size by the average line size of this 100-line sample.

The problem is that the sizes reported by gzip -l are wrong. Mostly the uncompressed size seems too small, in some cases producing negative compression ratios. For example, in one case the compressed file is 1.8 GB and gzip -l lists the uncompressed size as 0.7 GB, when the file is actually 9 GB once decompressed. I tried decompressing and recompressing, but I still get the same uncompressed size.

gzip 1.6 on Ubuntu 18.04.3

Upvotes: 5

Views: 1707

Answers (1)

pmqs

Reputation: 3725

Below is the part of the gzip spec (RFC 1952) where it defines how the uncompressed size is stored in the gzip file.

ISIZE (Input SIZE)
    This contains the size of the original (uncompressed) input
    data modulo 2^32.

You are working with a gzip archive whose uncompressed size is larger than 2^32 bytes (4 GiB). The ISIZE field is only 32 bits wide, so gzip -l can report the true size only modulo 2^32, and the value it prints for this file will always be too small.
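To make the wrap-around concrete, here is a minimal Python sketch that reads the ISIZE field directly from the last four bytes of a gzip file, which RFC 1952 defines as a little-endian unsigned 32-bit integer. The helper names read_isize and wrapped_size are hypothetical, the 9 * 10**9 figure is only an illustrative stand-in for "about 9 GB", and the sketch assumes a single-member gzip file (with several concatenated members, the trailer describes only the last one).

```python
import gzip
import os
import struct
import tempfile


def read_isize(path):
    """Return the ISIZE trailer field: the last 4 bytes, little-endian unsigned."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack("<I", f.read(4))[0]


def wrapped_size(true_size):
    """Value a 32-bit ISIZE field would hold for a given true size in bytes."""
    return true_size % 2**32


# Demo on a small temporary file (hypothetical data, well below 4 GiB).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gz")
    with gzip.open(demo_path, "wb") as f:
        f.write(b"x" * 1000)
    print(read_isize(demo_path))  # 1000, since the size fits in 32 bits

# For an input of roughly 9 GB, the stored value has wrapped around twice.
print(wrapped_size(9 * 10**9))
```

For files smaller than 4 GiB the two values agree, which is why the problem only shows up on large archives.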

Note that this design limitation in the gzip file format doesn't cause any problems when uncompressing the archive. It affects only the size reported by gzip -l or gunzip -l.
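Because decompression itself is unaffected, one way to get a trustworthy size is to stream through the decompressed data and count bytes, at the cost of reading the whole file. The following is a rough Python sketch of that idea, not a replacement for gzip -l; the function name count_bytes_and_lines and the 1 MiB chunk size are assumptions chosen for illustration. Since the data is being read anyway, it also counts newlines, which gives the exact line count rather than an estimate.

```python
import gzip
import os
import tempfile


def count_bytes_and_lines(path, chunk_size=1 << 20):
    """Stream-decompress a gzip file; return (uncompressed bytes, newline count)."""
    total_bytes = 0
    total_lines = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
            total_lines += chunk.count(b"\n")
    return total_bytes, total_lines


# Demo on a small temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gz")
    with gzip.open(demo_path, "wb") as f:
        f.write(b"first\nsecond\nthird\n")
    print(count_bytes_and_lines(demo_path))  # (19, 3)
```

The data is processed in fixed-size chunks, so memory use stays bounded even for a multi-gigabyte file.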

Upvotes: 9
