Compression ratio for split tarballs

Question

I have a large tarball that was split into several files. The tarball is 100GB split into 12GB files.

tar czf - -T myDirList.txt | split --bytes=12GB - my.tar.gz.

Trying cat my.tar.gz.* | gzip -l returns

 compressed        uncompressed  ratio uncompressed_name
         -1                  -1   0.0% stdout

Trying gzip -l my.tar.gz.aa returns

 compressed        uncompressed  ratio uncompressed_name
12000000000          3488460670 -244.0% my.tar

Concatenating the files with cat my.tar.gz.* > my.tar.gz and running gzip -l on the result returns an even worse answer:

  compressed        uncompressed  ratio uncompressed_name
103614559077          2375907328 -4261.1% my.tar

What is going on here? How can I get the real compression ratio for these split tarballs?

Mark Adler · Accepted Answer

The gzip format stores the uncompressed size as the last four bytes of the stream. gzip -l uses those four bytes and the length of the gzip file to compute a compression ratio. To do so, gzip seeks to the end of the input to read the last four bytes. Note that four bytes can only represent values up to 2³² - 1, i.e. 4 GiB - 1.
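As a minimal sketch of what gzip -l reads, the trailer can be inspected by hand. The function name gzip_isize is hypothetical, and because od decodes in host byte order this assumes a little-endian machine (the gzip format stores the field little-endian).

```shell
# Print the last four bytes of a gzip file as an unsigned 32-bit integer.
# For a complete gzip file this is the uncompressed size modulo 2^32.
# Assumption: little-endian host, since od -tu4 uses host byte order.
gzip_isize() {
    tail -c 4 "$1" | od -An -tu4 | tr -d ' \n'
    echo
}
```

Run on a piece such as my.tar.gz.aa, this would simply decode whatever four bytes of compressed data happen to sit at the end of that piece.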

In your first case, you can't seek on piped input, so gzip gives up and reports -1.

In your second case, gzip is picking up four bytes of compressed data, effectively four random bytes, as the uncompressed size, which is necessarily less than 12,000,000,000, and so a negative compression ratio (expansion) is reported.

In your third case, gzip is reading the actual uncompressed length, but only modulo 2³². That value is necessarily less than 4 GiB, far below the 103 GB compressed size, so an even larger negative compression ratio is reported.
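The percentages in the question can be reproduced by hand, assuming the reported ratio is essentially 1 - compressed/uncompressed (any small correction gzip applies for header bytes is ignored here). The helper name gzip_ratio is made up for illustration.

```shell
# Approximate the ratio column of gzip -l:
# (1 - compressed/uncompressed), printed as a percentage.
gzip_ratio() {
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.1f%%\n", (1 - c / u) * 100 }'
}

gzip_ratio 12000000000 3488460670    # second case: one 12 GB piece
gzip_ratio 103614559077 2375907328   # third case: concatenated file
```

In the third case the true uncompressed size is 2375907328 + k·2³² for some unknown whole number k, which is why the trailer alone cannot give the real ratio.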

The second case is hopeless, but the compression ratio for the first and third cases can be determined using pigz, a parallel implementation of gzip that uses multiple cores for compression. pigz -lt decompresses the input without storing it, in order to determine the number of uncompressed bytes directly. (pigz -l is just like gzip -l, and would not work either. You need the t to test, i.e. decompress without saving.)
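For the split archive, that would look something like cat my.tar.gz.* | pigz -lt. Where pigz is not available, a rough equivalent is to decompress the stream and count the bytes. The sketch below assumes only gzip, wc and awk; the function name real_ratio is hypothetical.

```shell
# Compute the true compression ratio of one or more gzip pieces, given in
# order, by decompressing to a byte counter. Nothing is written to disk.
# Prints: compressed bytes, uncompressed bytes, ratio.
real_ratio() {
    compressed=$(cat "$@" | wc -c)
    uncompressed=$(cat "$@" | gzip -dc | wc -c)
    awk -v c="$compressed" -v u="$uncompressed" \
        'BEGIN { printf "%.0f %.0f %.1f%%\n", c, u, (1 - c / u) * 100 }'
}
```

A call such as real_ratio my.tar.gz.* reads every piece twice and performs a full decompression, so on a 100 GB archive it should be expected to take a while.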
