Wanderer

Reputation: 590

Compression ratio for split tarballs

I have a large tarball that was split into several files. The tarball is 100GB split into 12GB files.

tar czf - -T myDirList.txt | split --bytes=12GB - my.tar.gz.

Trying cat my.tar.gz.* | gzip -l returns

 compressed        uncompressed  ratio uncompressed_name
         -1                  -1   0.0% stdout

Trying gzip -l my.tar.gz.aa returns

 compressed        uncompressed  ratio uncompressed_name
12000000000          3488460670 -244.0% my.tar

Concatenating the files with cat my.tar.gz.* > my.tar.gz and running gzip -l on the result returns an even worse answer of

  compressed        uncompressed  ratio uncompressed_name
103614559077          2375907328 -4261.1% my.tar

What is going on here? How can I get the real compression ratio for these split tarballs?

Upvotes: 0

Views: 379

Answers (1)

Mark Adler

Reputation: 112422

The gzip format stores the uncompressed size as the last four bytes of the stream. gzip -l uses those four bytes and the length of the gzip file to compute a compression ratio. In doing so, gzip seeks to the end of the input to get the last four bytes. Note that four bytes can only represent up to 4 GB - 1.
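As a rough illustration of what gzip -l does with those four bytes, the trailer can be read by hand. This is only a sketch: gzip_isize is a hypothetical helper name (not part of gzip), and it assumes a POSIX shell with 64-bit arithmetic plus tail and od. The size field is stored little-endian, so the bytes are combined lowest first.

```shell
# Hypothetical helper: print the uncompressed-size field (the last four
# bytes of a gzip file) as an unsigned 32-bit little-endian integer.
gzip_isize() {
    # od prints the four trailing bytes as decimal values, e.g. "12 0 0 0";
    # the unquoted substitution splits them into $1..$4.
    set -- $(tail -c 4 "$1" | od -An -tu1)
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

Run on a complete gzip file this matches the "uncompressed" column of gzip -l. Run on a piece such as my.tar.gz.aa it just decodes whatever four bytes happen to sit at the end of that piece.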

In your first case, you can't seek on piped input, so gzip gives up and reports -1.

In your second case, gzip is picking up four bytes of compressed data, effectively four random bytes, as the uncompressed size, which is necessarily less than 12,000,000,000, and so a negative compression ratio (expansion) is reported.

In your third case, gzip is getting the actual uncompressed length, but that length modulo 2^32, which is necessarily much less than 103 GB, reporting an even more significant negative compression ratio.
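The numbers in the question can be reproduced with a little arithmetic. The sketch below assumes the ratio is computed as (uncompressed - compressed) / uncompressed, which is consistent with both outputs shown in the question. gzip_ratio is a hypothetical helper name, and the "true size" in the last line is an invented value chosen only to show the wrap-around, since the real size cannot be recovered from the trailer alone.

```shell
# Hypothetical helper: ratio as gzip -l appears to report it,
# given compressed and uncompressed byte counts.
gzip_ratio() {
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.1f%%\n", 100 * (u - c) / u }'
}

gzip_ratio 12000000000 3488460670     # second case: -244.0%
gzip_ratio 103614559077 2375907328    # third case: -4261.1%

# The trailer keeps only the size modulo 2^32. An illustrative (made-up)
# true size of 24 * 2^32 + 2375907328 bytes would be stored as:
echo $(( 105455122432 % (1 << 32) ))  # 2375907328
```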

The second case is hopeless, but the compression ratio for the first and third cases can be determined using pigz, a parallel implementation of gzip that uses multiple cores for compression. pigz -lt decompresses the input without storing it, in order to determine the number of uncompressed bytes directly. (pigz -l is just like gzip -l, and would not work either. You need the t to test, i.e. decompress without saving.)
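If pigz is not available, the same idea (decompress without saving, and count the bytes) can be approximated with plain gzip and wc. This is a substitute technique, not what pigz does internally. real_ratio is a hypothetical helper name, and the sketch assumes the pieces are passed in order and together form one complete gzip stream. It reads all of the data twice, so it will be slow on 100 GB.

```shell
# Hypothetical helper: print "compressed uncompressed ratio" for a gzip
# stream that has been split into the given pieces (passed in order).
real_ratio() {
    # Total size of the pieces; tr strips padding some wc versions add.
    compressed=$(cat "$@" | wc -c | tr -d ' ')
    # Decompress to a pipe and count bytes, so nothing is written to disk.
    uncompressed=$(cat "$@" | gzip -dc | wc -c | tr -d ' ')
    awk -v c="$compressed" -v u="$uncompressed" \
        'BEGIN { printf "%s %s %.1f%%\n", c, u, 100 * (u - c) / u }'
}
```

For the files in the question this would be called as real_ratio my.tar.gz.a? (the shell expands the glob in sorted order, which matches the order split wrote them in).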

Upvotes: 1
