Reputation: 590
I have a large tarball that was split into several files. The tarball is 100GB split into 12GB files.
tar czf - -T myDirList.txt | split --bytes=12GB - my.tar.gz.
Trying cat my.tar.gz.* | gzip -l
returns
compressed uncompressed ratio uncompressed_name
-1 -1 0.0% stdout
Trying gzip -l my.tar.gz.aa
returns
compressed uncompressed ratio uncompressed_name
12000000000 3488460670 -244.0% my.tar
Concatenating the files with cat my.tar.gz.* > my.tar.gz and then running gzip -l my.tar.gz
returns an even worse answer of
compressed uncompressed ratio uncompressed_name
103614559077 2375907328 -4261.1% my.tar
What is going on here? How can I get the real compression ratio for these split tarballs?
Upvotes: 0
Views: 379
Reputation: 112422
The gzip format stores the uncompressed size as the last four bytes of the stream. gzip -l uses those four bytes and the length of the gzip file to compute a compression ratio. In doing so, gzip seeks to the end of the input to get the last four bytes. Note that four bytes can only represent a size up to 4 GB - 1.
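For example, you can read that 32-bit field (the ISIZE trailer) directly; a minimal sketch, assuming a little-endian machine (so od's native byte order matches gzip's little-endian field) and using the concatenated my.tar.gz from the question:
tail -c 4 my.tar.gz | od -An -t u4    # prints the stored uncompressed size, modulo 2^32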
In your first case, you can't seek on piped input, so gzip gives up and reports -1.
In your second case, gzip is picking up four bytes of compressed data, effectively four random bytes, as the uncompressed size, which is necessarily less than 12,000,000,000, and so a negative compression ratio (expansion) is reported.
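The -244.0% is consistent with the ratio being computed as (uncompressed - compressed) / uncompressed from the two numbers gzip printed; a quick check of that arithmetic:
awk 'BEGIN { u = 3488460670; c = 12000000000; printf "%.1f%%\n", 100 * (u - c) / u }'    # prints -244.0%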
In your third case, gzip is getting the actual uncompressed length, but only that length modulo 2^32, which is necessarily much less than the 103 GB of compressed data, so an even more extreme negative compression ratio is reported.
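To illustrate the wraparound with a made-up number: if the true uncompressed size were, say, 110 GB (hypothetical; the real value is not known here), gzip -l could only report the remainder after dividing by 2^32:
echo $(( 110000000000 % (1 << 32) ))    # prints 2625817600, about 2.6 GB instead of 110 GB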
The second case is hopeless, but the compression ratio for the first and third cases can be determined using pigz, a parallel implementation of gzip that uses multiple cores for compression. pigz -lt decompresses the input without storing it, in order to determine the number of uncompressed bytes directly. (pigz -l is just like gzip -l, and would not work either. You need the t to test, i.e. decompress without saving.)
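Assuming pigz reads standard input when no file name is given (as gzip does), either of these should give the real sizes for your first and third cases:
cat my.tar.gz.* | pigz -lt
pigz -lt my.tar.gz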
Upvotes: 1