Reputation: 274
I have a huge ASCII file:
235M Apr 16 06:50 file
I did the following steps:
cat file > file_all
cat file >> file_all
470M Apr 16 06:51 file_all
The size of file_all is 2 * the size of file = 470M.
I used gzip to compress both file and file_all:
25M Apr 16 06:08 file.gz
49M Apr 16 06:25 file_all.gz
As I understand it, the compression algorithm is based on the following concept:
ZIP compression is based on repetitive patterns in the data to be compressed, and the compression gets better the longer the file is, as more and longer patterns can be found and used.
Question
Why can't I take advantage of the repetition? Is the 1 MB saving the only benefit?
P.S.: I did the same procedure with bz2 and observed the same behaviour (only the compressed sizes themselves differ). Thanks
Upvotes: 1
Views: 424
Reputation: 1384
Compressors with long-range lookback used to be restricted to 7z (as Adler mentioned) and lesser-known ones like lrzip. But with zstd becoming mainstream, a typical installation might already have the capability too.
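For comparison, a rough sketch of how those tools would be pointed at the file from the question (assuming lrzip, xz, and p7zip are installed; option spellings can vary between builds, and the large dictionaries need a few GB of RAM):
$ lrzip file_all                                    # long-range pre-pass, writes file_all.lrz
$ xz -9 -k file_all                                 # -9 only gives a 64 MiB dictionary, not enough for a 235M repeat distance
$ xz --lzma2=preset=9,dict=256MiB -c file_all > file_all.bigdict.xz   # dictionary large enough to span the repeat
$ 7z a -mx=9 -md=256m file_all.7z file_all          # 7z with a 256 MB dictionary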
To emulate your big ASCII file I used the enwik8 data. I ran the following commands:
cat enwik8 enwik8 > enwik82
zstd enwik8
zstd enwik82
zstd --long enwik8 -o enwik8.long.zst
zstd --long enwik82 -o enwik82.long.zst
And the file sizes are:
100000000 enwik8
35633676 enwik8.long.zst
36445475 enwik8.zip
35626935 enwik8.zst
200000000 enwik82
35644486 enwik82.long.zst
71251491 enwik82.zst
So long-range matching worked! (Do note that the default --long window size is 128 MiB, and you need to ask for --long=28 for a 256 MiB window.)
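Applied to the 470M file from the question, that would look roughly like the following; note that decompression also has to be told about the larger window, otherwise zstd will complain and suggest the flag:
$ zstd --long=28 file_all -o file_all.zst          # 256 MiB window spans the 235M repeat distance
$ zstd -d --long=28 file_all.zst -o file_all.out   # same window log needed when decompressing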
Some timing information:
$ time zstd --long enwik82 -f -o enwik82.long.zst
enwik82 : 17.82% (200000000 => 35644486 bytes, enwik82.long.zst)
real 0m0.911s
user 0m0.898s
sys 0m0.130s
$ time zstd enwik82 -f -o enwik82.zst
enwik82 : 35.63% (200000000 => 71251491 bytes, enwik82.zst)
real 0m1.208s
user 0m1.207s
sys 0m0.162s
Long-range matching apparently makes it faster too. The manual says it works okay with multi-threading, but I am too lazy to test it right now.
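If you want to check the multi-threading claim yourself, the invocation would be something like this (-T0 lets zstd pick the number of worker threads):
$ time zstd -T0 --long enwik82 -f -o enwik82.long.mt.zst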
Upvotes: 1
Reputation: 241971
That is indeed the expected outcome.
It is true that the zip compression algorithm depends on finding repeated sequences in the input. However, searching for every possible repetition would be expensive in both time and memory. Maintaining enough information to detect a quarter-of-a-gigabyte repetition would be prohibitively expensive, and no compressor that I know of even comes close to that size.
Instead, compressors look for repetitions within a sliding window of limited size. In the case of zip and gzip, which both use the deflate format, the window can be shrunk through library options, but its maximum is 32 KB, far less than a megabyte. (Highly repetitive inputs, such as files containing only zeros, still compress extremely well because the repeated sequence keeps fitting inside the window itself. But in general this does not help with repetitions that are far apart.)
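A quick way to see the window limit in action is to build a file out of essentially incompressible text, so that the only redundancy available is the big repeat (sizes here are illustrative; /dev/urandom means your exact numbers will differ):
$ head -c 100000 /dev/urandom | base64 > chunk    # ~132 KB of random-looking text
$ cat chunk chunk > chunk2                        # the repeat lies ~132 KB back, far outside the 32 KB window
$ gzip -c chunk  > chunk.gz
$ gzip -c chunk2 > chunk2.gz
$ ls -l chunk.gz chunk2.gz                        # chunk2.gz comes out roughly twice the size of chunk.gz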
Bzip2 uses a different strategy, but it, too, needs to limit the amount of input it analyses at once to avoid excessive runtime. As explained in the bzip2 manual, bzip2 breaks the input into blocks and compresses each block independently. The default (and maximum) block size is 900,000 bytes, which will not let it take advantage of multi-megabyte repeated sequences.
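The same kind of throwaway experiment shows bzip2's block limit (again, exact sizes will vary):
$ head -c 2000000 /dev/urandom | base64 > big     # ~2.7 MB, i.e. several 900 KB bzip2 blocks
$ cat big big > big2
$ bzip2 -9 -c big  > big.bz2
$ bzip2 -9 -c big2 > big2.bz2
$ ls -l big.bz2 big2.bz2                          # big2.bz2 is about twice big.bz2: the repeat never fits within one block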
Upvotes: 4