anati

Reputation: 274

Compressing redundant file data

I have a huge ASCII file:

235M Apr 16 06:50 file

I did the below steps:

cat file > file_all

cat file >> file_all

470M Apr 16 06:51 file_all

The size of file_all is 2 * the size of file = 470M.

I used the gzip command to compress file and file_all:
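The invocations were along the lines of the following (the -k flag just keeps the originals):

gzip -k file
gzip -k file_all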

25M Apr 16 06:08 file.gz

49M Apr 16 06:25 file_all.gz

Per my understanding, the compression algorithm works on the following principle:

ZIP compression is based on repetitive patterns in the data to be compressed, and the compression gets better the longer the file is, as more and longer patterns can be found and used.
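For example, a file that is nothing but one short line repeated compresses almost to nothing (a quick sanity check; any repeated text will do):

yes "hello world" | head -c 1048576 > rep.txt
gzip -k rep.txt
ls -l rep.txt rep.txt.gz    # the .gz file is a tiny fraction of the original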

Question

Why can't I take advantage of the repetition? Is the ~1 MB saving the only benefit?

P.S.: I did the same procedure with bzip2 and saw the same behaviour; only the compressed sizes themselves differ. Thanks

Upvotes: 1

Views: 424

Answers (2)

Mingye Wang

Reputation: 1384

Compressors with long-range lookback used to be restricted to 7z (like Adler mentioned) and lesser-known ones like lrzip. But with zstd becoming mainstream, a typical installation might well have the capability too.
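For instance, with 7z you would simply hand it a dictionary large enough to span the repeat; a sketch (not timed here), using -md to set the LZMA2 dictionary size:

7z a -m0=lzma2 -md=256m file_all.7z file_all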

To emulate your big ASCII file I used the enwik8 data. I ran the following commands:

cat enwik8 enwik8 > enwik82
zstd enwik8
zstd enwik82
zstd --long enwik8 -o enwik8.long.zst
zstd --long enwik82 -o enwik82.long.zst

And the file sizes are:

100000000   enwik8
35633676    enwik8.long.zst
36445475    enwik8.zip
35626935    enwik8.zst
200000000   enwik82
35644486    enwik82.long.zst
71251491    enwik82.zst

So long-range matching worked! (Do note that the default --long window size is 128 MiB; you need to ask for --long=28 to get a 256 MiB window.)
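For the 470 MB file in the question, the second copy starts 235 MB back, which is outside the default 128 MiB window, so there you would request the bigger window explicitly, something like:

zstd --long=28 file_all -o file_all.zst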

Some timing information:

$ time zstd --long enwik82 -f -o enwik82.long.zst
enwik82              : 17.82%   (200000000 => 35644486 bytes, enwik82.long.zst) 

real    0m0.911s
user    0m0.898s
sys 0m0.130s

$ time zstd enwik82 -f -o enwik82.zst
enwik82              : 35.63%   (200000000 => 71251491 bytes, enwik82.zst)     

real    0m1.208s
user    0m1.207s
sys 0m0.162s

Long range matching apparently makes it faster too. The manual says it works okay with multi-threading, but I am too lazy to test right now.
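If you want to try that combination yourself, it is just one more flag (untested here; -T0 tells zstd to use all available cores):

zstd -T0 --long enwik82 -f -o enwik82.long.mt.zst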

Upvotes: 1

rici

Reputation: 241971

That is indeed the expected outcome.

It is true that the zip compression algorithm depends on finding repeated sequences in the input. However, finding all repetitions would be computationally expensive, in both processing time and memory. Maintaining enough information to detect a repetition a quarter of a gigabyte back would be prohibitively expensive, and no compressor that I know of even comes close to that size.

Instead, compressors look for repetitions within a sliding window of limited size. In the case of zip (and gzip), the window size can be adjusted with a command-line parameter, but even the largest window (32 KB for deflate) is far less than a megabyte. (Highly repetitive inputs, such as files containing only zeros, compress much better because the repetitions already occur within the window. But in general this won't help with repeated sequences that are far apart.)
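You can see the window limit directly with a small experiment (assuming GNU head and a gzip recent enough to have -k): double a 1 MB block of incompressible data, and the doubled file gzips to roughly twice the size, because the second copy starts 1 MB back, far outside the window.

head -c 1048576 /dev/urandom > chunk    # 1 MB of incompressible data
cat chunk chunk > chunk2                # the repeat starts 1 MB back
gzip -k chunk chunk2
ls -l chunk.gz chunk2.gz                # chunk2.gz is roughly twice chunk.gz, not about equal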

Bzip2 uses a different strategy, but it, too, needs to limit the size of analysed input to avoid excessive runtime. As explained in the bzip2 manual, bzip2 breaks the input into chunks and works on each chunk independently. The default (and maximum) chunk size is 900,000 bytes, which will not allow it to take advantage of multi-megabyte repeated sequences.
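The block size is exactly what bzip2's -1 … -9 flags select (100 kB to 900 kB), so even the maximum block is tiny compared with the 235 MB repeat distance in the question:

bzip2 -k -9 file_all    # 900 kB blocks, the maximum bzip2 offers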

Upvotes: 4
