anati

Reputation: 274

Compressing redundant file data

I have a huge ASCII file:

235M Apr 16 06:50 file

I did the below steps:

cat file > file_all

cat file >> file_all

470M Apr 16 06:51 file_all

The size of file_all is 2 * the size of file = 470M.

I used the gzip command to compress file and file_all:
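The invocations were along the lines of the following (the -k flag just keeps the originals):

gzip -k file
gzip -k file_all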

25M Apr 16 06:08 file.gz

49M Apr 16 06:25 file_all.gz

Per my understanding, the compression algorithm works on the following principle:

ZIP compression is based on repetitive patterns in the data to be compressed, and the compression gets better the longer the file is, as more and longer patterns can be found and used.
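For example, a file that is nothing but one short line repeated compresses almost to nothing (a quick sanity check; any repeated text will do):

yes "hello world" | head -c 1048576 > rep.txt
gzip -k rep.txt
ls -l rep.txt rep.txt.gz    # the .gz file is a tiny fraction of the original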

Question

Why can't I take advantage of the repetition? Is the ~1 MB saving the only benefit?

P.S.: I did the same procedure with bzip2 and saw the same behaviour; only the compressed sizes themselves differ. Thanks

Upvotes: 1

Views: 424

Answers (2)

Mingye Wang

Reputation: 1384

Compressors with long-range lookback used to be restricted to 7z (like Adler mentioned) and lesser-known ones like lrzip. But with zstd becoming mainstream, a typical installation might well have the capability too.
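For instance, with 7z you would simply hand it a dictionary large enough to span the repeat; a sketch (not timed here), using -md to set the LZMA2 dictionary size:

7z a -m0=lzma2 -md=256m file_all.7z file_all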

To emulate your big ASCII file I used the enwik8 data. I ran the following commands:

cat enwik8 enwik8 > enwik82
zstd enwik8
zstd enwik82
zstd --long enwik8 -o enwik8.long.zst
zstd --long enwik82 -o enwik82.long.zst

And the file sizes are:

100000000   enwik8
35633676    enwik8.long.zst
36445475    enwik8.zip
35626935    enwik8.zst
200000000   enwik82
35644486    enwik82.long.zst
71251491    enwik82.zst

So long-range matching worked! (Do note that the default --long window size is 128 MiB; you need to ask for --long=28 to get a 256 MiB window.)
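For the 470 MB file in the question, the second copy starts 235 MB back, which is outside the default 128 MiB window, so there you would request the bigger window explicitly, something like:

zstd --long=28 file_all -o file_all.zst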

Some timing information:

$ time zstd --long enwik82 -f -o enwik82.long.zst
enwik82              : 17.82%   (200000000 => 35644486 bytes, enwik82.long.zst) 

real    0m0.911s
user    0m0.898s
sys 0m0.130s

$ time zstd enwik82 -f -o enwik82.zst
enwik82              : 35.63%   (200000000 => 71251491 bytes, enwik82.zst)     

real    0m1.208s
user    0m1.207s
sys 0m0.162s

Long range matching apparently makes it faster too. The manual says it works okay with multi-threading, but I am too lazy to test right now.
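If you want to try that combination yourself, it is just one more flag (untested here; -T0 tells zstd to use all available cores):

zstd -T0 --long enwik82 -f -o enwik82.long.mt.zst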

Upvotes: 1

rici

Reputation: 241971

That is indeed the expected outcome.

It is true that the zip compression algorithm depends on finding repeated sequences in the input. However, finding all repetitions would be computationally expensive, in both processing time and memory. Maintaining enough information to detect a repetition a quarter of a gigabyte back would be prohibitively expensive, and no compressor that I know of even comes close to that size.

Instead, compressors look for repetitions within a sliding window of limited size. In the case of zip (and gzip), the window size can be adjusted with a command-line parameter, but even the largest window (32 KB for deflate) is far less than a megabyte. (Highly repetitive inputs, such as files containing only zeros, compress much better because the repetitions already occur within the window. But in general this won't help with repeated sequences that are far apart.)
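You can see the window limit directly with a small experiment (assuming GNU head and a gzip recent enough to have -k): double a 1 MB block of incompressible data, and the doubled file gzips to roughly twice the size, because the second copy starts 1 MB back, far outside the window.

head -c 1048576 /dev/urandom > chunk    # 1 MB of incompressible data
cat chunk chunk > chunk2                # the repeat starts 1 MB back
gzip -k chunk chunk2
ls -l chunk.gz chunk2.gz                # chunk2.gz is roughly twice chunk.gz, not about equal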

Bzip2 uses a different strategy, but it, too, needs to limit the size of analysed input to avoid excessive runtime. As explained in the bzip2 manual, bzip2 breaks the input into chunks and works on each chunk independently. The default (and maximum) chunk size is 900,000 bytes, which will not allow it to take advantage of multi-megabyte repeated sequences.
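The block size is exactly what bzip2's -1 … -9 flags select (100 kB to 900 kB), so even the maximum block is tiny compared with the 235 MB repeat distance in the question:

bzip2 -k -9 file_all    # 900 kB blocks, the maximum bzip2 offers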

Upvotes: 4
