Reputation: 113
I know that it is impossible to compress a file infinitely by repeatedly applying one algorithm.
But is it possible to keep compressing a file by repeatedly applying different algorithms, no matter what the structure of the file is?
Upvotes: 1
Views: 510
Reputation: 65046
Yes, you can - to the point where the file eventually becomes empty - but of course you'd just be moving the entropy from the file itself into the choice of compression algorithm (and into the record of which compression algorithm was chosen).
For example, let me define the following compressor, which looks at the first two bits in the file, and encodes those bits and then outputs the remainder of the file unchanged.
Algorithm compress00: if the first two bits in the file are 00, replace those two bits with a single 0 bit - in this case, the file is 1 bit shorter. Otherwise, prepend a 1 to the file - in this case, the file is 1 bit longer.
This algorithm is easily invertible (i.e., the output can be unambiguously decompressed), and it either shortens or lengthens the file by 1 bit. Imagine a family of 4 compressors: compress00, compress01, compress10, compress11, which all have the same behavior except that, for example, compress01 shortens the file if it starts with 01 and lengthens it otherwise, and so on for the other compressors.
Note now that every file of at least two bits either starts with 00, 01, 10, or 11 - so at every stage in your process of repeated compression you can choose the algorithm that compresses the file by 1 bit. Repeated application of this process will reduce the file down to a single bit (and you can define some additional behavior for 1 bit files to get down to 0 bits).
Of course, a very minor downside of this method is that to decompress the file you need to remember which compressor you chose at each step.
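For illustration only, here is a minimal Python sketch of this scheme; it works on strings of '0'/'1' characters rather than a real bit stream, and the function names compress and decompress are invented for the example:

```python
# Illustrative sketch of the compress00/01/10/11 family described above.
# It works on strings of '0'/'1' characters instead of a real bit stream.

def compress(bits: str, prefix: str) -> str:
    """If `bits` starts with the 2-bit `prefix`, replace those two bits with a
    single '0' (1 bit shorter); otherwise prepend a '1' (1 bit longer)."""
    if bits.startswith(prefix):
        return "0" + bits[2:]
    return "1" + bits

def decompress(bits: str, prefix: str) -> str:
    """Invert compress() for the same `prefix`."""
    if bits.startswith("0"):
        return prefix + bits[1:]
    return bits[1:]

data = "1101001110"
choices = []                      # the record of which compressor was chosen
while len(data) >= 2:
    prefix = data[:2]             # pick the compressor that shrinks this file
    choices.append(prefix)
    data = compress(data, prefix)

print(data, choices)              # 1 bit left; the entropy now lives in `choices`

for prefix in reversed(choices):  # replay the choices backwards to decompress
    data = decompress(data, prefix)
print(data)                       # the original '1101001110' again
```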
Upvotes: 2
Reputation: 44146
No.
The unit we use to "weigh" a file is the byte.
The byte is just a multiple of the more fundamental unit, the bit.
Though the word "bit" comes from "Binary digit", it's actually a measurement unit, just like meters, joules, seconds and so on.
A bit measures the minimum measurable quantity of information, just like the elementary charge e is the smallest charge.
When we say that a file has size 1 MiB we are saying that there are 2^(20+3) = 8,388,608 bits of information.
Now if you want to go up a hill 20 m tall and you weigh 50 kg, you need at least E = mgh = 50 · 9.81 · 20 = 9810 J.
No matter what you do, you need at least that much energy or you won't get there¹.
You can of course spend more energy; 9810 J is just the minimum.
The same thing applies to information. A file carries a message that needs at least X bits of information to be understood unambiguously.
Most files contain more than that minimal quantity of information because files are made to be easy to process.
This is like an English speaker saying "I am going out" versus "going out". Both give the same message, but one is easier to process yet longer.
So intuitively we can remove the redundancy of a file, thereby reducing its size, until we reach the minimum X.
Removing more bits after we have reached the minimum would mean removing useful information, preventing the original file from being recovered (read: unzipped).
This is actually done with lossy algorithms, for example MP3, JPEG and so on.
This intuition that strings cannot be infinitely compressed is easily proven.
We follow the approach of Section 6.4 of Introduction to the Theory of Computation by Sipser.
We assign a weight to a string s in this way: we consider all the algorithms A that, when run on some input string M_A, output s.
We encode each pair of A and M_A as a single string and set K(s) to the length of the shortest such string.
K(s) is called the minimal descriptor of s and represents the smallest amount of information needed to generate s.
If K(s) is less than the length of s, then s is said to be compressible.
We now show that there are non-compressible strings.
Suppose s has length n. There are 2^n possible such strings.
If s is compressible then it has a minimal descriptor of length at most n-1.
There are 1 + 2 + 4 + 8 + ... + 2^(n-1) = 2^n - 1 such descriptors.
Since each descriptor defines a string uniquely and there are fewer descriptors than strings of length n, some string of length n is uncompressible.
Since n is arbitrary, uncompressible strings of every length exist.
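To make the counting step concrete, here is a tiny Python check (illustrative only) that for each n there are exactly 2^n - 1 descriptors of length at most n-1, one fewer than the 2^n strings of length n:

```python
# Pigeonhole check of the counting argument: for every n there are 2**n strings
# of length n but only 2**0 + 2**1 + ... + 2**(n-1) = 2**n - 1 possible
# descriptors of length at most n-1, so some n-bit string is uncompressible.
for n in range(1, 11):
    strings = 2 ** n
    shorter_descriptors = sum(2 ** k for k in range(n))   # lengths 0 .. n-1
    assert shorter_descriptors == strings - 1
    print(n, strings, shorter_descriptors)
```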
So in short, if we keep compressing we eventually reach an uncompressible file.
In practice a good compression algorithm should remove most of the redundancy at the first step without adding a significant amount of information.
That's why zipping a JPEG, an already compressed format, barely shrinks it, unlike zipping a text file.
This also explains why encrypted files, which appear random and thus have essentially no redundancy, cannot be compressed very well.
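As a rough illustration of these last two points, here is a sketch using Python's standard zlib module (the specific inputs are made up): redundant text shrinks dramatically, while random bytes, standing in for encrypted data, barely shrink at all.

```python
import os
import zlib

redundant = b"going out " * 10_000         # highly redundant text
random_like = os.urandom(len(redundant))   # stands in for encrypted data

for label, blob in [("redundant text", redundant), ("random bytes", random_like)]:
    packed = zlib.compress(blob, 9)
    print(f"{label}: {len(blob)} -> {len(packed)} bytes")
# The text shrinks to a small fraction of its size; the random bytes stay
# essentially the same size (or even grow slightly).
```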
¹ We deal only with simple Newtonian physics here.
Upvotes: 2
Reputation: 5719
You can, but it does not make much difference. Most compression algorithms work by reducing redundancy; some compress more thoroughly and are therefore slower.
When you apply the first algorithm, it removes most of that redundancy, so applying a second one does not make much difference.
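A quick way to see this (just a sketch using Python's zlib module; the input data is invented) is to compress the same data twice and compare sizes:

```python
import zlib

data = b"I am going out. I am going out. " * 3_000   # moderately redundant input

once = zlib.compress(data, 9)
twice = zlib.compress(once, 9)   # apply a second compressor to the output

print(len(data), len(once), len(twice))
# The first pass removes most of the redundancy; re-compressing `once` saves
# little or nothing (it may even grow slightly), because its bytes already
# look close to random.
```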
Check here for more info.
Upvotes: 0