Reputation: 31
Lately I've been facing a problem that has me quite confused. I want to compress a sequence so that no information is lost. For example:
a,a,a,b --> a,b
a,b,a,a,c --> a,b,a,a,c (it can't be compressed to a,b,a,c because in this way we lose a,a)
Is there any algorithm to do such a thing? What is the name of this problem? Is it compression, or something else? I would really appreciate any help. Thanks in advance
Upvotes: 3
Views: 2912
Reputation: 3460
Another good algorithm is Lempel–Ziv–Welch (LZW).
I found this simple JavaScript LZW function, from the magicians at 140 bytes of javascript, quite marvellous:
function compress(a){
  for (
    var b = a + "Ā", // Append first "illegal" character (charCode === 256) as a terminator.
        c = [],      // dictionary: maps each known phrase to its code
        d = 0,       // number of codes emitted so far (new phrase codes are d + 255)
        e = d,       // iterator over the input string
        f = c,       // current phrase ("w" in textbook LZW); the empty array coerces to ""
        g = c,       // result (shares the dictionary array; output codes land in its numeric slots)
        h;           // next input character ("c" in textbook LZW)
    h = b.charAt(e++);
  )
    c[h] = h.charCodeAt(), // Seed the dictionary with every single character seen ...
    // ... then: if phrase + char is already known, extend the phrase; otherwise emit the
    // code for the phrase, register phrase + char as a new code, and restart from char.
    f = 1 + c[a = f + h] ? a : (g[d++] = c[f], c[a] = d + 255, h);
  return g
}
console.log(compress("aaabbcccccb")) // numeric entries: 97, 256, 98, 98, 99, 260, 260, 98
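If you also need to get the original string back, here is a plain (non-golfed) decompressor sketch of my own, not part of the 140-byte original, that inverts the code stream produced above: codes below 256 are literal character codes, and new phrase codes are handed out from 256 upward in the order the encoder created them.
// Inverse of compress() above: rebuild the dictionary while reading the codes.
function decompress(codes) {
  var dict = {},                          // code -> phrase, for codes >= 256
      next = 256,                         // next code the encoder would have assigned
      w = String.fromCharCode(codes[0]),  // previous phrase
      out = w, i, k, entry;
  for (i = 1; i < codes.length; i++) {
    k = codes[i];
    entry = k < 256 ? String.fromCharCode(k)
          : dict[k] !== undefined ? dict[k]
          : w + w.charAt(0);              // classic "code not yet in the dictionary" case
    out += entry;
    dict[next++] = w + entry.charAt(0);   // register the phrase the encoder created at this step
    w = entry;
  }
  return out;
}
console.log(decompress(compress("aaabbcccccb"))) // "aaabbcccccb"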
Upvotes: 3
Reputation: 1
What is the compression limit of a sequence?
Two answers can be given to this question:
Determining the compression limit of a sequence is an unsolved problem. Furthermore, it has been shown that this limit cannot be computed.
Shannon defined a subproblem of great practical relevance, in which it is possible to define a theoretical limit (source coding theorem). This subproblem is purely probabilistic and is called source coding.
To understand these two answers, it is essential to introduce the problem of information transmission in its most general form. For information to be transmitted between two points, the two parties must share a communication language. This requirement is necessary: without a communication language, the transmitted message is not understood and therefore carries no information. Consequently, the information transmission problem, in its most general form, studies the transfer of information, in any form, between two points that share a communication language.
The problem with this definition is that the concept of "information in any form" is difficult to formalize mathematically. For this reason, it is essential to introduce a first simplification of the problem, in which the information is defined as a numerical sequence. In this way, a subproblem is obtained, in which the concept of information is well-defined and can be formalized from the mathematical point of view.
Consequently, the two elements necessary for the transfer of information are: a compressed sequence which carries the information and a communication language known by both the encoder and the decoder.
Compressed sequence: represents any information derived from the sequence to be transmitted. From a practical point of view, any information obtained from the sequence that helps the decoder recover the original message must be considered part of the compressed message.
Communication language: represents a set of rules that allows communication and therefore the transfer of information between two subjects. From a practical point of view, any information present before the generation of the message can be considered part of the communication language.
In his famous article “A Mathematical Theory of Communication”, Shannon introduces a further modification of the information transfer problem, which yields a subproblem in which it is possible to define a theoretical compression limit. For this purpose, Shannon introduces the source that generates the message. Furthermore, the coding scheme that converts symbols into codewords is built from information about the source, not from information obtained from the message. Therefore, since the coding scheme is defined before the message is generated, it can be considered an integral part of the communication language. Shannon then formalizes the concept of information: he develops a quantity that uniquely characterizes the message transmitted in the presence of an encoding scheme. Shannon's solution is to consider information as a measure of variation. The function that performs this measurement is called entropy, H. The following sentence, taken from “A Mathematical Theory of Communication”, conveys Shannon's idea of entropy.
“In the limiting case where one probability is unity (certainty) and all the others zero (impossibility), then H is zero (no uncertainty at all - no freedom of choice - no information).”
With regard to the subproblem he defined, Shannon manages to define a theoretical compression limit by proving the source coding theorem. This theorem shows that entropy defines the minimum average length of the codewords that replace the symbols in the compressed message. Thus, given a source X of i.i.d. random variables that generates a message of length N, the compressed message cannot, on average, be shorter than NH(X). Since the encoding scheme is defined before the message is generated, it is not part of the compressed message.
Application of the theory developed by Shannon when the source is not known
When the source is not known, it is possible to calculate the zero-order empirical entropy H0(m), in which the frequencies are obtained from the message m to be transmitted. In this case, since the symbol frequencies of the sequence are used, the coding scheme is developed after the message has been generated and therefore must be part of the compressed message. Consequently, the compressed message is defined by NH0(m) + coding scheme. Since the value of H0(m) is very close to the value of H(X), the coding scheme represents a redundancy, which is also called the inefficiency of entropy coding. In practice, not knowing the source that generates the message introduces a redundancy into the compressed message. Furthermore, the length of the coding scheme depends on the language used, so it cannot be defined precisely. Consequently, it is not possible to define a limit on the length of the compressed message. Therefore, using the formalism introduced by Shannon, we have shown that the compression limit of a sequence whose source is unknown is not computable.
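To make the zero-order empirical entropy concrete, here is a small sketch (the function name is mine, purely for illustration) that computes H0(m) from the symbol frequencies of a message, using base-2 logarithms, together with the corresponding value NH0(m):
// Zero-order empirical entropy of a message m, in bits per symbol (base-2 logs).
function h0(m) {
  var counts = {}, n = m.length, h = 0, i, s, p;
  for (i = 0; i < n; i++) counts[m[i]] = (counts[m[i]] || 0) + 1;
  for (s in counts) {
    p = counts[s] / n;        // empirical frequency of symbol s in m
    h -= p * Math.log2(p);    // H0(m) = -sum over symbols of p * log2(p)
  }
  return h;
}
// NH0(m): length of the entropy-coded part only; the coding scheme is not included.
console.log("aab".length * h0("aab")); // ≈ 2.755 bits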
The value NH(X) represents the compressed message. So entropy, in this case, defines the information of the message in a more general sense, not just the Shannon information. Unfortunately, this method does not cover the general case of information transmission; it covers a subproblem in which both the encoder and the decoder know the source that generates the message.
The value NH0(m) does not represent the compressed message. The zero-order empirical entropy represents only the average value of the Shannon information of a single symbol. Consequently, the zero-order empirical entropy has a less general meaning than the entropy of the source. However, in this case, the problem of information transmission is being addressed in a much more general way than in source coding.
To demonstrate what has just been said, consider the following example: take a uniform source X of i.i.d. random variables which generates a message of length N. Of course, such a sequence cannot, on average, be compressed, so the compressed message must have an average length greater than or equal to N. In the first case, the encoder and the decoder know the source and can therefore apply source coding. The result is that all the messages have length NH(X), and since H(X)=1 (using the alphabet size as the base of the entropy) the compressed message has length N. So, in this case, we reach the theoretical compression limit. In the second case, neither the encoder nor the decoder knows the source. Consequently, the encoder must use the frequencies of the symbols in the message to calculate the entropy, obtaining the zero-order empirical entropy H0(m). The source is uniform, but only a small fraction of the generated messages have a uniform symbol distribution. Therefore, the average value of the zero-order empirical entropy H0(m) of the messages will be less than H(X). Consequently, if we do not consider the coding scheme as part of the compressed message, we obtain an illogical result: NH0(m) < N. In practice, we would have compressed a random sequence. The error stems from the fact that, in this second case, the coding scheme was obtained after the message was generated, and therefore it must be part of the compressed message. In conclusion, to obtain a correct result, the compressed string must be defined by NH0(m) + coding scheme, whose length, on average, is at least N. Therefore, we have a redundancy due to not knowing the source.
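A quick simulation illustrates the second case (a sketch with a hypothetical helper name, reusing the h0 function from the sketch above): drawing messages from a uniform binary source, where base 2 equals the alphabet size, the average NH0(m) comes out slightly below N, which only looks like compression because the coding scheme has been ignored.
// Sketch: average N*H0(m) over random messages from a uniform binary source,
// to be compared with the true limit N*H(X) = N.
function avgEmpiricalEntropy(N, trials) {
  var total = 0, t, i, m;
  for (t = 0; t < trials; t++) {
    m = "";
    for (i = 0; i < N; i++) m += Math.random() < 0.5 ? "0" : "1";
    total += N * h0(m);   // reuses h0() from the previous sketch
  }
  return total / trials;
}
console.log(avgEmpiricalEntropy(16, 10000)); // typically ≈ 15.3, i.e. below N = 16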
One of the most important points covered is that source coding is a subproblem of the problem of information transmission in its most general form. Consequently, when the source is not known, only the zero-order empirical entropy can be calculated, a parameter with a much more limited meaning than the entropy of the source. Indeed, the value obtained by multiplying the zero-order empirical entropy by the message length, NH0(m), does not represent the compressed message. In this case, the coding scheme must also be part of the compressed message, since it was defined after the message was generated and therefore cannot be included in the communication language. This is an important change of point of view. In fact, it used to be considered perfectly normal, even when the source was not known, to take the compressed message to be only NH0(m), without counting the coding scheme, as is done with source coding.
This change of point of view became necessary with the arrival of a new theory that makes it possible to reduce the redundancy generated when the source is not known. This theory can only be applied when the alphabet size is greater than or equal to 3. Its purpose is to make the encoded sequence NH0(m) and the coding scheme interact with each other more efficiently.
Now, I will explain the basis of this theory. If we have to perform entropy coding on a sequence of length N over an alphabet A, whose source we do not know, the only information we have is that the sequence is one of |A|^N possible sequences. Thus, the only possible way for a transform to improve, on average, the length of the compressed sequence (encoded sequence + coding scheme) under entropy coding is to map the set of all possible sequences onto a new set of the same size, composed of sequences that on average can be compressed into less space. In this way, even if we don't know the source that generates the sequence, we know that if we apply this transform we can, on average, obtain a compressed message shorter than the compressed message of the untransformed sequence.
The set having these characteristics is a set of size |A|^N composed of sequences of length N+K. Increasing the length of the sequences also increases the size of the set of all possible sequences, which becomes |A|^(N+K). From this larger set it is therefore possible to select a subset of size |A|^N composed of the sequences having the smallest values of zero-order empirical entropy.
In this way, we obtain the following result: given a message m of length N generated by an unknown source X of i.i.d. random variables, applying the described transform f(m), we obtain on average:
NH(X) < Avg(NtH0(f(m)) + coding scheme) < Avg(NH0(m) + coding scheme)
with Nt > N, where:
Avg(NH0(m) + coding scheme) = mean length of the encoded sequence m plus its coding scheme
Avg(NtH0(f(m)) + coding scheme) = mean length of the encoded transformed sequence f(m) plus its coding scheme
NH(X) is the compression limit, which is not known, since we have set the condition that the source is not known. As mentioned at the beginning, when the source is not known, it is not possible, in the current state of knowledge, to define a compression limit.
Now, let us verify this result experimentally. Take sequences of length 3 over an alphabet of size 3: there are 27 possible sequences. If we increase the length to 4 and keep the alphabet at 3, the possible sequences become 81. Of these 81, we select the 27 with the smallest value of H0(m). In this way, the two sets have the same number of elements, and we can define a one-to-one relationship between the sequences of the two sets. The following table shows the 27 elements of the two sets: in the first column the message m, in the second column NH0(m), in the third column the transformed message f(m), and in the fourth column NtH0(f(m)) of the transformed message f(m).
The average value of the zero-order empirical entropy multiplied by the string length N=3 of the messages m is Avg(NH0(m)) = 2.893, with N=3. The average value of the zero-order empirical entropy multiplied by the string length Nt=4 of the transformed messages f(m) is Avg(NtH0(f(m))) = 2.884, with Nt=4. Thus, although the transformed messages are longer, the average value of the zero-order empirical entropy multiplied by the message length is, on average, smaller when the transform is applied. Consequently, even if we do not know the source, we know that by applying the transform we obtain, on average, a reduction in the value of the zero-order empirical entropy multiplied by the message length. Therefore, the length of the encoded message (symbols replaced by codewords) decreases.
Now let's look at the coding scheme. In this case, making an evaluation is more difficult because the length of the coding scheme depends on the compression method used. But we know that, from a theoretical point of view, the parameter that most influences the length of the coding scheme is the size of the alphabet. Therefore, we calculate the average alphabet size |A| in the two sets. The mean value of the alphabet size |A| for the messages m in the first column of the table is Avg(|A|) = 2.1. The mean value of the alphabet size |A| for the transformed messages f(m) in the third column of the table is Avg(|A|) = 1.89. The result obtained does not hold only for the case reported here: whatever the size of the alphabet, the transformed set, being composed of longer sequences, always has a greater number of type classes with fewer than |A| symbols. By type class we mean a set of strings that all have the same symbol frequencies; for example, the strings 113 and 131 belong to the same type class, since in both the frequency of symbol 1 is 2/3 and the frequency of symbol 3 is 1/3.
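The figures above can be reproduced with a short sketch (the helper names are mine; it reuses h0 from the earlier sketch, and base-2 logarithms give the reported 2.893 and 2.884): enumerate all 27 messages of length 3 and all 81 sequences of length 4 over a 3-symbol alphabet, keep the 27 length-4 sequences with the smallest NtH0(f(m)), and compare average entropies and alphabet sizes.
// Enumerate every string of the given length over the alphabet.
function allStrings(alphabet, length) {
  if (length === 0) return [""];
  var shorter = allStrings(alphabet, length - 1), out = [], i, j;
  for (i = 0; i < shorter.length; i++)
    for (j = 0; j < alphabet.length; j++) out.push(shorter[i] + alphabet[j]);
  return out;
}
function distinctSymbols(m) { return new Set(m).size; } // alphabet size actually used by m
function avg(xs) { return xs.reduce(function (a, b) { return a + b; }, 0) / xs.length; }

var A = "abc";
var originals = allStrings(A, 3);                        // the 27 messages m
var transformed = allStrings(A, 4)                       // the 81 candidates ...
  .sort(function (x, y) { return h0(x) - h0(y); })
  .slice(0, 27);                                         // ... keep the 27 with smallest H0 as f(m)

console.log(avg(originals.map(function (m) { return 3 * h0(m); })));   // ≈ 2.893 = Avg(NH0(m))
console.log(avg(transformed.map(function (m) { return 4 * h0(m); }))); // ≈ 2.884 = Avg(NtH0(f(m)))
console.log(avg(originals.map(distinctSymbols)));                      // ≈ 2.1
console.log(avg(transformed.map(distinctSymbols)));                    // ≈ 1.89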
Therefore, since on average NtH0(f(m)) < NH0(m) and the coding scheme of f(m) is shorter than the coding scheme of m, we have experimentally demonstrated the following inequality:
NH(X) < Avg(NtH0(f(m)) + coding scheme) < Avg(NH0(m) + coding scheme)
In this way, the set of transformed messages has, under entropy coding, an average compressed message length smaller than that of the set of untransformed messages. Consequently, when the source is not known, applying the described transform yields, on average, a reduction in the length of the compressed message. Therefore, we are able to reduce the inefficiency caused by not knowing the source.
In conclusion, the question posed at the beginning, about the existence of a minimum compression length for a sequence, remains an unsolved problem. However, new developments in information theory, by reducing the inefficiency created when the source is not known, have made it possible to take a significant step forward on this topic.
Upvotes: 0
Reputation: 29468
Every algorithm which is able to transform data in such a way that it takes up less memory is called compression, be it lossless or lossy.
e.g. (compressed form for "example given" :-) )
The following is imho the simplest form, called run-length encoding, RLE for short:
a,a,a,b,c -> 3a,1b,1c
As you can see, every run of identical consecutive characters is compressed into a count and a single character.
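A minimal run-length encoder along these lines could look like this (a sketch; the function name and the output format are just illustrative):
// Run-length encoding: collapse each run of identical symbols into "<count><symbol>".
function rle(s) {
  var out = [], i = 0, j;
  while (i < s.length) {
    j = i;
    while (j < s.length && s[j] === s[i]) j++;  // find where the current run ends
    out.push((j - i) + s[i]);                   // e.g. "aaa" becomes "3a"
    i = j;
  }
  return out.join(",");
}
console.log(rle("aaabc")); // "3a,1b,1c"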
You can also search for repeating patterns, which is much more difficult:
a,b,a,b,a,c --> 2(a,b),1(a),1(c)
There is a lot of literature and there are many web resources about compression algorithms; you should use them to get a deeper view.
Upvotes: 3
Reputation: 2523
We can use the LZW compression algorithm to compress text files efficiently and quickly by making use of hash tables.
Upvotes: 0
Reputation: 6233
Yep, compression. A simple algorithm would be run-length encoding. There is also information theory, which is the basis for compression algorithms.
Information theory: more common inputs should get shorter codes, thus making the overall message length shorter.
So, if you're encoding binary, where the sequence 0101 is very common (at least about 25% of the input; 25% is the break-even point for this scheme), then a simple compression would be:
0101 = 0
anything else = 1[original 4 bits]
So the input: 0101 1100 0101 0101 1010 0101 1111 0101
Would be compressed to: 0 11100 0 0 11010 0 11111 0
That's a compression from 32 bits down to 20 bits.
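As a sketch, the scheme above can be written down directly (illustrative code that operates on the 4-bit groups as strings):
// "0101" is assumed to be the common group: it becomes a single "0" bit,
// every other group becomes "1" followed by its original 4 bits.
function encodeGroups(groups) {
  return groups.map(function (g) { return g === "0101" ? "0" : "1" + g; });
}
var input = "0101 1100 0101 0101 1010 0101 1111 0101".split(" ");
var output = encodeGroups(input);
console.log(output.join(" "));  // "0 11100 0 0 11010 0 11111 0"
console.log(input.join("").length + " bits -> " + output.join("").length + " bits"); // 32 bits -> 20 bits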
An important lesson: the choice of compression algorithm is entirely dependent on the input. Pick the wrong algorithm and you will likely make the data longer.
Upvotes: 2
Reputation: 3402
Unless you have to code some solution yourself, you could use some ZIP compression library for the programming language you're using.
And yes, this is data compression.
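For example, Node.js ships a built-in zlib module that gives you DEFLATE/gzip compression (the same lossless compression used inside ZIP files, though not a .zip archive as such); a minimal round-trip sketch:
// Lossless round trip with Node's built-in zlib (DEFLATE); no third-party library needed.
var zlib = require("zlib");

var original = Buffer.from("a,a,a,b,a,a,c");
var compressed = zlib.deflateSync(original);   // compress
var restored = zlib.inflateSync(compressed);   // decompress

console.log(compressed.length + " bytes compressed");
console.log(restored.equals(original));        // true: nothing was lost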
Upvotes: 0