dreeves
dreeves

Reputation: 26932

Repetition-based, pattern-based data compression algorithm

Suppose I have the following string:

ABCADCADCADABC

I want to compress it by finding repeating substrings. What's an algorithm that gives the optimal compression?

In the above example it should return

AB*1 CAD*3 ABC*1

For comparison, a greedy algorithm might return

ABC*1 ADC*2 AD*1 ABC*1

Upvotes: 0

Views: 3988

Answers (4)

Chris Cain
Chris Cain

Reputation: 666

This sounds like a job for suffix arrays/trees!

http://en.wikipedia.org/wiki/Suffix_array

You can use a suffix array built over your string to figure out patterns that repeat. For instance, we can build a suffix array over your example as follows (I'm using $ as always coming after every letter, you can sort it so that $ comes before every letter ... either way will work):

ABCADCADCADABC$
ABC$
ADABC$
ADCADABC$
ADCADCADABC$
BCADCADCADABC$
BC$
CADABC$
CADCADABC$
CADCADCADABC$
C$
DABC$
DCADABC$
DCADCADABC$
$

From this, we can more easily see the common patterns in the string. Using the information in this suffix array representation, we can see that CAD is repeated 3x in a local area, and we'd likely use this as our choice for compression. ADC and DCA and so on are not as attractive because they compress less of the string.

http://en.wikipedia.org/wiki/Suffix_tree

Suffix trees are more efficient ways of doing the same task. Once you wrap your head around how to do something using suffix arrays, it's not too far of a jump to go onto suffix trees. In fact, this is used in popular compression algorithms including LZW 1 and BWT (Bzip) 2.

Upvotes: 3

mcdowella
mcdowella

Reputation: 19601

It may not be practically relevant, but for the particular question you ask there is a dynamic programming solution. If you have computed the optimum way to compress the strings of length 1, 2, 3...n-1 starting from the first character, then you can compute the optimum way to compress the string of length n starting from the first character by looking at the last k characters for each possibility k and seeing if they form a multiple of a simple string. If so, compute the cost of compressing the first n-k characters and then expressing the last k characters using a multiple of a string.

So in your example you would finish up by noticing that ABC was a multiple of itself, and that if you expressed this as ABC*1 you could use the answer you had already worked out for the first 11 characters of AB CAD*3 to produce AB*1 CAD*3 ABC*1

Upvotes: 2

Mark Adler
Mark Adler

Reputation: 112284

Better still would be:

ABCAD(6,3)(3,11)

where (n,d) is a length and distance back of a match. So (6,3) copies six bytes starting from three bytes back. While that may sound a little odd, by the time it gets three bytes in, the next three bytes it needs have been copied. So CADCAD is appended. The (3,11) causes ABC to be appended.

This is called LZ77 compression. It is what is implemented by zip, gzip, and zlib using the deflate compressed data format. That format not only references previous string matches, but also uses Huffman compression on the literals (e.g. ABCAD) as well as the lengths and distances.

Upvotes: 1

smocking
smocking

Reputation: 3709

Depending on whether you prefer fast and simple or high compression ratio you could take a look into the Lempel-Ziv-Welch (LZW) or Lempel-Ziv-Markov chain (LZMA) algorithms. They both keep dictionaries of recurring strings.

Upvotes: 3

Related Questions