Markus Johansson
Markus Johansson

Reputation: 3773

Find the prefix substring which gives best compression

Problem:

Given a list of strings, find the substring which, if subtracted from the beginning of all strings where it matches and replaced by an escape byte, gives the shortest total length.

Example:

"foo", "fool", "bar"

The result is: "foo" as the base string with the strings "\0", "\0l", "bar" and a total length of 9 bytes. "\0" is the escape byte. The sum of the length of the original strings is 10, so in this case we only saved one byte.

A naive algorithm would look like:

for string in list
  for i = 1, i < length of string
      calculate total length based on prefix of string[0..i]
      if better than last best, save it
return the best prefix

That will give us the answer, but it's something like O((n*m)^2), which is too expensive.

Upvotes: 2

Views: 983

Answers (3)

nlucaroni
nlucaroni

Reputation: 47934

Use a forest of prefix trees (trie)...

  f_2    b_1
 /       |
 o_2     a_1
 |       |
 o_2     r_1
 |
 l_1

then, we can find the best result, and guarantee it, by maximizing (depth * frequency) which will be replaced with your escape character. You can optimize the search by doing a branch and bound depth first search for the maximum.

On the complexity: O(C), as mentioned in comment, for building it, and for finding the optimal, it depends. If you order the first elements frequency (O(A) --where A is the size of the languages alphabet), then you'll be able to cut out more branches, and have a good chance of getting sub-linear time.

I think this is clear, I am not going to write it up --what is this a homework assignment? ;)

Upvotes: 7

James Curran
James Curran

Reputation: 103515

Well, first step would be to sort the list. Then one pass through the list, comparing each element with the previous, keeping track of the longest 2-character, 3-character, 4-character etc runs. Then figure is the 20 3-character prefixes better than the 15 4-character prefixes.

Upvotes: 1

EBGreen
EBGreen

Reputation: 37740

I would try starting by sorting the list. Then you simply go from string to string comparing the first character to the next string's first char. Once you have a match you would look at the next char. You would need to devise a way to track the best result so far.

Upvotes: 1

Related Questions