Reputation: 570
I have a set of strings, each of the same length (10 chars), with the following properties. The size of the set is around 5,000 - 10,000 strings, and the data set can change frequently. Although each string is unique, a substring of a particular pattern appears in most of these strings, though not necessarily at the same position.
Some examples are
123abc7gh0
t123abcmla
wp12123abc
123abc
being the substring which appears in most of the strings
The problem is to map each string to a shorter string, and the mapping should be deterministic. I could use a simple enumeration algorithm which maps each string encountered to an incremented counter value (over the sorted set of strings). But since the set is bound to change frequently, this algorithm would not compute the same map across different runs. I could also use a data compression algorithm like Huffman coding to compress each individual string, but I do not believe that would be effective, as each string in itself has very few repeated characters.

What approach should I adopt to solve the problem by taking advantage of the properties of the data set? Note that I do not want to compress the whole set of data; I want to map each individual string in the set to a shorter string.
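To make the non-determinism concrete, here is a minimal sketch (Python, illustrative only, not my real code) of the enumeration approach and how it breaks across runs:

    # Enumerate the sorted set: each string's code is its position.
    def enumerate_strings(strings):
        return {s: str(i) for i, s in enumerate(sorted(strings))}

    run1 = enumerate_strings(["t123abcmla", "123abc7gh0"])
    run2 = enumerate_strings(["t123abcmla", "123abc7gh0", "aa123abczz"])
    print(run1["t123abcmla"])  # '1'
    print(run2["t123abcmla"])  # '2' -- same string, different code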
Upvotes: 3
Views: 491
Reputation: 149
I'm confronted with the same kind of task and wonder whether it is possible to achieve the mapping without making use of persistence.
If persisting the mappings already in use is allowed, then the solution is simple: assign a number to each of the strings, using a representation in a sufficiently high base so that the numbers' string representations stay within the required maximum size. Each source string is assigned the next free number, and the persisted mappings make sure no number is ever used a second time. This policy gives you consistent results even if you go through the procedure multiple times with a changing set of data: a string occurring for the first time receives its private number, and that number stays reserved to it forever; numbers that are no longer in use are never reused.
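A minimal sketch of that policy in Python (the names and the JSON-file store are my own choices, not a prescribed implementation):

    import json
    import string

    ALPHABET = string.digits + string.ascii_letters  # base 62

    def to_base62(n):
        # Represent a non-negative integer in base 62.
        if n == 0:
            return ALPHABET[0]
        digits = []
        while n > 0:
            n, r = divmod(n, len(ALPHABET))
            digits.append(ALPHABET[r])
        return "".join(reversed(digits))

    def map_strings(strings, store_path="mapping.json"):
        # Load the persisted state; start fresh if it does not exist yet.
        try:
            with open(store_path) as f:
                state = json.load(f)
        except FileNotFoundError:
            state = {"next": 0, "map": {}}
        # New strings get the next free number; known ones keep theirs.
        for s in strings:
            if s not in state["map"]:
                state["map"][s] = to_base62(state["next"])
                state["next"] += 1
        with open(store_path, "w") as f:
            json.dump(state, f)
        return {s: state["map"][s] for s in strings}

With 62 symbols, four characters already cover about 14.8 million distinct strings (62^4), far more than the 10,000 in question.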
The more challenging question is: is it possible to guarantee uniqueness without the aid of a persisted mapping? I'm afraid it is not: by the pigeonhole principle, any mapping from longer strings to shorter ones must produce collisions for some inputs, and without persisted state there is no way to detect or avoid them.
Upvotes: 1
Reputation: 4622
If Huffman does not gain any improvement, you might try LZW or another dictionary-based compression method. However, this only works if the structure of the strings (i.e. the distribution of characters/substrings) does not completely change over time. For example, if the strings consisted of English words, dictionary-based substring compression (LZW) might be a good candidate.
But if the distribution changes, or the characters are distributed roughly uniformly, I am afraid there is no compression method capable of reducing the string size.
But the last question remains: What for? Why bother compressing 10,000 strings?
Edit: The answer is: the strings are used to create folder names (paths). As there is a restriction on the total length, they should be as compact as possible.
You might try to create a database (i.e. a dictionary) and use the index (encoded e.g. as Base64) as the compressed string. This gives you a maximum of 6 chars even when assuming a maximum dictionary size of 2^32-1; for the 10,000 strings here, 3 chars would suffice.
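A rough sketch of that encoding in Python (hypothetical function name; the dictionary lookup itself would live in your database), assuming the index fits into 32 bits:

    import base64
    import struct

    def index_to_code(index):
        # Encode a 32-bit index as URL-safe Base64: 6 chars after
        # stripping the '=' padding (filesystem-safe for folder names).
        raw = struct.pack(">I", index)
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    print(index_to_code(0))     # 'AAAAAA'
    print(index_to_code(9999))  # code for the 10,000th entry

The urlsafe variant matters here: standard Base64 uses '/', which cannot appear in a path component.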
Upvotes: 1
Reputation: 31
If you can pre-process the set of strings and determine the pattern that occurs in most of them, you could treat that pattern as a single character (using some encoding), which would shorten those strings.
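For instance, a minimal sketch in Python (assuming a placeholder character such as '#' that never occurs in the source strings):

    PATTERN = "123abc"   # the common substring found by pre-processing
    PLACEHOLDER = "#"    # assumption: never occurs in the data itself

    def shorten(s):
        # Replace the known pattern with a single placeholder character.
        return s.replace(PATTERN, PLACEHOLDER, 1)

    def restore(s):
        # Invert the mapping.
        return s.replace(PLACEHOLDER, PATTERN, 1)

    print(shorten("t123abcmla"))  # 't#mla' -- 10 chars down to 5

This is deterministic and reversible, but it only shrinks the strings that actually contain the pattern.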
Upvotes: 1