alpha_ulrich
alpha_ulrich

Reputation: 556

Shortest string with lowest edit distance from set of strings

Suppose I have a list of strings, which are similar. I wish to figure out the common part or characteristic of all these strings. Is there a known way to figure out a string which is most similar to all strings in a given set, and does not belong to the set?

For example, if I have the following set:

Hello
Hell
Help
Hepl

'Hel' gives a levenshtein distance of 2,1,1,1. Currently I am thinking of taking different substrings as base, and computing the distance (my sets are fairly small, so brute forcing will not be an issue), but this solution does not find strings which are not essentially substrings of any given string in the set, but might be the most optimal solution (cases like where the solution is conjugation of two substrings).

Any leads regarding this would be appreciated.

Upvotes: 2

Views: 816

Answers (1)

stefan
stefan

Reputation: 3759

You said brute force is acceptable :-). The classic approach then is breadth first search. For each string in your list you generate all strings with an edit distance of 1. from those you do all distance 2 strings and so on. for every given string you get a tree of mutated strings. After every round (distance) you check if there is a string common to every tree.

pseudocode for a levenshtein distance:

alphabet = "abcd..."
starters = "Hello", "Hell", "Help", "Hepl"
relatives = set()
distance = 0
for word in starters
    trees[word][distance] = word

while len(relatives) == 0
    distance++
    for tree in trees
        for word in tree[distance-1]
            for pos in range(len(word))
                new_word = word.erase(pos)
                if new_word not in tree
                    tree[distance].insert(new_word)
                    dict[new_word] += 1
                    if dict[new_word] == len(starters)
                        relatives.insert(new_word)
            for pos in range(len(word))
                for letter in alphabet
                    new_word = word.replace(pos, letter)
                    if new_word not in tree:
                        tree[distance].insert(new_word)
                        dict[new_word] += 1
                        if dict[new_word] == len(starters)
                            relatives.insert(new_word)
            for pos in range(len(word) + 1):
                for letter in alphabet
                    new_word = word.insert(pos, letter)
                    if new_word not in tree
                        tree[distance].insert(new_word)
                        dict[new_word] += 1
                        if dict[new_word] == len(starters)
                            relatives.insert(new_word)
print relatives

Upvotes: 1

Related Questions