Reputation: 3732

How to find string similar to 2 other strings (in terms of Levenshtein distance)?

Let's say I have 2 strings which is pretty similar. I want to find other string which is close to s1 and s2 in terms of Levenshtein distance.

import Levenshtein
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
dist = Levenshtein.distance(s1, s2)
print(dist)
mid_str = get_avg_string(s1, s2)

What can be effective implementation of function:

def get_avg_string(s1, s2):
    return ''

I need that this variables:

sum_lev = Levenshtein.distance(s1, mid_str) + Levenshtein.distance(s2, mid_str)
diff_lev = abs(Levenshtein.distance(s1, mid_str) - Levenshtein.distance(s2, mid_str)

to be minimal (I think sum_lev will be equal to dist and diff_lev <= 1).

Upvotes: 8

Answers (6)

ZFTurbo

Reputation: 3732

Here is the code which returns list of all strings between s1 and s2. So you can choose the one in the middle.

def get_all_avg_strings(s1, s2):
    import Levenshtein
    dist = Levenshtein.distance(s1, s2)
    s1_new = s1
    intermediate_strings = [s1_new]
    for i in range(dist):
        e = Levenshtein.editops(s1_new, s2)
        s1_new = Levenshtein.apply_edit(e[:1], s1_new, s2)
        intermediate_strings.append(s1_new)
    check = Levenshtein.distance(s2, s1_new)
    if check != 0:
        print('Some error here!')
        exit()
    return intermediate_strings[1:-1]

Then you can create the requested function as:

def get_avg_string(s1, s2):
    avg = get_all_avg_strings(s1, s2)
    if len(avg) > 0:
        s = avg[len(avg) // 2]
    else:
        s = s1
    return s

Upvotes: 0

Red

Reputation: 27577

This is not a solution, but a very basic idea to finding a string that is close to two strings:

import Levenshtein

s1 = 'aaabbbccc'
s2 = 'abacbbbccde'

dist = Levenshtein.distance(s1, s2)
print(dist)

def get_avg_string(s1, s2):
    s2, s1 = sorted([s1, s2], key=len)
    s3 = ''.join(a if i & 1 else b for i, (a, b) in enumerate(zip(s1, s2)))
    return s3

mid_str = get_avg_string(s1, s2)

print(Levenshtein.distance(s1, mid_str))
print(Levenshtein.distance(s2, mid_str))

Output:

4
2
3

Explaining the function

def get_avg_string(s1, s2):
    s2, s1 = sorted([s1, s2], key=len)
    s3 = ''.join(a if i & 1 else b for i, (a, b) in enumerate(zip(s1, s2)))
    return s3

The loop

for i, (a, b) in enumerate(zip(s1, s2))

will iterate through strings s1 and s2 in parallel using the zip() method, along with the index of each iteration with the enumerate() method.

The condition

a if i & 1 else b

will take the part from s1, a, if the index is odd (if & 1 returns 1 rather than 0), else the part from s2, b.

I used

s2, s1 = sorted([s1, s2], key=len)

to make the line below it take the first character from the longest string.

Upvotes: 1

Memristor

Reputation: 650

This is not a solution but something to start with; an immediate improvement would be on function eq_len_strings that is equalizing the strings to the right; also you can take sub-strings of s1 on s2 to create your first mid-string (since this helps reduce the Levenshtein distance) and then just fill the _ with any character on your alphabet as search_mid_string does.

Another improvement is avoid (contrary to what I do) to fill all blanks (_) maybe adding the empty string to your alphabet or considering the difference in length for both strings.

import Levenshtein

def eq_len_strings(s1, s2):
    if len(s1) < len(s2):
        s1 = s1.ljust(len(s1)+len(s2)-len(s1), '_')
    elif len(s1) > len(s2):
        s2 = s2.ljust(len(s2)+len(s1)-len(s2), '_')
    return s1, s2

def keep_eq_chars(s1, s2):
    s = ''
    for char1, char2 in zip(s1, s2):
        if char1 == '_' or char2 == '_' or char1 != char2:
            s += '_'
        else:
            s += char1
    return s

def search_mid_string(s1, s2):
    alph = set(list(s1+s2))
    s1e, s2e = eq_len_strings(s1, s2)
    # start string
    s_mid = list(keep_eq_chars(s1e, s2e))
    
    alternate = 0
    for i in range(len(s_mid)):
        char1 = s1[i] if i < len(s1)-1 else '_'
        char2 = s2[i] if i < len(s2)-1 else '_'
        if s_mid[i] == '_':
            if alternate == 0 and char1 != '_':
                s_mid[i] = char1
                alternate = 1
            elif alternate == 1 and char2 != '_':
                s_mid[i] = char2
                alternate = 0
            else:
                s_mid[i] = ''
    s1_to_s2_dist = Levenshtein.distance(s1, s2)
    s1_to_mid_dist = Levenshtein.distance(s1, ''.join(s_mid))
    s2_to_mid_dist = Levenshtein.distance(s2, ''.join(s_mid))
    ans = {
        's1_to_s2_dist': s1_to_s2_dist,
        's1_to_mid_dist': s1_to_mid_dist,
        's2_to_mid_dist': s2_to_mid_dist,
        's_mid': ''.join(s_mid)
    }
    return ans

With the strings given:

s1 = 'aaabbbccc'
s2 = 'abacbbbccde'

search_mid_string(s1, s2)

// Output
>{'s1_to_s2_dist': 4,
> 's1_to_mid_dist': 2,
> 's2_to_mid_dist': 3,
> 's_mid': 'aaacbbcccd'}

Upvotes: 2

Iluvatar

Reputation: 672

Assumption

So first of all let's assume string1(lets call it s1) is the before and string2(lets call it s2) after transformation. This way we can easily separate add and remove character operations.

The example

Let's consider example given by you. Levenshtein.distance(s1='aaabbbccc', s2='abacbbbccde') This means we are asking question how many operations separete these string(How much it costs to mutate one into other).

Levenshtein matrix

Now that we assume s1 is the start point, let's see what the algorithm gives us.
We can calculate that distance between s1 and s2, and it spits out integer value of 4
It comes from the last value of the calculated Levenshtein matrix, like so:

Walk Levennshtein matrix

As we can see there are places where value goes up and where it stays the same.
If we go over the matrix from the top left corner, we should read it like:

going down means: adding a character to the s1 string
going right means: removing a character from the s1 string
going diagonally down-right means: replacing the character

Our goal is to get to the bottom-right corner and the result value is the cost(or distance) that is associated with it.

Distance change in matrix

Let's see how matrix will change its values when we change last value in s1 As we can see previous intersection of cxd changed to dxd and now the cost does not rise in this place and that results in smaller distance between those strings.
What we see is that small changes in s1 will result in distance change of 1 and if we compare original s1 to the modified one:

It look preety damn close in term of Levenshtein distance.

Conclusion

There is potentially an algorithm to generate quite a lot of strings similar to s1 and s2.
It would go over generated matrix and change single character in a string to generate next solution.
We should consider multiple changes done to a original matrix. And then for each new solution we potentially want to calculate new Levenshtein matrix and use it as next source to generate solutions.
And we should never consider only lowering these values, that would generate only portion of potential solutions.

One thing that is important to consider. In term of Levenshtein distance it does not matter if we compare character a to b or a to c all that is important is that "Are those the same character?" if not we do not care about its value.

Upvotes: 5

konserw

Reputation: 554

A little expansion of @Matze's answer - when you have NP-hard problem to solve you could use genetic algorithm to find some solution that might be better then just taking one of strings in finite time (no guarantees that it would be best or even better then one of original strings)

Upvotes: 2

Matze

Reputation: 368

I'm afraid that what you ask for is not possible, since the problem is NP-hard. I will try to outline a few of the key concepts for why that is the case, but I'd encourage you to look up Center Strings and Steiner Strings.

Suppose that you have a set of strings called S. The optimal Steiner String is a string which minimizes the sum of the distances of each string in S to itself (also known as consensus error). This corresponds to the first property, which you called sum_lev. The optimal Steiner String is usually ambiguous and not part of the original set S (but doesn't have to be).

The problem you are facing is that there is no efficient way to compute the optimal Steiner String. Even if you manage to restrict your search space, you will still have an exponential amount of candidates. The problem is hence NP-hard.

It can be proven that S always contains a string which is a reasonable approximation of the optimal Steiner String. So even if you only pay attention to the first of your two properties, the best shot you have is to simply select one of your original strings. Since you are apparently only dealing with two strings, it should not matter which one you choose.

TL;DR

To summarize, you are dealing with an NP-hard problem which can not be solved efficiently, but only approximated. If you are only dealing with two strings, the string you are looking for can be approximated by one of the given strings. I'm sorry that this is probably not the answer you hoped for, but hopefully it was still somewhat helpful.

Upvotes: 6