Reputation: 3732
Let's say I have 2 strings which is pretty similar. I want to find other string which is close to s1 and s2 in terms of Levenshtein distance.
import Levenshtein
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
dist = Levenshtein.distance(s1, s2)
print(dist)
mid_str = get_avg_string(s1, s2)
What can be effective implementation of function:
def get_avg_string(s1, s2):
return ''
I need that this variables:
sum_lev = Levenshtein.distance(s1, mid_str) + Levenshtein.distance(s2, mid_str)
diff_lev = abs(Levenshtein.distance(s1, mid_str) - Levenshtein.distance(s2, mid_str)
to be minimal (I think sum_lev
will be equal to dist
and diff_lev <= 1
).
Upvotes: 8
Views: 1728
Reputation: 3732
Here is the code which returns list of all strings between s1 and s2. So you can choose the one in the middle.
def get_all_avg_strings(s1, s2):
import Levenshtein
dist = Levenshtein.distance(s1, s2)
s1_new = s1
intermediate_strings = [s1_new]
for i in range(dist):
e = Levenshtein.editops(s1_new, s2)
s1_new = Levenshtein.apply_edit(e[:1], s1_new, s2)
intermediate_strings.append(s1_new)
check = Levenshtein.distance(s2, s1_new)
if check != 0:
print('Some error here!')
exit()
return intermediate_strings[1:-1]
Then you can create the requested function as:
def get_avg_string(s1, s2):
avg = get_all_avg_strings(s1, s2)
if len(avg) > 0:
s = avg[len(avg) // 2]
else:
s = s1
return s
Upvotes: 0
Reputation: 27547
This is not a solution, but a very basic idea to finding a string that is close to two strings:
import Levenshtein
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
dist = Levenshtein.distance(s1, s2)
print(dist)
def get_avg_string(s1, s2):
s2, s1 = sorted([s1, s2], key=len)
s3 = ''.join(a if i & 1 else b for i, (a, b) in enumerate(zip(s1, s2)))
return s3
mid_str = get_avg_string(s1, s2)
print(Levenshtein.distance(s1, mid_str))
print(Levenshtein.distance(s2, mid_str))
Output:
4
2
3
Explaining the function
def get_avg_string(s1, s2):
s2, s1 = sorted([s1, s2], key=len)
s3 = ''.join(a if i & 1 else b for i, (a, b) in enumerate(zip(s1, s2)))
return s3
The loop
for i, (a, b) in enumerate(zip(s1, s2))
will iterate through strings s1
and s2
in parallel using the zip()
method, along with the index of each iteration with the enumerate()
method.
The condition
a if i & 1 else b
will take the part from s1
, a
, if the index is odd (if & 1
returns 1
rather than 0
), else the part from s2
, b
.
I used
s2, s1 = sorted([s1, s2], key=len)
to make the line below it take the first character from the longest string.
Upvotes: 1
Reputation: 650
This is not a solution but something to start with; an immediate improvement would be on function eq_len_strings
that is equalizing the strings to the right; also you can take sub-strings of s1
on s2
to create your first mid-string (since this helps reduce the Levenshtein distance) and then just fill the _
with any character on your alphabet as search_mid_string
does.
Another improvement is avoid (contrary to what I do) to fill all blanks (_
) maybe adding the empty string to your alphabet or considering the difference in length for both strings.
import Levenshtein
def eq_len_strings(s1, s2):
if len(s1) < len(s2):
s1 = s1.ljust(len(s1)+len(s2)-len(s1), '_')
elif len(s1) > len(s2):
s2 = s2.ljust(len(s2)+len(s1)-len(s2), '_')
return s1, s2
def keep_eq_chars(s1, s2):
s = ''
for char1, char2 in zip(s1, s2):
if char1 == '_' or char2 == '_' or char1 != char2:
s += '_'
else:
s += char1
return s
def search_mid_string(s1, s2):
alph = set(list(s1+s2))
s1e, s2e = eq_len_strings(s1, s2)
# start string
s_mid = list(keep_eq_chars(s1e, s2e))
alternate = 0
for i in range(len(s_mid)):
char1 = s1[i] if i < len(s1)-1 else '_'
char2 = s2[i] if i < len(s2)-1 else '_'
if s_mid[i] == '_':
if alternate == 0 and char1 != '_':
s_mid[i] = char1
alternate = 1
elif alternate == 1 and char2 != '_':
s_mid[i] = char2
alternate = 0
else:
s_mid[i] = ''
s1_to_s2_dist = Levenshtein.distance(s1, s2)
s1_to_mid_dist = Levenshtein.distance(s1, ''.join(s_mid))
s2_to_mid_dist = Levenshtein.distance(s2, ''.join(s_mid))
ans = {
's1_to_s2_dist': s1_to_s2_dist,
's1_to_mid_dist': s1_to_mid_dist,
's2_to_mid_dist': s2_to_mid_dist,
's_mid': ''.join(s_mid)
}
return ans
With the strings given:
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
search_mid_string(s1, s2)
// Output
>{'s1_to_s2_dist': 4,
> 's1_to_mid_dist': 2,
> 's2_to_mid_dist': 3,
> 's_mid': 'aaacbbcccd'}
Upvotes: 2
Reputation: 662
So first of all let's assume string1
(lets call it s1
) is the before and string2
(lets call it s2
) after transformation. This way we can easily separate add and remove character operations.
Let's consider example given by you.
Levenshtein.distance(s1='aaabbbccc', s2='abacbbbccde')
This means we are asking question how many operations separete these string(How much it costs to mutate one into other).
Now that we assume s1
is the start point, let's see what the algorithm gives us.
We can calculate that distance between s1
and s2
, and it spits out integer value of 4
It comes from the last value of the calculated Levenshtein matrix, like so:
As we can see there are places where value goes up and where it stays the same.
If we go over the matrix from the top left corner, we should read it like:
s1
strings1
stringOur goal is to get to the bottom-right corner and the result value is the cost(or distance) that is associated with it.
Let's see how matrix will change its values when we change last value in s1
As we can see previous intersection of c
xd
changed to d
xd
and now the cost does not rise in this place and that results in smaller distance between those strings.
What we see is that small changes in s1
will result in distance change of 1
and
if we compare original s1
to the modified one:
It look preety damn close in term of Levenshtein distance.
There is potentially an algorithm to generate quite a lot of strings similar to s1
and s2
.
It would go over generated matrix and change single character in a string to generate next solution.
We should consider multiple changes done to a original matrix. And then for each new solution we potentially want to calculate new Levenshtein matrix and use it as next source to generate solutions.
And we should never consider only lowering these values, that would generate only portion of potential solutions.
One thing that is important to consider. In term of Levenshtein distance it does not matter if we compare character a
to b
or a
to c
all that is important is that "Are those the same character?" if not we do not care about its value.
Upvotes: 5
Reputation: 554
A little expansion of @Matze's answer - when you have NP-hard problem to solve you could use genetic algorithm to find some solution that might be better then just taking one of strings in finite time (no guarantees that it would be best or even better then one of original strings)
Upvotes: 2
Reputation: 368
I'm afraid that what you ask for is not possible, since the problem is NP-hard. I will try to outline a few of the key concepts for why that is the case, but I'd encourage you to look up Center Strings and Steiner Strings.
Suppose that you have a set of strings called S. The optimal Steiner String is a string which minimizes the sum of the distances of each string in S to itself (also known as consensus error). This corresponds to the first property, which you called sum_lev
. The optimal Steiner String is usually ambiguous and not part of the original set S (but doesn't have to be).
The problem you are facing is that there is no efficient way to compute the optimal Steiner String. Even if you manage to restrict your search space, you will still have an exponential amount of candidates. The problem is hence NP-hard.
It can be proven that S always contains a string which is a reasonable approximation of the optimal Steiner String. So even if you only pay attention to the first of your two properties, the best shot you have is to simply select one of your original strings. Since you are apparently only dealing with two strings, it should not matter which one you choose.
TL;DR
To summarize, you are dealing with an NP-hard problem which can not be solved efficiently, but only approximated. If you are only dealing with two strings, the string you are looking for can be approximated by one of the given strings. I'm sorry that this is probably not the answer you hoped for, but hopefully it was still somewhat helpful.
Upvotes: 6