Reputation:
I'm having trouble figuring out the algorithm for this problem. I'll paste the problem description and how I kind of solved it, although it's not the correct solution. It is similar to the edit distance algorithm and I used the same approach, but something is off and i cannot figure out exactly what
The deletion distance between two strings is the minimum sum of ASCII values of characters that you need to delete in the two strings in order to have the same string. The deletion distance between cat and at is 99, because you can just delete the first character of cat and the ASCII value of 'c' is 99. The deletion distance between cat and bat is 98 + 99, because you need to delete the first character of both words. Of course, the deletion distance between two strings can't be greater than the sum of their total ASCII values, because you can always just delete both of the strings entirely. Implement an efficient function to find the deletion distance between two strings.You can refer to the Wikipedia article on the algorithm for edit distance if you want to. The algorithm there is not quite the same as the algorithm required here, but it's similar.
This is my code. I used a dynamic programming approach. I would say the line after the last "else" needs to be changed, but feel free to correct any mistake
def delete_distance(s1, s2):
m = [[0 for j in range(len(s2)+1)] for i in range(len(s1)+1)]
for i in range(len(s1)+1):
for j in range(len(s2)+1):
if i == 0:
m[i][j] = sum(bytearray(s2[:j]))
elif j == 0:
m[i][j] = sum(bytearray(s1[:i]))
elif s1[i-1] == s2[j-1]:
m[i][j] = m[i-1][j-1]
else:
m[i][j] = ord(s1[i-1]) + ord(s2[j-1]) + min(m[i-1][j-1], m[i-1][j], m[i][j-1])
return m[len(s1)][len(s2)]
I know it's wrong because the output of delete_distance('cat', 'cbat') is 197, and the correct result should be 98, because we only need to delete b which has an ASCII value of 98.
Upvotes: 3
Views: 22892
Reputation: 2416
As mentioned in the previous answer by Ken Y-N, the else part should be minimum of 3 operations cost. The only change in this answer is - it is rephrased to suite your problem.
The 3 operations are:
The following should work - I guess:
def delete_distance(s1, s2):
m = [[0 for j in range(len(s2)+1)] for i in range(len(s1)+1)]
for i in range(len(s1)+1):
for j in range(len(s2)+1):
if i == 0:
m[i][j] = sum(bytearray(s2[:j]))
elif j == 0:
m[i][j] = sum(bytearray(s1[:i]))
elif s1[i-1] == s2[j-1]:
m[i][j] = m[i-1][j-1]
else:
s1del = ord(s1[i-1])
s2del = ord(s2[j-1])
s1s2del = s1del + s2del
m[i][j] = min(m[i-1][j-1] + s1s2del, m[i-1][j] + s1del, m[i][j-1] + s2del)
return m[len(s1)][len(s2)]
Hope it helps!
Upvotes: 4
Reputation: 15008
Looking at the relevant Wiki page, I see that the last else:
should be the minimum of the distances so far plus the insert/delete/substitution cost. So, redoing that term with intermediate values to hopefully illustrate this point better, we get:
else:
wdel = ord(s1[i-1])
wins = ord(s2[j-1])
wsub = wdel + wins
m[i][j] = min(m[i-1][j-1] + wsub, m[i-1][j] + wdel, m[i][j-1] + wins)
Note, if you use wdel = wins = wsub = 1
and m[i][j] = len(s1)
or s2
you get the classic Levenshtein distance.
Upvotes: 0