Reputation: 21333
I was given this code that quickly returns whether the Levenshtein distance between two strings is exactly 2.
def li(s, i):
    try:
        return s[i]
    except IndexError:
        return None

def f(str1, str2):
    t = [4, 4, 1, 2, 3]
    for i, str1_symb in enumerate(str1):
        p = 4
        res = []
        for j, t_val in enumerate(t):
            p = min(t_val - (str1_symb == li(str2, i + j - 2)), p, li(t, j + 1) or 4) + 1
            res.append(p)
        t = res
    return li(t, len(str2) - len(str1) + 2) == 3
You can test it with, for example:
f("zzzzfood", "zzzzdodod")
which will return True, and
f("zzzzfood", "zzzzdodo")
which will return False.
The standard algorithm for computing the Levenshtein distance builds a dynamic programming table and fills in its elements from left to right, top to bottom, using the recurrence

D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + (str1[i] != str2[j]))

(from the wiki page linked above).
If you only want to know whether the Levenshtein distance is at most 2, you only need to look at the cells of the dynamic programming table that are at most 2 to the right or left of the diagonal.
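As a minimal sketch of that banded idea (function and variable names are mine, not from the code in question), you can treat every cell more than 2 away from the diagonal as "big" and never compute it:

```python
def within_two(a, b):
    # Banded Levenshtein check: a cell at (i, j) always holds a distance
    # of at least |i - j|, so anything outside the |i - j| <= 2 band can
    # be treated as BIG without ever computing it.
    if abs(len(a) - len(b)) > 2:
        return False
    BIG = 3  # any value above the threshold 2 behaves identically
    prev = [min(j, BIG) for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [min(i, BIG)]  # distance from a[:i] to the empty string, capped
        for j, cb in enumerate(b, 1):
            if abs(i - j) > 2:
                cur.append(BIG)  # outside the band: distance is > 2 anyway
            else:
                cur.append(min(prev[j] + 1,                # delete from a
                               cur[-1] + 1,                # insert into a
                               prev[j - 1] + (ca != cb),   # substitute / match
                               BIG))
        prev = cur
    return prev[-1] <= 2
```

Note that this answers "at most 2" rather than "exactly 2"; the code above distinguishes the two with its final == 3 comparison.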
The code above doesn't obviously do that and I can't work out what it is doing. Some particularly mysterious parts:

- li(s, i) returns None when i is greater than or equal to len(s). Sometimes i will be negative, where it will still return a letter from s.
- li(t, j + 1) or 4 returns 4 if li(t, j + 1) is None, but I don't know what its purpose is.
- What is the purpose of p?

Can anyone decipher it?
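For reference, a quick sketch of the li behaviour I mean (the example values are mine):

```python
def li(s, i):
    try:
        return s[i]
    except IndexError:
        return None

# out-of-range lookups come back as None
print(li("abc", 10))           # None
# but negative indices wrap around, Python-style
print(li("abc", -1))           # 'c'
# `or 4` turns a None lookup into 4
print(li([1, 2, 3], 5) or 4)   # 4
print(li([1, 2, 3], 1) or 4)   # 2
```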
Upvotes: 0
Views: 481
Reputation: 3104
Alright -- so first, please don't use this code.
At a high level, what it's doing is what Stef said in the comments: checking the near diagonals.
The index i iterates over the first string while the inner loop's j walks across the second string near the diagonal.
Adding some prints makes this a bit clearer:
def f(str1, str2):
    t = [4, 4, 1, 2, 3]
    for i, str1_symb in enumerate(str1):
        print()
        print(f"i={i}, str1_symb={str1_symb}")
        p = 4
        res = []
        print("res:", res)
        for j, t_val in enumerate(t):
            print()
            print(f" - j={j}, t_val={t_val}")
            p1 = t_val - (str1_symb == li(str2, i + j - 2))
            p2 = p
            p3 = li(t, j + 1) or 4
            print(f" - p1a: t_val - (str1_symb == li(str2, i + j - 2)) ==> {t_val} - ({str1_symb} == li({str2}, {i} + {j} - 2))")
            print(f" - p1b: t_val - (str1_symb == li(str2, i + j - 2)) ==> {t_val} - ({str1_symb} == li({str2}, {i + j - 2}))")
            print(f" - p1c: t_val - (str1_symb == li(str2, i + j - 2)) ==> {t_val} - ({str1_symb} == {li(str2, i + j - 2)})")
            print(f" - p1d: t_val - (str1_symb == li(str2, i + j - 2)) ==> {t_val} - {(str1_symb == li(str2, i + j - 2))}")
            print(f" - p1e: t_val - (str1_symb == li(str2, i + j - 2)) ==> {p1}")
            print(" - ")
            print(f" - p2: {p}")
            print(f" - ")
            print(f" - p3a: li(t, j + 1) or 4 ==> li({t}, {j} + 1) or 4")
            print(f" - p3b: li(t, j + 1) or 4 ==> li({t}, {j + 1}) or 4")
            print(f" - p3c: li(t, j + 1) or 4 ==> {li(t, j + 1)} or 4")
            print(f" - p3: {li(t, j + 1) or 4}")
            print(f" - ")
            p = min(p1, p2, p3) + 1
            print(f" - p: min(p1, p2, p3) + 1 ==> min({p1}, {p2}, {p3}) + 1")
            print(f" - p: min(p1, p2, p3) + 1 ==> {min(p1, p2, p3) + 1}")
            res.append(p)
        t = res
        print(f"t = {t}")
    print()
    print(f"result: li(t, len(str2) - len(str1) + 2) == 3 ==> li({t}, {len(str2)} - {len(str1)} + 2) == 3")
    print(f"result: li(t, len(str2) - len(str1) + 2) == 3 ==> li({t}, {len(str2) - len(str1) + 2}) == 3")
    print(f"result: li(t, len(str2) - len(str1) + 2) == 3 ==> {li(t, len(str2) - len(str1) + 2)} == 3")
    print(f"result: li(t, len(str2) - len(str1) + 2) == 3 ==> {li(t, len(str2) - len(str1) + 2) == 3}")
    return li(t, len(str2) - len(str1) + 2) == 3
My read on t = [4, 4, 1, 2, 3] is that it sets the initial offset values, with 4 acting as a cap (so that we effectively disregard the negative list-index values; anything greater than 3 should produce the same results). The negative list indices are superfluous computations, but harmless.
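One way to check that reading (this parametrised variant and its name are mine) is to replace every literal 4 with a larger sentinel and confirm the answer doesn't change:

```python
def li(s, i):
    try:
        return s[i]
    except IndexError:
        return None

def f_sentinel(str1, str2, big=4):
    # identical to f, except every literal 4 is replaced by `big`
    t = [big, big, 1, 2, 3]
    for i, str1_symb in enumerate(str1):
        p = big
        res = []
        for j, t_val in enumerate(t):
            p = min(t_val - (str1_symb == li(str2, i + j - 2)), p,
                    li(t, j + 1) or big) + 1
            res.append(p)
        t = res
    return li(t, len(str2) - len(str1) + 2) == 3

# any sentinel >= 4 gives the same True/False answers
for big in (4, 10, 100):
    print(big, f_sentinel("zzzzfood", "zzzzdodod", big),
          f_sentinel("zzzzfood", "zzzzdodo", big))
```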
We can see how i & j move across the matrix:
Which produces the following difference matrix:
Breaking the p computation
p = min(t_val - (str1_symb == li(str2, i + j - 2)), p, li(t, j + 1) or 4) + 1
into its three parts makes its work clearer:
p1 = t_val - (str1_symb == li(str2, i + j - 2))
p2 = p
p3 = li(t, j + 1) or 4
p = min(p1, p2, p3) + 1
We can see that this step uses t/res as a memory of the previous row's computation, where we look up the previously calculated offsets and compute the new value.
The last note is that everything is scaled up by 1 so that the author can subtract True (i.e. 1) when the characters match; the 1 is then added back after the min, which means a true distance of 2 shows up as a stored value of 3, leading to the final equality check against 3.
This was so hard to read that I kept gaining some understanding and then losing the thread -- an interesting reverse-engineering problem, but terrible production code.
As I spent a little time trying to come up with a better approach, I realized I ended up heading down the same path; despite my criticisms of the readability, I have to give the author props for their efficiency. Computing and storing the algorithm's else condition and carrying it down is clever and efficient.
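For comparison, here is a straightforward (slower to read off the band, but far more readable) reference implementation one could test the original f against -- the names are mine:

```python
def levenshtein(a, b):
    # classic row-by-row dynamic programming, O(len(a) * len(b))
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[-1] + 1,                # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute / match
        prev = cur
    return prev[-1]

def is_distance_exactly_two(a, b):
    return levenshtein(a, b) == 2

print(is_distance_exactly_two("zzzzfood", "zzzzdodod"))  # True
print(is_distance_exactly_two("zzzzfood", "zzzzdodo"))   # False
```

The tradeoff is that this computes the full table instead of the 5-cell band, so it does more work per row than the original.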
Upvotes: 7