Simd
Simd

Reputation: 21333

Can anyone explain this code for computing Levenshtein distance?

I was given this code that quickly returns whether the Levenshtein distance between two strings is exactly 2.

def li(s, i):
    try:
        return s[i]
    except IndexError:
        return None
    
def f(str1, str2):
 t = [4, 4, 1, 2, 3]
 for i, str1_symb in enumerate(str1):
    p = 4
    res = []
    for j, t_val in enumerate(t):
        p = min(t_val - (str1_symb == li(str2, i + j - 2)), p, li(t, j + 1) or 4) + 1
        res.append(p)
    t = res
 return li(t, len(str2) - len(str1) + 2) == 3

You can test it with:

f("zzzzfood", "zzzzdodod") 

for example which will return True

and

f("zzzzfood", "zzzzdodo")

which will return False.

The standard algorithm for computing the Levenshtein distance builds a dynamic programming table and fills in the elements from left to right, top to bottom using the formula:

enter image description here

(from the wiki page linked above)

If you only want to return if the Levenshtein distance is at most 2 you can only look at cells of the dynamic programming that are at most 2 right or left from the diagonal.

The code above doesn't obviously do that and I can't work out what it is doing. Some particularly mysterious parts:

Can anyone decipher it?

Upvotes: 0

Views: 481

Answers (1)

Schalton
Schalton

Reputation: 3104

Alright -- so first, please don't use this code.

At a high level what its doing is what Stef said in the comments, checking the near diagonals. The index position of i is iterating and the looped j is walking across the second string near the diagonal

Adding some prints make this a bit clearer:

def f(str1, str2):
    t = [4, 4, 1, 2, 3]
    for i, str1_symb in enumerate(str1):
        print()
        print(f"i={i}, str1_symb={str1_symb}")
        p = 4
        res = []
        print("res:", res)
        for j, t_val in enumerate(t):
            print()
            print(f" - j={j}, t_val={t_val}")
            p1 = t_val - (str1_symb == li(str2, i + j - 2))
            p2 = p
            p3 = li(t, j + 1) or 4
            print(f" - p1a: t_val - (str1_symb == li(str2, i + j - 2))  ==>  {t_val} - ({str1_symb} == li({str2}, {i} + {j} - 2))")
            print(f" - p1b: t_val - (str1_symb == li(str2, i + j - 2))  ==>  {t_val} - ({str1_symb} == li({str2}, {i + j - 2}))")
            print(f" - p1c: t_val - (str1_symb == li(str2, i + j - 2))  ==>  {t_val} - ({str1_symb} == {li(str2, i + j - 2)})")
            print(f" - p1d: t_val - (str1_symb == li(str2, i + j - 2))  ==>  {t_val} - {(str1_symb == li(str2, i + j - 2))}")
            print(f" - p1e: t_val - (str1_symb == li(str2, i + j - 2))  ==>  {p1}")
            print(" - ")
            print(f" - p2: {p}")
            print(f" - ")
            print(f" - p3a: li(t, j + 1) or 4 ==> li({t}, {j} + 1) or 4")
            print(f" - p3b: li(t, j + 1) or 4 ==> li({t}, {j + 1}) or 4")
            print(f" - p3c: li(t, j + 1) or 4 ==> {li(t, j + 1)} or 4")
            print(f" - p3: {li(t, j + 1) or 4}")
            print(f" - ")
            p = min(p1, p2, p3) + 1
            print(f" - p: min(p1, p2, p3) + 1 ==> min({p1}, {p2}, {p3}) + 1")
            print(f" - p: min(p1, p2, p3) + 1 ==> {min(p1, p2, p3) + 1}")
            res.append(p)
        t = res
        print(f"t = {t}")
    print()
    print(f"result: li(t, len(str2) - len(str1) + 2) == 3 ==> li({t}, {len(str2)} - {len(str1)} + 2) == 3")
    print(f"result: li(t, len(str2) - len(str1) + 2) == 3 ==> li({t}, {len(str2) - len(str1) + 2}) == 3")
    print(f"result: li(t, len(str2) - len(str1) + 2) == 3 ==> {li(t, len(str2) - len(str1) + 2)} == 3")
    print(f"result: li(t, len(str2) - len(str1) + 2) == 3 ==> {li(t, len(str2) - len(str1) + 2) == 3}")
    return li(t, len(str2) - len(str1) + 2) == 3

My read on t = [4, 4, 1, 2, 3] is that it's setting offset values, and maxing (so that we effectively disregard the -ive list index values, anything greater than 3 should produce the same results) the negative list indices are superfluous computations, but harmless.

We can see how i & j move across the matrix:

nested loop order

Which produces the following difference matrix:

diagonal computations

Breaking the p computation p = min(t_val - (str1_symb == li(str2, i + j - 2)), p, li(t, j + 1) or 4) + 1 into it's three parts makes it's work clearer:

p1 = t_val - (str1_symb == li(str2, i + j - 2))
p2 = p
p3 = li(t, j + 1) or 4
p = min(p1, p2, p3) + 1

We can see that this step is using t/res as a memory on the last computation where we lookup the previously calculated offsets and compute the new value

The last note is, it looks like everything is scaled up by 1 so that the author can use -true to indicate a difference; this is then added back to the min which scales a distance of 2 to a distance of 3 leading to the final equality


This was so hard to read I kept gaining some understanding then losing the thought, an interesting reverse engineering problem, but terrible production code.


As I spent a little time trying to come up with a better approach I realized I ended up heading down the same path; despite my criticisms of the readability I have to give props to the author for their efficiency. Computing and storing the algorithms else condition and carrying it down is clever and efficient.

Upvotes: 7

Related Questions