Improving performance of levenshtein distance in numpy

Question

I have the following function:

import numpy as np

def levenshtein(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] holds the distance between seq1[:x] and seq2[:y]
    matrix = np.zeros((size_x, size_y), dtype=int)
    matrix[:, 0] = np.arange(size_x)
    matrix[0, :] = np.arange(size_y)

    for x in range(1, size_x):
        for y in range(1, size_y):
            # substitution is free when the characters match
            cost = 0 if seq1[x-1] == seq2[y-1] else 1
            matrix[x, y] = min(
                matrix[x-1, y] + 1,       # deletion
                matrix[x-1, y-1] + cost,  # match or substitution
                matrix[x, y-1] + 1        # insertion
            )
    return matrix[size_x - 1, size_y - 1]

I want to apply it to many pairs of strings. To make it as fast as possible, I would like to remove the for loops and replace them with some kind of vectorization, but I couldn't find a good way to do it. Any ideas?

user8531240 · Accepted Answer

In my opinion, it is better to use an existing Python module than to reinvent the wheel. You will save a lot of time.

Open cmd and run pip install python-Levenshtein, or, if you use git, go to your project folder and run git clone https://github.com/ztane/python-Levenshtein.git (github link). Then open a Python file and write:

import Levenshtein
Levenshtein.distance('Levenshtein', 'Lenvinsten')
# output will be 4
# ... your code ...

But if you need to write it yourself, you can see how other developers have implemented it, or look at the usage examples for the Levenshtein module, at the same link.
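If you do want to stay in NumPy, one possible approach is to vectorize the inner loop and keep only the loop over rows. The following is a rough sketch, not code taken from the Levenshtein module: the name levenshtein_rows is made up for illustration, and the strings are converted to code points so that the comparison is a plain integer comparison. Deletion and substitution only depend on the previous row, so they can be computed for a whole row at once. Insertion depends on the cell to the left in the same row, which is handled here with a running minimum (np.minimum.accumulate).

```python
import numpy as np

def levenshtein_rows(seq1, seq2):
    # Hypothetical helper: row-wise vectorized Levenshtein distance.
    a = np.array([ord(c) for c in seq1], dtype=np.int64)
    b = np.array([ord(c) for c in seq2], dtype=np.int64)
    n = len(b)
    offsets = np.arange(n + 1)
    prev = offsets.copy()  # row 0: distance from the empty prefix of seq1

    for i in range(1, len(a) + 1):
        cur = np.empty(n + 1, dtype=np.int64)
        cur[0] = i
        # Deletion and match/substitution use only the previous row.
        cur[1:] = np.minimum(prev[1:] + 1, prev[:-1] + (b != a[i - 1]))
        # Insertion: cur[j] = min over k <= j of (cur[k] + j - k),
        # which is a running minimum of (cur - offsets), shifted back.
        cur = np.minimum.accumulate(cur - offsets) + offsets
        prev = cur

    return int(prev[-1])

print(levenshtein_rows('kitten', 'sitting'))
```

This still runs a Python loop over the characters of seq1, so for many short strings a compiled module is likely to remain faster. The sketch mainly pays off when the strings are long.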
