safwan
safwan

Reputation: 97

Levenshtein distance in python giving only 1 as edit distance

I have a python program to read two lists (one with errors and other with correct data). Every element in my list with error needs to be compared with every element in my correct list. After comparing i get all the edit distance between every compared pair. now i can find the least edit distance for the given error data and thru get my correct data.

I am trying to use levenshtein distance to calculate the edit distance but its returning all edit distance as 1 even which is wrong.

This means the code for calculating levenshtein distance is not correct. I am struggling to find a fix for this. HELP!

My Code

import csv

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)

if __name__ == "__main__":

    with open("all_correct_promo.csv","rb") as file1:
        reader1 = csv.reader(file1)
        correctPromoList = list(reader1)
        #print correctPromoList

    with open("all_extracted_promo.csv","rb") as file2:
        reader2 = csv.reader(file2)
        extractedPromoList = list(reader2)
        #print extractedPromoList

    incorrectPromo = []
    count = 0
    for extracted in extractedPromoList:
        if(extracted not in correctPromoList):
            incorrectPromo.append(extracted)
        else:
            count = count + 1
    #print incorrectPromo

    for promos in incorrectPromo:
        for correctPromo in correctPromoList:
            distance = lev(promos,correctPromo)
            print promos, correctPromo , distance

Upvotes: 0

Views: 1498

Answers (2)

caspillaga
caspillaga

Reputation: 563

The implementation is correct. I tested this:

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)


print lev('abcde','bc') # prints 3, which is correct
print lev('abc','bc') # prints 1, which is correct

Your problem, as I noticed by your comments, is probably when you call the method:

a = ['NSP-212690']
b = ['FE SV X']

print lev(a,b) # prints 1 which is incorrect because you are comparing arrays, not strings
print lev(a[0],b[0]) # prints 10, which is correct

So, what you can do is:

Before the call to "lev(a,b)", extract the first element of each array

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)

a = ['NSP-212690']
b = ['FE SV X']
a = a[0] # this is the key part
b = b[0] # and this
print lev(a,b) # prints 10, which is correct

Anyway, I would not recommend you that recursive implementation, because the performance is veeeery poor

I would recommend this implementation instead (source: wikipedia-levenshtein)

def lev(seq1, seq2):
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]

or maybe this slightly modified version:

def lev(seq1, seq2):
    if not a: return len(b)
    if not b: return len(a)
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]

Upvotes: 1

Sebastian Wozny
Sebastian Wozny

Reputation: 17526

There is a Python package available that implements the levenshtein distance : python-levenshtein

To install it:

pip install python-levenshtein

To use it:

>>> import Levenshtein
>>> string1 = 'dsfjksdjs'
>>> string2 = 'dsfiksjsd'
>>> print Levenshtein.distance(string1, string2)
3

Upvotes: 1

Related Questions