ckot

Reputation: 891

Looking for a Python library which can perform Levenshtein/other edit distance at the word level

I've seen a bunch of similar questions on SO and elsewhere, but none of the answers quite satisfy my needs, so I don't think this is a dup.

Also, I totally know how to implement this myself, but I'm trying not to have to re-invent the wheel.

Does anyone know of any Python packages which can compute Levenshtein/other edit distance between two lists of words (I've found a few), but which also allow you to specify your own costs for insertion, deletion, substitution, and transposition?

Basically, I want the distance computed to be the number of edits on words in the sentences, not the number of characters by which the sentences differ.

I'm trying to replace a custom Python extension module which is actually written in C, using Python 2's C API. I could rewrite it in either pure Python or Cython, but I'd rather simply add a dependency to the project. The only problem is that the existing code allows you to specify your own costs for the various operations, and I haven't found a package which allows this so far.
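
To make the requirement concrete, here is a rough pure-Python sketch of the kind of interface I mean (the name and default costs are illustrative only, not from any package):

# Rough sketch only: word-level Damerau-Levenshtein with per-operation costs.
def word_edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0, trans=1.0):
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,       # delete a word
                          d[i][j - 1] + ins,        # insert a word
                          d[i - 1][j - 1] + cost)   # substitute / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans)  # swap adjacent words
    return d[m][n]

print(word_edit_distance('A B C'.split(), 'A C B'.split()))  # 1.0 with a transposition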

Upvotes: 15

Views: 28206

Answers (5)

Tanan

Reputation: 11

The Python distancia package does this very well:

from distancia import Levenshtein
s1 = 'WAKA WAKA QB WTF BBBQ WAKA LOREM IPSUM WAKA'
s2 = 'WAKA OMFG QB WTF WAKA WAKA LOREM IPSUM WAKA'
distance = Levenshtein().levenshtein_distance_words(s1, s2)
print(distance)

output:

2.0

Upvotes: 1

Alex

Reputation: 534

Maybe the fuzzywuzzy library. It is built on top of the difflib library; python-Levenshtein is used for speed.

https://pypi.org/project/fuzzywuzzy/
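
A minimal usage sketch (note that fuzzywuzzy returns 0-100 similarity ratios rather than raw edit counts, and it does not expose per-operation costs):

from fuzzywuzzy import fuzz

s1 = 'WAKA WAKA QB WTF BBBQ WAKA LOREM IPSUM WAKA'
s2 = 'WAKA OMFG QB WTF WAKA WAKA LOREM IPSUM WAKA'
# token_sort_ratio splits on whitespace, sorts the tokens, then compares,
# so it is the closest fit here to a word-level comparison.
print(fuzz.token_sort_ratio(s1, s2))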

Upvotes: 0

Charlie Parker

Reputation: 5231

In case you need a normalized token-level Levenshtein similarity:

def token_edit_levenshtein_similarity_normalized(tokens1, tokens2) -> float:
    """
    Compute the normalized Levenshtein similarity between two token lists.
    """
    import nltk
    return 1 - nltk.edit_distance(tokens1, tokens2) / max(len(tokens1), len(tokens2))

def test_token_edit():
    import nltk

    s1 = 'WAKA WAKA QB WTF BBBQ WAKA LOREM IPSUM WAKA'.split()
    s2 = 'WAKA OMFG QB WTF WAKA WAKA LOREM IPSUM WAKA'.split()
    print(s1)
    print(s2)
    print(nltk.edit_distance(s1, s2))
    print(token_edit_levenshtein_similarity_normalized(s1, s2))

output:

['WAKA', 'WAKA', 'QB', 'WTF', 'BBBQ', 'WAKA', 'LOREM', 'IPSUM', 'WAKA']
['WAKA', 'OMFG', 'QB', 'WTF', 'WAKA', 'WAKA', 'LOREM', 'IPSUM', 'WAKA']
2
0.7777777777777778

Upvotes: 2

Catalina Chircu

Reputation: 1572

Here is one library which is said to be fast and computes various types of word distance, including Levenshtein:

https://pypi.org/project/python-Levenshtein/

You should also try the Hamming distance, which is less memory- and time-consuming than Levenshtein (note that it only applies to sequences of equal length).
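
If you want word-level distances out of a character-level library like this one, a common trick is to map each distinct word to a single stand-in character first. A sketch (the words_to_chars helper is mine, not part of the package):

import Levenshtein  # pip install python-Levenshtein

def words_to_chars(*sentences):
    # Assign each distinct word a unique private-use-area character.
    vocab = {}
    for s in sentences:
        for w in s.split():
            vocab.setdefault(w, chr(0xE000 + len(vocab)))
    return [''.join(vocab[w] for w in s.split()) for s in sentences]

a, b = words_to_chars('WAKA WAKA QB WTF BBBQ WAKA LOREM IPSUM WAKA',
                      'WAKA OMFG QB WTF WAKA WAKA LOREM IPSUM WAKA')
print(Levenshtein.distance(a, b))  # 2 word-level edits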

Upvotes: 4

vurmux

Reputation: 10020

NLTK has a function named edit_distance. It calculates the Levenshtein distance between two strings, but it works equally well with lists of strings:

import nltk

s1 = 'WAKA WAKA QB WTF BBBQ WAKA LOREM IPSUM WAKA'.split()
s2 = 'WAKA OMFG QB WTF WAKA WAKA LOREM IPSUM WAKA'.split()
print(s1)
print(s2)
print(nltk.edit_distance(s1, s2))

output:

['WAKA', 'WAKA', 'QB', 'WTF', 'BBBQ', 'WAKA', 'LOREM', 'IPSUM', 'WAKA']
['WAKA', 'OMFG', 'QB', 'WTF', 'WAKA', 'WAKA', 'LOREM', 'IPSUM', 'WAKA']
2
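
Worth noting for the custom-cost requirement: edit_distance also accepts substitution_cost and transpositions keyword arguments (insertion and deletion costs stay fixed at 1):

import nltk

s1 = 'A B C'.split()
s2 = 'A C B'.split()
# Count an adjacent word swap as one edit, and make substitutions cost 2:
print(nltk.edit_distance(s1, s2, substitution_cost=2, transpositions=True))  # 1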

Upvotes: 23
