Minimum Levenshtein distance across multiple words

Question

I am trying to do some string matching using the Levenshtein algorithm for closest words on businesses. (In python but language won't make a huge difference)

An example query would be

search = 'bna' lat & lon are close by the result I am looking for.

There is a pub right by the latitude and longitude called BNA Brewing Co. by searching BNA my hopes would be that that shows up first (as bna == bna)

I have tried two different way

m = min([editdistance.eval(search, place_split) for place_split in place.name.split(' ')
                     if place_split not in string.punctuation])

returns without ranking based on geographical distance, only levenshtein distance

Coffee & Books In Town Center
Talk 'n' Coffee
Raggedy Ann & Andy's

and with taking into account geographical distance, secondary to levenshtein

Shapers Hair Salon & Spa
Amora Day Spa
Pure Esthetics and Micro-Pigmentation

And

m = editdistance.eval(search, place.name)

The first one returns without ranking based on geographical distance, only levenshtein distance

KFC
MOO
A&W

and with taking into account geographical distance, secondary to levenshtein

A&W
A&W
KFC

So you can see that neither way are returning anything close to BNA Brewing Co. What kind of logic do I have to use to get it to return something when the search terms exactly matches one of the place names in my database?

bunji · Accepted Answer

Recall that Levenshtein distances count the number of substitutions, additions and deletions required to transform one string into another. Because of this, they often are minimized when comparing strings of similar length (because even if a lot of substitutions are required, you don't have to add or remove a bunch of characters). You can see this playing out in your second example where your best outputs all are the same length as your search string (len("bna") == len("A&W")).

If your search string is always going to be a single word, then your idea to calculate the distance for each word in the string is a good one since each word is more likely to be a similar length to your search string. However, currently you are doing a case sensitive comparison which means that editdistance.eval('bna', 'BNA') == 3 which I'm guessing you don't want.

try:

m = min([editdistance.eval(search.lower(), place_split.lower()) for place_split in place.name.split(' ') if place_split not in string.punctuation])

which should give you a case insensitive search.

Minimum Levenshtein distance across multiple words

Answers (1)

Related Questions