Harry
Harry

Reputation: 13329

Matching incorrectly spelt words with correct ones in python

I'm building an app that gets incoming SMSs, then based on a keyword, it looks to see if that keyword is associated with any campaigns that it is running. The way I'm doing it now is to load a list of keywords and possible spelling combinations, then when the SMS comes in, I look through all keywords and combinations to see if there is a match.

How would you do this not using this method, but by actually looking for words that might match another word.

Let's say the correct spelling is HAMSTER, normally I would give the campaign alternatives like HMSTER HIMSTER HAMSTAR HAMSTR HAMSTIR etc.

Is there a smart way of doing this?

HAMSTER

"hamstir".compare_to("hamster") ? match

EDIT:

How about 2 words? Say we know there are two words that need to match in the SMS:

correct for first word = THE FIRST WORD

correct for second word = AND SECOND WORD

SMS = FIRST WORD SECOND

EDIT:

Ideally people should SMS the words comma seperated, that whay I would know where to split and look for the words.

But what if they dont, like :

UNIQUE KEYWORD SECOND PARAMATER

How would I tell where the words split? The first word might be 3 words long and the second 3 or 1 or 2 etc.

In these examples, how would you use the techniques below to find the two words ?

Would you look twice ? one for each needed parameter or keyword?

Upvotes: 4

Views: 11454

Answers (5)

jfs
jfs

Reputation: 414325

You could use a fuzzy matching and a named list with regex library e.g., to find any phrase from a list with at most one error (insertion, deletion, substitution):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex as re # pip install regex

words = ["first word", "second word", "third"]
sms = u"junk Furst Word second Third"

for m in re.finditer(ur"(?fie)\L<words>{e<=1}", sms, words=words):
    print(m[0]) # the match
    print(m.span()) # return indexes where the match found in the sms
    # to find out which of the words matched:
    print(next(w for w in words
               if re.match(ur"(?fi)(?:%s){e<=1}" % re.escape(w), m[0])))

Output

Furst Word
(5, 14)
first word
Third
(22, 27)
third

Or you could iterate over the words directly:

for w in words:
    for m in re.finditer(ur"(?fie)(?:%s){e<=1}" % re.escape(w), sms):
        print(m[0])
        print(m.span())
        print(w)

It produces the same output as the first example.

Upvotes: 1

Shawn Chin
Shawn Chin

Reputation: 86884

What you're looking for is Levenshtein Distance.

Assuming your list of campaign isn't too large, you can calculate the distance between the input word and that of each campaign then select the one with the shortest. To filter out completely wrong words you might need to set a minimum acceptable distance and discard the input if the shortest is still beyond the limit.

To calculate the distance between two words, you can try one of these modules:

For example, using levenshtein.py:

from levenshtein import levenshtein
campaigns = (
    "HAMSTER",
    "TWO WORDED",
    "FRIDAY",
)

def get_campaign(word):
  return min(campaigns, key=lambda x: levenshtein(word, x))

Usage:

>>> get_campaign("HAMSTA")
'HAMSTER'
>>> get_campaign("HAM WORDED")
'TWO WORDED'
>>> get_campaign("FROODY")
'FRIDAY'
>>> get_campaign("FRIDAY")
'FRIDAY'

Note that is a very simple-minded approach and will always return something even if the input is completely different.

Upvotes: 7

David Robinson
David Robinson

Reputation: 78610

The simplest solution is to use the difflib package, which has a get_close_matches function for approximate string matching:

import difflib
difflib.get_close_matches(word, possibilities)

Upvotes: 11

inspectorG4dget
inspectorG4dget

Reputation: 113975

It seems to me that you're trying to build a spell checker. You could use minimum edit distance matching. Alternatively, look at Peter Norvig's python spell checker

Hope that helps

Upvotes: 1

bpgergo
bpgergo

Reputation: 16037

I use levenshtein distance to solve similar problem see http://en.wikipedia.org/wiki/Levenshtein_distance

def distance(u1, u2):
    try:
        s1 = unicode(u1)    
        s2 = unicode(u2)
    except:
        s1 = u1
        s2 = u2        
    if len(s1) < len(s2):
        return distance(u2, u1)
    if not s1:
        return len(s2)

    previous_row = xrange(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
            deletions = current_row[j] + 1       # than s2
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]


distance("hamstir", "hamster") < 3
True
distance("god", "hamster") < 3
False

Upvotes: 2

Related Questions