Reputation: 13329
I'm building an app that gets incoming SMSs, then based on a keyword, it looks to see if that keyword is associated with any campaigns that it is running. The way I'm doing it now is to load a list of keywords and possible spelling combinations, then when the SMS comes in, I look through all keywords and combinations to see if there is a match.
How would you do this not using this method, but by actually looking for words that might match another word.
Let's say the correct spelling is HAMSTER, normally I would give the campaign alternatives like HMSTER HIMSTER HAMSTAR HAMSTR HAMSTIR etc.
Is there a smart way of doing this?
HAMSTER
"hamstir".compare_to("hamster") ? match
EDIT:
How about 2 words? Say we know there are two words that need to match in the SMS:
correct for first word = THE FIRST WORD
correct for second word = AND SECOND WORD
SMS = FIRST WORD SECOND
EDIT:
Ideally people should SMS the words comma seperated, that whay I would know where to split and look for the words.
But what if they dont, like :
UNIQUE KEYWORD SECOND PARAMATER
How would I tell where the words split? The first word might be 3 words long and the second 3 or 1 or 2 etc.
In these examples, how would you use the techniques below to find the two words ?
Would you look twice ? one for each needed parameter or keyword?
Upvotes: 4
Views: 11454
Reputation: 414325
You could use a fuzzy matching and a named list with regex
library e.g., to find any phrase from a list with at most one error (insertion, deletion, substitution):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex as re # pip install regex
words = ["first word", "second word", "third"]
sms = u"junk Furst Word second Third"
for m in re.finditer(ur"(?fie)\L<words>{e<=1}", sms, words=words):
print(m[0]) # the match
print(m.span()) # return indexes where the match found in the sms
# to find out which of the words matched:
print(next(w for w in words
if re.match(ur"(?fi)(?:%s){e<=1}" % re.escape(w), m[0])))
Furst Word
(5, 14)
first word
Third
(22, 27)
third
Or you could iterate over the words directly:
for w in words:
for m in re.finditer(ur"(?fie)(?:%s){e<=1}" % re.escape(w), sms):
print(m[0])
print(m.span())
print(w)
It produces the same output as the first example.
Upvotes: 1
Reputation: 86884
What you're looking for is Levenshtein Distance.
Assuming your list of campaign isn't too large, you can calculate the distance between the input word and that of each campaign then select the one with the shortest. To filter out completely wrong words you might need to set a minimum acceptable distance and discard the input if the shortest is still beyond the limit.
To calculate the distance between two words, you can try one of these modules:
For example, using levenshtein.py
:
from levenshtein import levenshtein
campaigns = (
"HAMSTER",
"TWO WORDED",
"FRIDAY",
)
def get_campaign(word):
return min(campaigns, key=lambda x: levenshtein(word, x))
Usage:
>>> get_campaign("HAMSTA")
'HAMSTER'
>>> get_campaign("HAM WORDED")
'TWO WORDED'
>>> get_campaign("FROODY")
'FRIDAY'
>>> get_campaign("FRIDAY")
'FRIDAY'
Note that is a very simple-minded approach and will always return something even if the input is completely different.
Upvotes: 7
Reputation: 78610
The simplest solution is to use the difflib package, which has a get_close_matches
function for approximate string matching:
import difflib
difflib.get_close_matches(word, possibilities)
Upvotes: 11
Reputation: 113975
It seems to me that you're trying to build a spell checker. You could use minimum edit distance matching. Alternatively, look at Peter Norvig's python spell checker
Hope that helps
Upvotes: 1
Reputation: 16037
I use levenshtein distance to solve similar problem see http://en.wikipedia.org/wiki/Levenshtein_distance
def distance(u1, u2):
try:
s1 = unicode(u1)
s2 = unicode(u2)
except:
s1 = u1
s2 = u2
if len(s1) < len(s2):
return distance(u2, u1)
if not s1:
return len(s2)
previous_row = xrange(len(s2) + 1)
for i, c1 in enumerate(s1):
current_row = [i + 1]
for j, c2 in enumerate(s2):
insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
deletions = current_row[j] + 1 # than s2
substitutions = previous_row[j] + (c1 != c2)
current_row.append(min(insertions, deletions, substitutions))
previous_row = current_row
return previous_row[-1]
distance("hamstir", "hamster") < 3
True
distance("god", "hamster") < 3
False
Upvotes: 2