Harry

Reputation: 13329

Comparing strings in python to find errors

I have a string that is the correct spelling of a word:

FOO

I would like to allow someone to mistype the word in ways such as:

FO, F00, F0O, FO0

Is there a nice way to check for this? Lower case should also be treated as correct, or the input could be converted to upper case first, whichever is cleaner.

Upvotes: 2

Views: 927

Answers (3)

pyInTheSky

Reputation: 1469

You can use the 're' module:

import re
re.compile(r'f(o|0)+', re.I)  # re.I ignores case

You can use curly braces to limit the number of occurrences too. You can also get 'fancy' and define your 'leet' sets and add them in with %s, as in:

ay = r'(a|4|\$)'   # $ must be escaped, otherwise it anchors the end of the string
oh = r'(o|0|\))'   # alternatives are separated by |, not by commas
re.compile(r'f%s+' % oh, re.I)
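
A minimal sketch of how such a pattern could be applied to the examples in the question. The name is_acceptable is made up for illustration, and the {1,2} bound is an assumption that tolerates one missing O but not extra ones:

```python
import re

# F followed by one or two characters that are each O or 0 (zero).
# The {1,2} bound is an assumption: it accepts FO but rejects FOOO.
PATTERN = re.compile(r'F[O0]{1,2}', re.IGNORECASE)

def is_acceptable(text):
    """Return True if text is FOO or one of the tolerated misspellings."""
    return PATTERN.fullmatch(text.strip()) is not None

for candidate in ['FO', 'F00', 'F0O', 'FO0', 'foo', 'FOOO', 'BAR']:
    print(candidate, is_acceptable(candidate))
```

fullmatch is used rather than match so that input with trailing characters, such as FOOBAR, is rejected.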

Upvotes: 1

jterrace

Reputation: 67113

The standard library module difflib has a get_close_matches function.

You can use it like this:

>>> import difflib
>>> difflib.get_close_matches('FO', ['FOO', 'BAR', 'BAZ'])
['FOO']
>>> difflib.get_close_matches('F00', ['FOO', 'BAR', 'BAZ'])
[]
>>> difflib.get_close_matches('F0O', ['FOO', 'BAR', 'BAZ'])
['FOO']
>>> difflib.get_close_matches('FO0', ['FOO', 'BAR', 'BAZ'])
['FOO']

Notice that it doesn't match one of your cases (F00). You could lower the cutoff parameter, which defaults to 0.6, to get a match:

>>> difflib.get_close_matches('F00', ['FOO', 'BAR', 'BAZ'], cutoff=0.3)
['FOO']
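
To see why F00 falls below the default cutoff, it may help to look at the underlying score. As far as I know get_close_matches ranks candidates with difflib.SequenceMatcher.ratio(), which is 2*M/T (M = matched characters, T = total characters in both strings). The helper names below are made up for this sketch, and upper-casing the input is an added assumption to cover the lower-case requirement:

```python
import difflib

def similarity(word, target):
    """Similarity score in [0, 1] between word and target, ignoring case."""
    matcher = difflib.SequenceMatcher(None, target.upper(), word.upper())
    return matcher.ratio()

def close_matches(word, choices, cutoff=0.6):
    """Case-insensitive wrapper; choices are assumed to be upper case already."""
    return difflib.get_close_matches(word.upper(), choices, cutoff=cutoff)

for word in ['FO', 'F00', 'F0O', 'FO0']:
    print(word, round(similarity(word, 'FOO'), 3))

print(close_matches('fo0', ['FOO', 'BAR', 'BAZ']))
```

Only the F matches in F00, so its score is 2*1/6, which is well under 0.6. Keep in mind that a low cutoff also lets through more unrelated words.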

Upvotes: 2

Mark Byers

Reputation: 838806

One approach is to calculate the edit distance between the strings. You can, for example, use the Levenshtein distance, or invent your own distance function that treats 0 and O as closer than 0 and P.

Another is to transform each word into a canonical form and compare the canonical forms. You can, for example, convert the string to uppercase, replace all 0s with Os, 1s with Is, etc., then collapse runs of repeated letters into a single letter.
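
A rough sketch of such a custom distance, based on the usual Levenshtein dynamic programme. The function name, the look-alike pairs and the 0.25 substitution cost are all assumptions chosen for illustration:

```python
def weighted_edit_distance(a, b, lookalikes=(('O', '0'), ('I', '1')), cheap=0.25):
    """Levenshtein distance where swapping look-alike characters costs `cheap`."""
    a, b = a.upper(), b.upper()
    similar = {frozenset(pair) for pair in lookalikes}

    def substitution_cost(x, y):
        if x == y:
            return 0.0
        if frozenset((x, y)) in similar:
            return cheap          # e.g. 0 typed instead of O
        return 1.0                # unrelated characters

    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,                        # delete x
                           cur[j - 1] + 1.0,                     # insert y
                           prev[j - 1] + substitution_cost(x, y)))
        prev = cur
    return prev[-1]

for word in ['FO', 'F00', 'F0O', 'FO0', 'FOP']:
    print(word, weighted_edit_distance(word, 'FOO'))
```

You would then accept a word if its distance to FOO is at or below some threshold that you pick, for example 1.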

>>> import itertools
>>> def canonical_form(s):
...     s = s.upper()
...     s = s.replace('0', 'O')
...     s = s.replace('1', 'I')
...     s = ''.join(k for k, g in itertools.groupby(s))
...     return s
...
>>> canonical_form('FO')
'FO'
>>> canonical_form('F00')
'FO'
>>> canonical_form('F0O')
'FO'
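
The comparison step could then look something like this. same_word is a made-up helper name, and canonical_form is repeated so the snippet runs on its own:

```python
import itertools

def canonical_form(s):
    """Upper-case, map look-alike digits to letters, collapse repeated letters."""
    s = s.upper()
    s = s.replace('0', 'O')
    s = s.replace('1', 'I')
    return ''.join(k for k, g in itertools.groupby(s))

def same_word(typed, correct):
    """True if both strings reduce to the same canonical form."""
    return canonical_form(typed) == canonical_form(correct)

for typed in ['FO', 'F00', 'F0O', 'FO0', 'foo', 'FOOO', 'BAR']:
    print(typed, same_word(typed, 'FOO'))
```

One thing to be aware of: because repeated letters are collapsed, FOOO is also accepted, and distinct words that differ only in a doubled letter become indistinguishable.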

Upvotes: 6
