Edward Atkins

Reputation: 466

Using fuzzy regex (in Python) to correct spelling

I have an array of frequent words from the text I am analyzing, and I intend to use fuzzy regex matching to replace any misspellings of them.

I know I could loop over them like:

import regex as re

edits = 1
my_arr = ['word1', 'word2', 'word3']
my_text = 'this is my text with wrd1 in it'

for word in my_arr:
    r_pattern = '(' + word + '){e<=' + str(edits) + '}'
    my_text = re.sub(r_pattern, word, my_text)

But is there a way to do this with a single regex.sub call, i.e. with a pattern that looks something like this?

r_pattern = '(word1|word2|word3){e<=1}'

Upvotes: 1

Views: 219

Answers (1)

pjmaracs

Reputation: 118

Here is my solution

import regex as re

def repl(matchObj):
    return str(matchObj.lastgroup)

edits = 1
my_arr = ['word1', 'word2', 'word3']
my_text = 'this is my text with wrd3 in it'

# One named fuzzy group per word, joined into a single alternation.
# Each word doubles as its own group name.
r_pattern = '|'.join(
    '(?P<' + word + '>' + word + '){e<=' + str(edits) + '}' for word in my_arr
)

r = re.compile(r_pattern)
# repl returns the name of the group that matched, i.e. the correct spelling
my_text = r.sub(repl, my_text)
print(my_text)

It uses the lastgroup attribute of the match object, which holds the name of the last capturing group that matched. Since each alternative is its own named group, and each group is named after its word, that name is exactly the replacement text. This should scale well to a larger array if you need it to, assuming there isn't a limit on re.compile that will get in your way. Hope this helps.

Python docs covering lastgroup: https://docs.python.org/2/library/re.html

Handy regex editor to help with future problems: https://regex101.com
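One caveat with naming each group after its word: as far as I know, group names have to be valid identifiers, so a word such as "don't" or "e-mail" would not work as a name. A possible workaround, sketched below, is to generate the group names and keep a lookup from name back to word. The helper build_pattern, the w0/w1/... names, and the group_to_word dict are my own illustrative choices, not anything from the regex package. The snippet itself only builds the pattern string and demonstrates the lastgroup lookup with exact matching in the standard re module; the fuzzy pattern is meant to be passed to the third-party regex package.

```python
import re  # standard library; used for re.escape and the exact-match demo


def build_pattern(words, edits=None):
    """Build one alternation with a generated group name per word.

    Hypothetical helper, not part of the original answer. Returns the
    pattern string and a dict mapping each group name back to its word.
    """
    # The {e<=N} suffix is fuzzy syntax understood by the third-party
    # `regex` package only; it is left out when edits is None.
    fuzzy = "" if edits is None else "{e<=%d}" % edits
    group_to_word = {}
    parts = []
    for i, word in enumerate(words):
        name = "w%d" % i  # generated name, independent of the word's characters
        group_to_word[name] = word
        parts.append("(?P<%s>%s)%s" % (name, re.escape(word), fuzzy))
    return "|".join(parts), group_to_word


my_arr = ["word1", "word2", "word3"]

# Fuzzy pattern, intended for: regex.sub(fuzzy_pattern, repl, my_text)
fuzzy_pattern, group_to_word = build_pattern(my_arr, edits=1)


def repl(match):
    # lastgroup names the alternative that matched; map it back to the word
    return group_to_word[match.lastgroup]


# Exact-match demo with the standard library, to show the lastgroup lookup
exact_pattern, _ = build_pattern(my_arr)
demo = re.sub(
    exact_pattern,
    lambda m: "<" + group_to_word[m.lastgroup] + ">",
    "word2 and word3",
)
print(fuzzy_pattern)
print(demo)
```

With the regex package installed, the same repl function can be handed to regex.sub together with fuzzy_pattern. Which misspellings actually get replaced still depends on how the fuzzy matcher picks its matches, so it is worth checking the result on a sample of your own text.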

Upvotes: 1
