I have a list of several thousand locations and a list of millions of sentences. My objective is to return a list of tuples that pair each matched comment with the location mentioned in it. For example:
locations = ['Turin', 'Milan']
state_init = ['NY', 'OK', 'CA']
sent = ['This is a sent about turin. ok?', 'This is a sent about Melan.', 'Alan Turing was not from the state of OK.']
result = [('Turin', 'This is a sent about turin. ok?'), ('Milan', 'This is a sent about Melan.'), ('OK', 'Alan Turing was not from the state of OK.')]
In words: I do not want to match locations embedded within other words, and I do not want to match state initials that are not capitalized. If possible, I would also like to catch misspellings or fuzzy matches of locations that omit one correct letter, replace one correct letter with an incorrect letter, or have one error in the ordering of the correct letters (i.e., a single adjacent transposition). For example:
Milan
should match
Melan, Mlian, or Mlan but not Milano
The function below works very well at doing everything except the fuzzy matching and returning tuples, but I do not know how to do either of those things without a for loop. Not that I am against using a for loop, but I still would not know how to implement one in a computationally efficient way.
Is there a way to add these functionalities that I am interested in having or am I trying to do too much in a single function?
import re

def find_keyword_comments(sents, locations, state_init):
    # Build alternation patterns from the keyword lists,
    # escaping any regex metacharacters in the names
    keywords = '|'.join(map(re.escape, locations))
    keywords1 = '|'.join(map(re.escape, state_init))
    # Locations match case-insensitively; state initials must be capitalized
    word = re.compile(r"^.*\b({})\b.*$".format(keywords), re.I)
    word1 = re.compile(r"^.*\b({})\b.*$".format(keywords1))
    newlist = filter(word.match, sents)
    newlist1 = filter(word1.match, sents)
    final = list(newlist) + list(newlist1)
    return final
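Called on the sample data above, it finds the exact matches, but it misses the fuzzy one ('Melan') and returns plain strings rather than tuples:

find_keyword_comments(sent, locations, state_init)
# ['This is a sent about turin. ok?',
#  'Alan Turing was not from the state of OK.']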
I would recommend you look at metrics for fuzzy matching; the main one you are interested in is the Levenshtein distance (sometimes called the edit distance).
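To make that concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance, which counts an adjacent transposition as a single edit. The is_fuzzy_match helper is my own illustration of your rules, not a library function: since 'Milano' (an added letter) should not match, it also rejects any candidate longer than the target.

def osa_distance(a, b):
    # Optimal string alignment distance: insertions, deletions,
    # substitutions, and adjacent transpositions each cost 1.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def is_fuzzy_match(word, location):
    # One omission, substitution, or transposition is OK,
    # but an extra letter is not, so reject longer candidates.
    return len(word) <= len(location) and osa_distance(word.lower(), location.lower()) <= 1

is_fuzzy_match('Melan', 'Milan')   # True  (substitution)
is_fuzzy_match('Mlian', 'Milan')   # True  (transposition)
is_fuzzy_match('Milano', 'Milan')  # False (added letter)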
Pure-Python implementations like the sketch above get the job done, but you can leverage a few modules to make your life easier:
fuzzywuzzy
is a very common (pip-installable) package which implements this kind of matching through what it calls a ratio. It provides a bit more functionality than you may be looking for (partial string matching, ignoring punctuation marks, token order insensitivity, ...). The only drawback is that the ratio takes the lengths of the strings into account as well. Basic usage:
from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!") # 96
python-Levenshtein
is a pretty fast package because it is basically a Python wrapper around a C library. The documentation is not the nicest, but it should work. It is now back in the PyPI index, so it is pip-installable.
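As a quick illustration of its distance function, note that plain Levenshtein distance counts a transposition as two edits and an insertion as one, so a bare distance <= 1 threshold alone will not reproduce the rules in the question:

import Levenshtein
Levenshtein.distance("Milan", "Melan")   # 1 (one substitution)
Levenshtein.distance("Milan", "Mlian")   # 2 (a transposition counts as two edits here)
Levenshtein.distance("Milan", "Milano")  # 1 (one insertion, so 'Milano' would slip through)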