JRR

Reputation: 588

Fuzzy matching strings embedded within strings

I have a list of several thousand locations and a list of millions of sentences. My objective is to return a list of tuples reporting the comment that was matched and the location mentioned within that comment. For example:

locations = ['Turin', 'Milan']
state_init = ['NY', 'OK', 'CA']

sent = ['This is a sent about turin. ok?', 'This is a sent about Melan.', 'Alan Turing was not from the state of OK.']

result = [('Turin', 'This is a sent about turin. ok?'), ('Milan', 'This is a sent about Melan.'), ('OK', 'Alan Turing was not from the state of OK.')]

In words: I do not want to match locations embedded within other words, and I do not want to match state initials unless they are capitalized. If possible, I would also like to catch misspellings or fuzzy matches of locations that omit one correct letter, replace one correct letter with an incorrect one, or have one error in the ordering of the correct letters. For example:

Milan

should match

Melan, Mlian, or Mlan but not Milano 
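
In other words, the allowed errors are one deleted letter, one replaced letter, or one swap of adjacent letters, but not an inserted letter (which is why Milano should not match). To make the criterion concrete, here is a small illustrative check (matches_fuzzily is just a name for illustration, not code I have):

def matches_fuzzily(word, target):
    # exact match
    if word == target:
        return True
    # one omitted letter: word equals target with one letter removed
    if len(word) == len(target) - 1:
        return any(target[:i] + target[i + 1:] == word
                   for i in range(len(target)))
    if len(word) == len(target):
        diffs = [i for i in range(len(word)) if word[i] != target[i]]
        # one replaced letter
        if len(diffs) == 1:
            return True
        # two adjacent letters swapped
        if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
            i, j = diffs
            return word[i] == target[j] and word[j] == target[i]
    return False

matches_fuzzily("Melan", "Milan")   # True
matches_fuzzily("Mlian", "Milan")   # True
matches_fuzzily("Mlan", "Milan")    # True
matches_fuzzily("Milano", "Milan")  # False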

The function below does everything well except the fuzzy matching and returning tuples, and I do not know how to add either without a for loop. Not that I am against using a for loop, but I still would not know how to implement one in a computationally efficient way.

Is there a way to add the functionality I am interested in, or am I trying to do too much in a single function?

import re

def find_keyword_comments(sents, locations, state_init):
    # Build one alternation pattern per keyword list
    keywords = '|'.join(locations)
    keywords1 = '|'.join(state_init)
    # Locations match case-insensitively; state initials are case-sensitive
    word = re.compile(r"^.*\b({})\b.*$".format(keywords), re.I)
    word1 = re.compile(r"^.*\b({})\b.*$".format(keywords1))
    newlist = filter(word.match, sents)
    newlist1 = filter(word1.match, sents)
    final = list(newlist) + list(newlist1)
    return final
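
For reference, with the example data above this returns the matching sentences, but not the tuples, and it misses the fuzzy match:

find_keyword_comments(sent, locations, state_init)
# ['This is a sent about turin. ok?',
#  'Alan Turing was not from the state of OK.']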

Upvotes: 1

Views: 920

Answers (1)

WillMonge

Reputation: 1035

I would recommend you look at metrics for fuzzy matching; the main one you are interested in is the Levenshtein distance (sometimes called the edit distance).
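
The distance can be computed with the textbook dynamic-programming recurrence; here is a minimal pure-Python sketch (the name levenshtein is just illustrative):

    def levenshtein(a, b):
        # prev[j] holds the edit distance between a[:i-1] and b[:j]
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                ))
            prev = curr
        return prev[-1]

    levenshtein("Milan", "Melan")   # 1
    levenshtein("Milan", "Milano")  # 1

Note that plain Levenshtein puts Milano at distance 1 from Milan as well (one insertion), so to reject it the way you describe you would additionally have to disallow insertions.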

There are implementations in pure Python (like the sketch above), but you can leverage a few modules to make your life easier:

  • fuzzywuzzy is a very common (pip-installable) package which implements this distance as what it calls the ratio. It provides a bit more functionality than you may be looking for (partial string matching, ignoring punctuation marks, token-order insensitivity, and so on). The only drawback is that the ratio takes the length of the strings into account as well. Basic usage:

    from fuzzywuzzy import fuzz
    fuzz.ratio("this is a test", "this is a test!")  # 97
    
  • python-Levenshtein is a pretty fast package because it is essentially a thin Python wrapper around a C implementation. The documentation is not the nicest, but it works, and it is back in the PyPI index, so it is pip-installable. For example:
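
    A quick example (note the import name is Levenshtein, even though the package is installed as python-Levenshtein):

    import Levenshtein

    Levenshtein.distance("Milan", "Melan")   # 1 (one substitution)
    Levenshtein.distance("Milan", "Milano")  # 1 (one insertion)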

Upvotes: 2
