I have a list of several thousand locations and a list of millions of sentences. My objective is to return a list of tuples that pair each matched comment with the location mentioned in it. For example:
locations = ['Turin', 'Milan']
state_init = ['NY', 'OK', 'CA']
sent = ['This is a sent about turin. ok?', 'This is a sent about Melan.', 'Alan Turing was not from the state of OK.']
result = [('Turin', 'This is a sent about turin. ok?'), ('Milan', 'This is a sent about Melan.'), ('OK', 'Alan Turing was not from the state of OK.')]
In words: I do not want to match locations embedded within other words, and I do not want to match state initials that are not capitalized. If possible, I would also like to catch misspellings or fuzzy matches of locations that omit one correct letter, replace one correct letter with an incorrect letter, or have one error in the ordering of the correct letters (i.e., a single adjacent transposition). For example:
Milan
should match
Melan, Mlian, or Mlan but not Milano
The function below works very well at doing everything except the fuzzy matching and returning tuples, but I do not know how to do either of those things without a for loop. Not that I am against using a for loop, but I still would not know how to implement one in a computationally efficient way.
Is there a way to add these functionalities that I am interested in having or am I trying to do too much in a single function?
import re

def find_keyword_comments(sents, locations, state_init):
    # Build alternation patterns from the keyword lists,
    # escaping any regex metacharacters in the names
    keywords = '|'.join(map(re.escape, locations))
    keywords1 = '|'.join(map(re.escape, state_init))
    # Locations match case-insensitively; state initials must be capitalized
    word = re.compile(r"^.*\b({})\b.*$".format(keywords), re.I)
    word1 = re.compile(r"^.*\b({})\b.*$".format(keywords1))
    newlist = filter(word.match, sents)
    newlist1 = filter(word1.match, sents)
    final = list(newlist) + list(newlist1)
    return final
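Called on the sample data above, it finds the exact matches, but it misses the fuzzy one ('Melan') and returns plain strings rather than tuples:

find_keyword_comments(sent, locations, state_init)
# ['This is a sent about turin. ok?',
#  'Alan Turing was not from the state of OK.']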
I would recommend you look at metrics for fuzzy matching; the main one you are interested in is the Levenshtein distance (sometimes called the edit distance).
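To make that concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance, which counts an adjacent transposition as a single edit. The is_fuzzy_match helper is my own illustration of your rules, not a library function: since 'Milano' (an added letter) should not match, it also rejects any candidate longer than the target.

def osa_distance(a, b):
    # Optimal string alignment distance: insertions, deletions,
    # substitutions, and adjacent transpositions each cost 1.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def is_fuzzy_match(word, location):
    # One omission, substitution, or transposition is OK,
    # but an extra letter is not, so reject longer candidates.
    return len(word) <= len(location) and osa_distance(word.lower(), location.lower()) <= 1

is_fuzzy_match('Melan', 'Milan')   # True  (substitution)
is_fuzzy_match('Mlian', 'Milan')   # True  (transposition)
is_fuzzy_match('Milano', 'Milan')  # False (added letter)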
Pure-Python implementations like the sketch above get the job done, but you can leverage a few modules to make your life easier:
fuzzywuzzy
is a very common (pip-installable) package which implements this kind of matching through what it calls a ratio. It provides a bit more functionality than you may be looking for (partial string matching, ignoring punctuation marks, token order insensitivity, ...). The only drawback is that the ratio takes the lengths of the strings into account as well. Basic usage:
from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!") # 96
python-Levenshtein
is a pretty fast package because it is basically a Python wrapper around a C library. The documentation is not the nicest, but it should work. It is now back in the PyPI index, so it is pip-installable.
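As a quick illustration of its distance function, note that plain Levenshtein distance counts a transposition as two edits and an insertion as one, so a bare distance <= 1 threshold alone will not reproduce the rules in the question:

import Levenshtein
Levenshtein.distance("Milan", "Melan")   # 1 (one substitution)
Levenshtein.distance("Milan", "Mlian")   # 2 (a transposition counts as two edits here)
Levenshtein.distance("Milan", "Milano")  # 1 (one insertion, so 'Milano' would slip through)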