Reputation: 43
My regex search is matching a word at the end of a sentence, which I don't want.
>>> import re
>>> needle = 'miss'
>>> needle_regex = r"\b" + needle + r"\b"
>>> haystack = 'Cleveland, Miss. - This is the article'
>>> re.search(needle_regex, haystack, re.IGNORECASE)
<_sre.SRE_Match object; span=(11, 15), match='Miss'>
In this case, "Miss." is actually short for Mississippi, so it should not count as a match. How do I ignore end-of-sentence words like this but also ensure that
>>> haystack = "Website Miss.com some more text here"
would still be a match?
Upvotes: 0
Views: 92
Reputation: 18980
As already mentioned, language is fuzzy and regex is not a natural language processing tool. A feasible workaround is to exclude matches where the word is immediately followed by a punctuation mark (the Unicode category \p{P}) and then whitespace, e.g.
(?!\bmiss\p{P}\s)\bmiss\b
However, the standard re module does not support the \p{} syntax for Unicode code point properties, so we have to use the third-party regex module (an alternative to re), which supports that feature.
Code Sample:
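import regex as re  # third-party "regex" module, imported under the name "re"

# Skip "miss" when it is followed by punctuation (\p{P}) and then whitespace.
regex = r"(?!\bmiss\p{P}\s)\bmiss\b"
test_str = ("Cleveland, Miss. - This is the article\n"
            "Website Miss.com")
matches = re.finditer(regex, test_str, re.IGNORECASE | re.MULTILINE | re.UNICODE)
for match in matches:
    # Only the "Miss" in "Miss.com" is reported.
    print("Match at {start}-{end}: {match}".format(start=match.start(), end=match.end(), match=match.group()))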
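If installing the regex module is not an option, roughly the same idea can be sketched with the standard re module. This is only an approximation: the helper name find_word is made up for illustration, and the character class [^\w\s] is my stand-in for \p{P}, so it also treats symbols such as "$" as punctuation. It builds the pattern from the needle variable, as in the question, and moves the check into a negative lookahead after the word.

```python
import re

def find_word(needle, haystack):
    """Return whole-word matches of `needle`, skipping hits that are
    followed by a punctuation-like character and then whitespace."""
    # re.escape guards against regex metacharacters inside the needle.
    # [^\w\s] is a rough substitute for \p{P}: any character that is
    # neither a word character nor whitespace.
    pattern = r"\b" + re.escape(needle) + r"\b(?![^\w\s]\s)"
    return [m.group() for m in re.finditer(pattern, haystack, re.IGNORECASE)]

print(find_word("miss", "Cleveland, Miss. - This is the article"))  # []
print(find_word("miss", "Website Miss.com some more text here"))    # ['Miss']
```

Note that a "Miss." at the very end of the string would still match, because the lookahead only rejects punctuation that is followed by whitespace.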
Upvotes: 1