roysterphil

Reputation: 43

Match word but ignore end-of-sentence word

My regex search is matching a word that is at the end of the sentence.

>>> import re
>>> needle = 'miss'
>>> needle_regex = r"\b" + needle + r"\b"
>>> haystack = 'Cleveland, Miss. - This is the article'
>>> re.search(needle_regex, haystack, re.IGNORECASE)
<_sre.SRE_Match object; span=(11, 15), match='Miss'>

In this case, "Miss." is short for Mississippi, so it should not count as a match. How do I ignore such end-of-sentence words while ensuring that

>>> haystack = "Website Miss.com some more text here"

would still be a match?

Upvotes: 0

Views: 92

Answers (1)

wp78de

Reputation: 18980

Language is fuzzy, and regex is not a natural language processing tool. A feasible workaround is to exclude matches where the word is immediately followed by a punctuation mark (the \p{P} Unicode category) and then whitespace, e.g.

(?!\bmiss\p{P}\s)\bmiss\b


However, the standard re module does not support Unicode property escapes such as \p{P}. To use the \p{} syntax we have to use the third-party regex module (an alternative to re), which supports that feature.
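To see the limitation for yourself, here is a small sketch. The helper name `supports_unicode_properties` is made up for illustration; it just tries to compile a `\p{P}` pattern with whatever module you hand it.

```python
import re

def supports_unicode_properties(module):
    """Return True if `module` (re, or the third-party regex) compiles \\p{P}."""
    try:
        module.compile(r"\p{P}")
        return True
    except module.error:
        # The standard re module rejects \p as a bad escape.
        return False

stdlib_ok = supports_unicode_properties(re)
print(stdlib_ok)
```

Passing the regex module instead (assuming it is installed via pip) should return True, since it accepts the \p{} syntax.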

Code Sample:

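import regex as re  # third-party module (pip install regex); needed for \p{P}

# Skip "miss" when it is followed by a punctuation mark and then whitespace.
pattern = r"(?!\bmiss\p{P}\s)\bmiss\b"
test_str = ("Cleveland, Miss. - This is the article\n"
            "Website Miss.com")
# IGNORECASE is the only flag needed: the pattern has no ^ or $ anchors.
matches = re.finditer(pattern, test_str, re.IGNORECASE)
for match in matches:
    print("Match at {start}-{end}: {match}".format(
        start=match.start(), end=match.end(), match=match.group()))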
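If installing the regex module is not an option, a rough approximation with the standard re module might look like the sketch below. It swaps \p{P} for the ASCII characters in `string.punctuation`, so it will not catch non-ASCII punctuation, and the function name `find_word` is only illustrative. The lookahead is placed after the word, which should be equivalent to the pattern above.

```python
import re
import string

def find_word(needle, haystack):
    """Find `needle` as a whole word, case-insensitively, skipping hits that
    are followed by ASCII punctuation and then whitespace (e.g. "Miss. ")."""
    # Assumption: ASCII punctuation stands in for \p{P} here.
    punct = re.escape(string.punctuation)
    pattern = r"\b{word}\b(?![{punct}]\s)".format(
        word=re.escape(needle), punct=punct)
    return [m.group() for m in re.finditer(pattern, haystack, re.IGNORECASE)]

print(find_word("miss", "Cleveland, Miss. - This is the article"))
print(find_word("miss", "Website Miss.com some more text here"))
```

Like the original pattern, this still matches "Miss." at the very end of a string, because no whitespace follows the period there.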

Upvotes: 1
