Ingrid

Reputation: 526

Python regex to avoid matching word followed by multiple conditions

I want to match all-uppercase words that appear in the middle of a sentence, using Python 3. This is my current regex:

.+?\b([A-Z]+)\b(?=[^.!?][^ ])

I want to skip words that are followed by one of the characters .!? and then a space. But this expression also skips a word that is followed by a period and no space. What is my mistake?

For example, at the moment I get the same result from re.findall() with and without a space at the end of the searched string:

>>> re.findall(r'.+?\b([A-Z]+)\b(?=[^.!?][^ ])','NO YES YES YES YES NO. ')
['YES', 'YES', 'YES', 'YES']
>>> re.findall(r'.+?\b([A-Z]+)\b(?=[^.!?][^ ])','NO YES YES YES YES NO.')
['YES', 'YES', 'YES', 'YES']

Upvotes: 2

Views: 699

Answers (2)

LetzerWille

Reputation: 5658

print(re.findall(r'[^A-Z](.+)[^A-Z]\S+\s*$','NO YES YES YES YES NO. '))

['YES YES YES YES']

print(re.findall(r'[^A-Z](.+)[^A-Z]\S+\s*$','NO YES YES YES YES NO.'))

['YES YES YES YES']
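
Note that this pattern returns the middle of the sentence as one string, and it does not itself check that the captured words are uppercase. If a list of individual words is wanted, as in the question, the captured group could be split on whitespace. A minimal sketch, assuming the input is a single sentence; middle_words is a hypothetical helper name, not part of the answer above:

```python
import re

# Same pattern as in the answer above.
MIDDLE = re.compile(r'[^A-Z](.+)[^A-Z]\S+\s*$')

def middle_words(sentence):
    """Return the captured middle part of a single sentence as a list of words.

    Hypothetical helper: returns an empty list when the pattern does not match.
    """
    match = MIDDLE.search(sentence)
    return match.group(1).split() if match else []

print(middle_words('NO YES YES YES YES NO. '))
print(middle_words('NO YES YES YES YES NO.'))
```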

Upvotes: 0

anubhava

Reputation: 784998

Try this regex with a negative lookahead:

r'(?!^)\b([A-Z]+)\b(?![.!?] )'

(?!^) skips a word at the very start of the string, i.e. the first word of the sentence.

(?![.!?] ) fails the match when the word is followed by one of '.', '!' or '?' and then a space.

Examples:

>>> re.findall(r'(?!^)\b([A-Z]+)\b(?![.!?] )','NO YES YES YES YES NO.')
['YES', 'YES', 'YES', 'YES', 'NO']

>>> re.findall(r'(?!^)\b([A-Z]+)\b(?![.!?] )','NO YES YES YES YES NO. ')
['YES', 'YES', 'YES', 'YES']
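
As for the mistake in the original pattern, my reading is this: (?=[^.!?][^ ]) requires that the next character is not punctuation AND that the one after it is not a space, so a word followed by '.' is rejected no matter what comes next. Rejecting only the sequence "punctuation then space" needs the negation around the whole sequence, which is what the negative lookahead does. A small side-by-side sketch (the constant and variable names are only for illustration):

```python
import re

# Pattern from the question: positive lookahead over two negated classes.
# Requires "next char is not .!?" AND "the char after that is not a space".
QUESTION_PATTERN = re.compile(r'.+?\b([A-Z]+)\b(?=[^.!?][^ ])')

# Pattern from this answer: negative lookahead over the unwanted sequence.
# Rejects a word only when one of .!? AND then a space follow it.
ANSWER_PATTERN = re.compile(r'(?!^)\b([A-Z]+)\b(?![.!?] )')

samples = ['NO YES YES YES YES NO. ', 'NO YES YES YES YES NO.']
question_results = [QUESTION_PATTERN.findall(s) for s in samples]
answer_results = [ANSWER_PATTERN.findall(s) for s in samples]

for text, old, new in zip(samples, question_results, answer_results):
    print(repr(text), 'question:', old, 'answer:', new)
```

Keep in mind that without re.MULTILINE, ^ only matches at the start of the string, so (?!^) excludes the first word of the string rather than the first word of every sentence in a longer text.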

Upvotes: 1
