Reputation: 4298
I am using Python's re module to capture all forms of the word color in American English (AmE) and British English (BrE). I successfully captured almost all of them, except for words that end with an apostrophe, e.g. colors’.
This problem is from Watt's book Beginning Regular Expressions.
Here's sample text:
Red is a color.
His collar is too tight or too colouuuurful.
These are bright colours.
These are bright colors.
Calorific is a scientific term.
“Your life is very colorful,” she said.
color (U.S. English, singular noun)
colour (British English, singular noun)
colors (U.S. English, plural noun)
colours (British English, plural noun)
color’s (U.S. English, possessive singular)
colour’s (British English, possessive singular)
colors’ (U.S. English, possessive plural)
colours’ (British English, possessive plural)
Here's my regex: \bcolou?r(?:[a-zA-Z’s]+)?\b
Explanation:
\b # Start at word boundary
colou?r # u is optional (absent in AmE)
(?: # Start non-capturing group
[a-zA-Z’s]+ # color may be followed by a suffix (e.g. ful) or an apostrophe
)? # End non-capturing group; the suffix is optional
\b # End at word boundary
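For anyone who wants to run the commented breakdown directly, here is a minimal sketch using re.VERBOSE, which lets whitespace and # comments sit inside the pattern. The name COLOR_RE and the test sentence are only illustrative and not part of the original question.

```python
import re

# Verbose form of the pattern from the question. Under re.VERBOSE,
# whitespace and # comments outside a character class are ignored.
COLOR_RE = re.compile(
    r"""
    \b               # start at a word boundary
    colou?r          # color (AmE) or colour (BrE)
    (?:              # non-capturing group
        [a-zA-Z’s]+  # letters or a typographic apostrophe; s is already in a-z
    )?               # the whole suffix is optional
    \b               # end at a word boundary
    """,
    re.VERBOSE,
)

print(COLOR_RE.findall("These are bright colours. His collar is colorful."))
# ['colours', 'colorful']
```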
The issue is that colors’ and colours’ are matched only up to the s; the apostrophe is ignored. Can someone please explain what is wrong with my code? I researched this on SO ("Regex Apostrophe how to match?"), but the problems there are about escaping ' and ".
Here's Regex101
Thanks in advance.
Upvotes: 1
Views: 610
Reputation: 856
The problem is the trailing \b. By definition:
\b Matches, without consuming any characters, immediately between a character matched by \w and a character not matched by \w (in either order). It cannot be used to separate non words from words.
’ is not in the \w group, so there is no word boundary between it and the space that follows.
Try removing the trailing \b: \bcolou?r(?:[a-zA-Z’s]+)?
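A quick way to see the difference is to run both patterns over the same text. This is only a sketch; the variable names and the extra token color2 are made up for illustration. It also shows one side effect of dropping the trailing \b: the pattern now matches the start of tokens such as color2, which the original rejected.

```python
import re

# Pattern from the question, with the trailing word boundary.
with_boundary = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?\b")
# Same pattern with the trailing \b removed, as suggested above.
without_boundary = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?")

text = "color’s colors’ colours’ color2"

# The regex backtracks off the final ’ to find a position where \b holds.
print(with_boundary.findall(text))
# ['color’s', 'colors', 'colours']

# Without \b the trailing ’ is kept, but color2 now yields a partial match.
print(without_boundary.findall(text))
# ['color’s', 'colors’', 'colours’', 'color']
```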
Upvotes: 0
Reputation: 370949
The problem is that \b is a word boundary, and with ...lors’, the position between the ’ and the following space is not a word boundary, because neither the ’ nor the space is a word character. Instead of \b, use a lookahead for a space, a period, a comma, or whatever else may come afterwards:
\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])
https://regex101.com/r/lB49Nr/3
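Here is a small sketch of the lookahead version in Python. The second pattern, which also accepts the end of the string via |$, is my own variation and not part of the pattern above; it is included because the plain lookahead needs a following character and so misses a word that ends the text.

```python
import re

# Lookahead version from this answer: a space, period, or comma must follow.
lookahead = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])")
# Assumed variation: also accept the end of the string.
lookahead_or_end = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,]|$)")

text = "Red is a color. colorful, colours’ (plural) colors’"

print(lookahead.findall(text))
# ['color', 'colorful', 'colours’']

print(lookahead_or_end.findall(text))
# ['color', 'colorful', 'colours’', 'colors’']
```

If other characters can follow the word (a closing quote, a semicolon, a newline), they would need to be added to the character class in the lookahead.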
Upvotes: 2