watchtower

Reputation: 4298

Capturing apostrophe using regex

I am using Python's re module to capture all forms of the word color in American English (AmE) and British English (BrE). I have captured almost all of them, with the exception of words that end with an apostrophe, e.g. colors’. This problem is from Watt's Beginning Regular Expressions book.

Here's sample text:

Red is a color.
His collar is too tight or too colouuuurful.
These are bright colours.
These are bright colors.
Calorific is a scientific term.
“Your life is very colorful,” she said.
color (U.S. English, singular noun)
colour (British English, singular noun)
colors (U.S. English, plural noun)
colours (British English, plural noun)
color’s (U.S. English, possessive singular)
colour’s (British English, possessive singular)
colors’ (U.S. English, possessive plural)
colours’ (British English, possessive plural)

Here's my regex: \bcolou?r(?:[a-zA-Z’s]+)?\b

Explanation:

\b                 # Start at a word boundary
colou?r            # u is optional (absent in AmE)
    (?:            # Non-capturing group
    [a-zA-Z’s]+    # color may be followed by a suffix (e.g. ful) or an apostrophe
    )?             # End non-capturing group; the suffix is optional
\b                 # End at a word boundary

The issue is that colors’ and colours’ are matched only up to the s; the trailing apostrophe is left out. Can someone please explain what is wrong with my regex? I researched this on SO (Regex Apostrophe how to match?), but the problems there are about escaping ' and ".
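The behaviour described above can be reproduced with a short script. This is only a sketch: the helper name first_match is made up for illustration, and the inputs are two lines from the sample text.

```python
import re

# Pattern from the question, including the trailing \b.
pattern = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?\b")

def first_match(text):
    """Return the first match of the pattern in text, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group() if m else None

# The match stops at the s: ’ (U+2019) is not a \w character, so the
# position between ’ and the following space is not a word boundary.
print(first_match("colors’ (U.S. English, possessive plural)"))    # colors

# Here the match ends in s, a word character, so the trailing \b succeeds.
print(first_match("color’s (U.S. English, possessive singular)"))  # color’s
```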

Here's Regex101

Thanks in advance.

Upvotes: 1

Views: 610

Answers (2)

digitake

Reputation: 856

The problem is the trailing \b. By definition:

\b Matches, without consuming any characters, immediately between a character matched by \w and a character not matched by \w (in either order). It cannot be used to separate non words from words.

’ is not in the \w group, and neither is the space after it, so there is no word boundary after the apostrophe. Try removing the trailing \b: \bcolou?r(?:[a-zA-Z’s]+)?
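A minimal sketch of this suggestion in Python, run against a few lines from the question's sample text (the variable names are just for illustration):

```python
import re

# The question's pattern with the trailing \b removed.
pattern = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?")

sample = """These are bright colors.
“Your life is very colorful,” she said.
color’s (U.S. English, possessive singular)
colours’ (British English, possessive plural)"""

# findall returns whole matches because the only group is non-capturing.
matches = pattern.findall(sample)
print(matches)  # ['colors', 'colorful', 'color’s', 'colours’']
```

The greedy character class already stops at the first character outside [a-zA-Z’s], so the trailing \b is not needed to end the match. One trade-off: without it, the pattern would also match the color part of something like color5.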

Upvotes: 0

CertainPerformance

Reputation: 370949

The problem is that \b is a word boundary, and with ...lors’, the position between the ’ and the following space is not a word boundary, because neither the ’ nor the space is a word character. Instead of \b, use a lookahead for a space, a period, a comma, or whatever else may come afterwards:

\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])

https://regex101.com/r/lB49Nr/3
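As a rough check, the lookahead version can be tried in Python on some of the sample lines from the question (the list name sample_lines is just for illustration):

```python
import re

# Trailing \b replaced by a lookahead for a space, period, or comma.
pattern = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])")

sample_lines = [
    "Red is a color.",
    "His collar is too tight or too colouuuurful.",
    "These are bright colours.",
    "“Your life is very colorful,” she said.",
    "color’s (U.S. English, possessive singular)",
    "colors’ (U.S. English, possessive plural)",
    "colours’ (British English, possessive plural)",
]

# Collect every match, line by line.
matches = [m for line in sample_lines for m in pattern.findall(line)]
print(matches)
# ['color', 'colours', 'colorful', 'color’s', 'colors’', 'colours’']
```

Note that the lookahead needs an actual character after the word, so a word at the very end of the string is not matched as written. If that case matters, the lookahead could be extended to (?=[ .,]|$).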

Upvotes: 2
