Python 3 regex word boundary unclear

Question

I am using a regex to find the string 'my car' and detect up to four words before it. My reference text is:

my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly.

If I use the regex:

re.finditer(r'(?:\w+[ 	,]+){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)

I am getting the expected results. For example: house is painted white, my car

But if I use the regex:

re.finditer(r'(?:\w+\b){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)

I am getting only 'my car' and 'my car'. That is, I am not getting up to four words before it. Why can't I use \b to match the words in the {0,4} group?

Wiktor Stribiżew · Accepted Answer

Because \b is a zero-width word-boundary assertion. It matches a position, not a character: between the start of the string and a word char, between a non-word char and a word char, between a word char and a non-word char, or between a word char and the end of the string. It does not consume any text.
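To make the zero-width behaviour concrete, here is a minimal sketch using only the standard `re` module. The sample string 'my car' and the variable names are chosen for illustration; they are not from the original answer.

```python
import re

sample = "my car"

# Every \b match is an empty string; only its position is meaningful.
boundary_matches = list(re.finditer(r"\b", sample))
positions = [m.start() for m in boundary_matches]
matched_text = [m.group() for m in boundary_matches]

print(positions)     # [0, 2, 3, 6]
print(matched_text)  # ['', '', '', '']

# Because \b consumes nothing, \w+\b\w+ still needs a word char right after
# the first word, but the next character is the space, so it cannot match.
no_separator = re.fullmatch(r"\w+\b\w+", sample)

# Consuming the separator explicitly with \W+ lets the match continue.
with_separator = re.fullmatch(r"\w+\W+\w+", sample)

print(no_separator)            # None
print(with_separator.group())  # my car
```

The boundaries sit before "m", after "y", before "c" and after "r"; the space itself is never part of a \b match.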

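The (?:\w+\b){0,4} just matches an empty string, since there is no run of 1+ word chars followed by a word boundary immediately before my car: after \w+\b matches a word, the next char is a space or comma, which neither another \w+ nor the literal my car can match.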

Instead, you may want to match 1+ non-word chars after each word; this consumes the separators and effectively imitates a word boundary:

(?:\w+\W+){0,4}my car\b

See the regex demo
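As a quick check, the two patterns can be compared on the reference text from the question. This is a small sketch rather than part of the original answer; the variable names are illustrative, and re.MULTILINE is left out because neither pattern uses ^ or $.

```python
import re

txt = (
    "my house is painted white, my car is red.\n"
    "A horse is galloping very fast in the road, I drive my car slowly."
)

# \b consumes nothing, so the group can only succeed with zero repetitions.
boundary_only = [
    m.group()
    for m in re.finditer(r"(?:\w+\b){0,4}my car", txt, re.IGNORECASE)
]

# \W+ consumes the spaces and commas between words, so up to four
# preceding words are captured.
with_separators = [
    m.group()
    for m in re.finditer(r"(?:\w+\W+){0,4}my car\b", txt, re.IGNORECASE)
]

print(boundary_only)    # ['my car', 'my car']
print(with_separators)  # ['house is painted white, my car', 'the road, I drive my car']
```

Note that \W+ also matches newlines, so with this pattern the preceding words may come from the previous line; the [ \t,]+ class from the question would not cross a line break.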
