andreSmol

Reputation: 1038

Python 3 regex word boundary unclear

I am using a regex to find the string 'my car' and detect up to four words before it. My reference text is:

my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly.

If I use the regex:

re.finditer(r'(?:\w+[ \t,]+){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)

I get the expected results. For example: house is painted white, my car

But if I use the regex:

re.finditer(r'(?:\w+\b){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)

I get only 'my car' and 'my car'. That is, I do not get the up to four words before it. Why can't I use \b to match the words in the {0,4} group?

Upvotes: 3

Views: 524

Answers (2)

Wiktor Stribiżew

Reputation: 626929

Because \b is a zero-width word boundary assertion. It matches a position, not a character: between the start of the string and a word char, between a non-word char and a word char, between a word char and a non-word char, and between a word char and the end of the string. It does not consume any text.
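A minimal sketch to make the "zero-width" part visible, assuming Python 3.7 or later (where `re.finditer` reports empty matches as shown here). The sample string and variable names are only illustrative:

```python
import re

text = "my car"

# Every match of \b is empty: its start and end offsets are identical,
# so the boundary marks a position and consumes no characters.
boundary_spans = [m.span() for m in re.finditer(r"\b", text)]
print(boundary_spans)  # [(0, 0), (2, 2), (3, 3), (6, 6)]
```

Note that the space at index 2 is never part of any match: the boundaries sit on either side of it.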

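Here, (?:\w+\b){0,4} ends up matching only an empty string. After \w+\b matches one word, the regex is positioned in front of a space that \b did not consume, so neither a second \w+ nor my car can match there. The only way for the whole pattern to succeed is with zero repetitions of the group, right at my car.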
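This can be checked step by step. The snippet below is only an illustrative sketch using one sentence from the question; the variable names are made up for the example:

```python
import re

text = "I drive my car slowly."

# \w+\b consumes "drive" and stops in front of the space (offset 5).
one_word = re.match(r"\w+\b", "drive my car")
print(one_word.group(), one_word.end())  # drive 5

# Nothing in the pattern can consume that space, so the group cannot
# chain words together; only the zero-repetition case matches.
only_target = re.findall(r"(?:\w+\b){0,4}my car", text)
print(only_target)  # ['my car']
```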

Instead, you may want to match 1+ non-word chars that will effectively imitate a word boundary:

(?:\w+\W+){0,4}my car\b

See the regex demo
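As a rough sketch, this is how the pattern above could be run in Python against the text from the question. The `txt` variable is rebuilt here from the two sample lines, and `re.IGNORECASE` is carried over from the question's own call:

```python
import re

txt = (
    "my house is painted white, my car is red.\n"
    "A horse is galloping very fast in the road, I drive my car slowly."
)

# \W+ consumes the separators (spaces, commas, ...) between words,
# so the group can repeat up to four times before "my car".
pattern = re.compile(r"(?:\w+\W+){0,4}my car\b", re.IGNORECASE)
matches = [m.group() for m in pattern.finditer(txt)]
print(matches)
# ['house is painted white, my car', 'the road, I drive my car']
```

Keep in mind that \W+ also matches line breaks and periods, so the preceding words may be taken from the previous line or sentence. If that is not wanted, a narrower class such as the [ \t,]+ from the question is the safer choice.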

Upvotes: 2

Jan

Reputation: 43169

You could use:

(?:\b\w+\W+){0,4}
\b(?:my\ car)\b

See a demo on regex101.com.


In Python this will be:

import re

rx = re.compile(r'''
                (?:\b\w+\W+){0,4}
                \b(?:my\ car)\b
                ''', re.VERBOSE)

string = """
my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly.
"""
words = rx.findall(string)
print(words)
# ['house is painted white, my car', 'the road, I drive my car']

Upvotes: 2
