Reputation: 1038
I am using a regex to find the string 'my car' and detect up to four words before it. My reference text is:
my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly.
If I use the regex:
re.finditer(r'(?:\w+[ \t,]+){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)
I get the expected results. For example: house is painted white, my car
But if I use the regex:
re.finditer(r'(?:\w+\b){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)
I get only 'my car' and 'my car'. That is, I am not getting up to four words before it. Why can't I use \b to match the words in the {0,4} group?
Upvotes: 3
Views: 524
Reputation: 626929
Because \b is a zero-width word-boundary assertion. It matches a position: between the start of the string and a word char, between a non-word char and a word char, between a word char and a non-word char, or between a word char and the end of the string. It does not consume any text.
The group (?:\w+\b){0,4} therefore matches only an empty string here. \w+\b consumes a word but not the space or comma after it, so my car would have to start immediately after a word char. That can never happen, because \b requires a non-word char (or the end of the string) right after the word.
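A quick way to check this in Python is the sketch below. It reuses the two sample lines from the question; the variable names are only illustrative.

```python
import re

txt = (
    "my house is painted white, my car is red.\n"
    "A horse is galloping very fast in the road, I drive my car slowly."
)

# \b matches positions, not characters, so every match is an empty string.
# "my car" has four boundaries: before and after "my", before and after "car".
boundaries = re.findall(r"\b", "my car")
print(boundaries)

# The group can only repeat zero times, so no preceding words are captured.
no_context = [
    m.group()
    for m in re.finditer(r"(?:\w+\b){0,4}my car", txt, re.IGNORECASE)
]
print(no_context)
```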
Instead, you may want to consume one or more non-word chars after each word. That has the effect of a word boundary while also matching the separators:
(?:\w+\W+){0,4}my car\b
See the regex demo
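As a rough check, the pattern can be run against the sample text from the question. This is only a sketch, and the variable names are illustrative.

```python
import re

txt = (
    "my house is painted white, my car is red.\n"
    "A horse is galloping very fast in the road, I drive my car slowly."
)

# \W+ consumes the spaces and commas between words, so the group can repeat.
pattern = re.compile(r"(?:\w+\W+){0,4}my car\b", re.IGNORECASE)
matches = [m.group() for m in pattern.finditer(txt)]
print(matches)
```

Note that \W also matches a newline, so the preceding words may come from the previous line when my car appears near the start of a line. If that is not wanted, a narrower class such as the [ \t,]+ from the question avoids it.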
Upvotes: 2
Reputation: 43169
You could use (in verbose mode, hence the escaped space):
(?:\b\w+\W+){0,4}
\b(?:my\ car)\b
In Python, this would be:
import re
rx = re.compile(r'''
(?:\b\w+\W+){0,4}
\b(?:my\ car)\b
''', re.VERBOSE)
string = """
my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly.
"""
words = rx.findall(string)
print(words)
# ['house is painted white, my car', 'the road, I drive my car']
Upvotes: 2