Reputation: 1024
I am trying to match the word that appears immediately after a number - in the sentence below, it is the word "meters".
The tower is 100 meters tall.
Here's the pattern that I tried which didn't work:
\d+\s*(\b.+\b)
But this one did:
\d+\s*(\w+)
The first, incorrect pattern matched this (the capturing group held "meters tall"):
100 meters tall
I didn't want the word "tall" to be matched. I expected the following behavior:
\d+  match one or more digits
\s*  match zero or more whitespace characters
(    start a new capturing group
\b   find a word/non-word boundary
.+   match one or more of any character except a newline
\b   find the next word/non-word boundary
)    close the capturing group
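To see what each pattern actually captures, here is a minimal sketch using Python's re module on the sentence from the question. The variable names are only illustrative.

```python
import re

sentence = "The tower is 100 meters tall."

# Greedy .+ first runs to the end of the line, then backtracks only as far
# as the LAST position where \b holds (between "tall" and the period).
greedy = re.search(r"\d+\s*(\b.+\b)", sentence)

# \w+ cannot cross the space, so it stops at the end of the first word.
word_only = re.search(r"\d+\s*(\w+)", sentence)

print(greedy.group(1))     # meters tall
print(word_only.group(1))  # meters
```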
The problem is I don't know tiddly-twat about regex, and I am as much of a noob as a noob can be. I am practicing by making up my own problems and trying to solve them, and this is one of them. Why didn't the match stop at the second boundary (\b)?
This is the Python flavor.
Here's the regex101 test link of the above regex.
Upvotes: 2
Views: 241
Reputation: 70732
It didn't stop because + is greedy by default; you want +? for a non-greedy match.
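As a quick check, here is a sketch of the question's pattern with the quantifier made non-greedy, run against the same sentence:

```python
import re

sentence = "The tower is 100 meters tall."

# .+? grows one character at a time and stops at the FIRST position
# where the trailing \b holds, which is the end of "meters".
lazy = re.search(r"\d+\s*(\b.+?\b)", sentence)

print(lazy.group(1))  # meters
```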
A concise explanation: * and + are greedy quantifiers, meaning they match as much as they can while still allowing the remainder of the regular expression to match.
Follow these operators with ? for a non-greedy match: in the same order, *? means "zero or more" and +? means "one or more", but preferably as few as possible.
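The greedy and non-greedy forms can be compared side by side. This is only an illustrative sketch; the sample strings are made up for the demonstration.

```python
import re

text = "<a><b>"

# Greedy + takes as much as it can while still letting the final > match.
greedy_plus = re.search(r"<.+>", text).group()
# Non-greedy +? takes as little as it can.
lazy_plus = re.search(r"<.+?>", text).group()

# The same contrast for *: the lazy form is satisfied with zero characters.
greedy_star = re.search(r"a*", "aaa").group()
lazy_star = re.search(r"a*?", "aaa").group()

print(greedy_plus)        # <a><b>
print(lazy_plus)          # <a>
print(repr(greedy_star))  # 'aaa'
print(repr(lazy_star))    # ''
```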
Also, a word boundary \b matches a position where one side is a word character (a letter, digit, or underscore; in Python 3 this includes Unicode letters and digits) and the other side is not a word character. I wouldn't use \b around .+ if you're unclear about what's in between the boundaries.
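To make the boundary positions concrete, this small sketch lists every index where \b matches in part of the question's sentence. Note the raw string: in a normal Python string literal, "\b" is a backspace character.

```python
import re

text = "100 meters tall."

# \b is zero-width, so each match is an empty string at a boundary position.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

print(boundaries)  # [0, 3, 4, 10, 11, 15]
```

There is a boundary on both sides of every word, including around "100", since digits count as word characters. There is none after the final period, because neither side of that position is a word character.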
Upvotes: 8
Reputation: 5395
It matches both words because . matches (nearly) all characters, including the space character, and because + is greedy, so it matches as much as it can. If you used \w instead of ., it would work (because \w matches only word characters: a-zA-Z_0-9 in ASCII terms).
Upvotes: 1