Reputation: 83
I've just started learning regex and I've been stuck on this lesson for a while now.
I don't quite understand why the two patterns below won't work.
^(.+(\d+))$
^([a-zA-Z_]+(\d+))$
Upvotes: 2
Views: 3485
Reputation:
For ^(.+(\d+))$, put an extra capture group around the .+ part, giving ^((.+)(\d+))$ (the expanded pattern and its output are at the end of this answer).
From the output we see that .+ initially matches the entire string Jan 1987, then backtracks one character at a time until the next sub-expression (\d+) can be satisfied.
It gives back the character 7 and sees that this satisfies (\d+).
Its job is done: all sub-expressions are satisfied, so it stops.
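The greedy behavior described above can be reproduced with a minimal Python sketch. Python's re module is an assumption here (the question does not name an engine), but any backtracking engine should behave the same way on this pattern.

```python
import re

text = "Jan 1987"

# Greedy .+ first grabs the whole string, then gives back one character
# at a time until (\d+) can match. A single digit is enough for \d+.
greedy = re.match(r"^((.+)(\d+))$", text)

print(greedy.group(1))  # Jan 1987
print(greedy.group(2))  # Jan 198
print(greedy.group(3))  # 7
```

Group 3 ends up with only the final digit because \d+ needs just one character to succeed, and the engine stops backtracking as soon as the whole pattern matches.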
That's typical behavior of greedy quantifiers. Some things that could disrupt this behavior:
Add the non-greedy (lazy) modifier ? to the quantifier, which looks like this: (.+?). Instead of initially matching the entire string, it matches incrementally, one character at a time.
Each time it matches a character, it checks whether the next sub-expression \d+ matches at the following position. When it does, the engine leaves the current sub-expression .+? and continues with the next one, \d+.
The process continues on to the next sub-expression, and so on.
Any time a match fails along the way, the engine goes back to the previous successful sub-expression at its last successful match position, adjusts that position (a greedy quantifier gives up one character, a lazy one takes one more), and repeats the whole process.
It could go all the way back to the very first sub-expression this way. This is called backtracking.
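As a quick check of the lazy variant, here is a small Python sketch (again assuming Python's re module, which the thread itself does not specify).

```python
import re

text = "Jan 1987"

# Lazy .+? takes as little as possible: it grows one character at a time
# until (\d+) followed by $ can match the rest of the string.
lazy = re.match(r"^(.+?(\d+))$", text)

print(lazy.group(1))  # Jan 1987
print(lazy.group(2))  # 1987
```

Here .+? stops after "Jan " because that is the first position where \d+ can match and reach the end of the string.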
Keep the greedy sub-expression .+ but, at certain points, add known literal character(s) just before the next sub-expression.
These are waypoints for the engine to anchor on and are called pseudo-anchors.
They limit where backtracking can stop. For example, you could add a space literal just before the \d+ sub-expression, like ^(.+ (\d+))$.
This forces the engine to backtrack from the last digit until it finds a space, letting \d+ consume all the year digits.
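A short Python sketch of the space pseudo-anchor follows. The second input string, "1 Jan 1987", is a hypothetical example that is not part of the original lesson; it is included only to show that greedy .+ backs up to the last space.

```python
import re

# The literal space gives the engine a fixed point to land on while
# greedy .+ backtracks from the end of the string.
pattern = r"^(.+ (\d+))$"

simple = re.match(pattern, "Jan 1987")
print(simple.group(2))  # 1987

# Hypothetical input with two spaces: .+ backs up only as far as the
# last space, so the year is still captured whole.
two_spaces = re.match(pattern, "1 Jan 1987")
print(two_spaces.group(2))  # 1987
```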
Stay greedy, but reduce the allowed characters in sub-expressions.
Instead of using the dot metacharacter (which matches any character), specify a character class that excludes what the next sub-expression needs (here, digits).
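The restricted-class idea might look like the Python sketch below. The choice of \D (any non-digit) is an assumption made for illustration; the answer does not name a specific class. The sketch also shows why the question's second pattern fails on this input: its class does not include the space.

```python
import re

text = "Jan 1987"

# The class from the question has no space in it, so it can never get
# past "Jan" to reach the digits, and the overall match fails.
no_space = re.match(r"^([a-zA-Z_]+(\d+))$", text)
print(no_space)  # None

# \D+ (assumed here) stays greedy but cannot swallow any digits,
# so \d+ receives the whole year.
non_digits = re.match(r"^(\D+(\d+))$", text)
print(non_digits.group(2))  # 1987
```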
In general, use the dot metacharacter where you either don't care what is there or don't know what's there. It is generally cheap to match, though it is not guaranteed to be faster than a class.
However, when using it, try to have a strategy of pseudo-anchors in the following sub-expression. This lets the engine zero in on the right position in your subject text.
^
( # (1 start)
( .+ ) # (2)
( \d+ ) # (3)
) # (1 end)
$
Output:
** Grp 0 - ( pos 0 , len 8 )
Jan 1987
** Grp 1 - ( pos 0 , len 8 )
Jan 1987
** Grp 2 - ( pos 0 , len 7 )
Jan 198
** Grp 3 - ( pos 7 , len 1 )
7
Upvotes: 1
Reputation: 5456
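Both are close, but you need a space before the nested group.
Your second regex would capture the year if the text were, for example, Jan1987 (the first would match too, but its greedy .+ would leave only the final digit for \d+). But the examples in the link you posted look like Jan 1987.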
Try:
^(.+ (\d+))$
^([a-zA-Z_]+ (\d+))$
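To compare the original patterns with the two suggested ones, here is a small Python sketch. The helper name year is hypothetical, and Python's re module is assumed since the thread does not name an engine.

```python
import re

def year(pattern, text):
    """Return what the nested (\\d+) group captured, or None if no match."""
    m = re.match(pattern, text)
    return m.group(2) if m else None

original = [r"^(.+(\d+))$", r"^([a-zA-Z_]+(\d+))$"]
with_space = [r"^(.+ (\d+))$", r"^([a-zA-Z_]+ (\d+))$"]

for text in ("Jan1987", "Jan 1987"):
    for pattern in original + with_space:
        print(repr(text), pattern, "->", year(pattern, text))

# On "Jan1987":  original patterns give "7" and "1987"; with_space give None.
# On "Jan 1987": original patterns give "7" and None; with_space give "1987".
```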
Upvotes: 1