user3204597

Reputation: 83

Regex capture group nesting

I've just started learning regex and I've been stuck on this lesson for a while now.

I don't quite understand why the two patterns below won't work.

^(.+(\d+))$

^([a-zA-Z_]+(\d+))$

Upvotes: 2

Answers (2)

user557597

Reputation:

For ^(.+(\d+))$, put a capture group around .+ to see what it consumes
(the expanded pattern and its output are shown further down).
The output shows that .+ initially matches the entire string Jan 1987,
then backtracks one character at a time until the engine
can satisfy the next sub-expression (\d+).
It gives back the character 7, and that alone satisfies \d+.
Its job is done: all sub-expressions are satisfied, so it stops.

That's typical behavior of greedy quantifiers. Some things that could disrupt this behavior:

Add a non-greedy (lazy) modifier ? to the quantifier.
It looks like this: (.+?). Instead of initially matching
the entire string, it matches one character at a time.
Each time it matches a character, it checks whether the text that follows
would match the next sub-expression \d+. Once it does, the engine leaves the current sub-expression .+? and continues with the next one, \d+.
The process continues onto the following sub-expression, and so on.
Any time it fails along the way, the engine goes back to the previous successful sub-expression, at its previous successful match position,
then adjusts that match position (a greedy quantifier gives back a character, a lazy one takes one more) and repeats the whole process.
It could go all the way back to the very first sub-expression with this
process. This is called backtracking.

Keep the greedy sub-expression .+ but at certain points add
known literal character(s) just before the next sub-expression.
These are waypoints for the engine to anchor on and are called pseudo-anchors.
They control where backtracking stops. For example, you could have added a space literal
just before the \d+ sub-expression, like ^(.+ (\d+))$. This forces the
engine to backtrack from the last digit until it finds a space, letting \d+ consume all the year digits.

Stay greedy, but reduce the allowed characters in sub-expressions.
Instead of using the Dot meta-character (matches any character), specify
a class of limited characters.
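
A minimal Python sketch of the three options above, using the standard re module on the sample text Jan 1987. The helper name year and the \D+ pattern used for the third option are illustrative assumptions, not something from the lesson; other regex engines should behave the same way for these patterns.

```python
import re

text = "Jan 1987"

def year(pattern):
    """Return what the nested (second) group captured, or None if no match."""
    m = re.match(pattern, text)
    return m.group(2) if m else None

# Original greedy pattern: .+ gives back only one character.
greedy = year(r"^(.+(\d+))$")

# Option 1: lazy quantifier, .+? grows one character at a time.
lazy = year(r"^(.+?(\d+))$")

# Option 2: a literal space (pseudo-anchor) before the digits.
literal = year(r"^(.+ (\d+))$")

# Option 3: restrict the characters; \D (non-digit) is an assumed choice here.
restricted = year(r"^(\D+(\d+))$")

print(greedy)      # 7
print(lazy)        # 1987
print(literal)     # 1987
print(restricted)  # 1987
```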

In general, use the Dot meta-character where either you don't care what is
there, or don't know what's there. It is cheap to match, though not always faster
overall than a class, since a broad match can cause extra backtracking.
When you use it, try to have a strategy of pseudo-anchors in the
following sub-expression. This will allow the engine to zero in, so that it
matches at the right point in your subject text.

 ^ 
 (                             # (1 start)
      ( .+ )                        # (2)
      ( \d+ )                       # (3)
 )                             # (1 end)
 $

Output:

  **  Grp 0 -  ( pos 0 , len 8 ) 
 Jan 1987  
  **  Grp 1 -  ( pos 0 , len 8 ) 
 Jan 1987  
  **  Grp 2 -  ( pos 0 , len 7 ) 
 Jan 198  
  **  Grp 3 -  ( pos 7 , len 1 ) 
 7
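
The group contents above can be reproduced with a short Python sketch, assuming the standard re module (the output above came from a different tool, so the formatting differs). Group 2 is the extra group around .+ and group 3 is the nested \d+ group.

```python
import re

text = "Jan 1987"

# Same pattern as the expanded form above: an extra group around .+
# exposes how much the greedy quantifier kept after backtracking.
m = re.match(r"^((.+)(\d+))$", text)

for i in range(4):
    start, end = m.span(i)
    print(f"Grp {i} - (pos {start}, len {end - start}) {m.group(i)!r}")
```

Here .+ keeps Jan 198 and \d+ gets only the final 7, because one digit is enough to satisfy \d+.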

Upvotes: 1

JosEduSol

Reputation: 5456

Both are fine structurally, but you need a space before the nested group.

Your second regex would work as written if the text were, for example, Jan1987. But the examples in the link you posted look like Jan 1987.

Try:

^(.+ (\d+))$

^([a-zA-Z_]+ (\d+))$
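
As a quick check, here is a small Python sketch (standard re module; the variable names are only illustrative) comparing the character-class pattern with and without the space, plus the dot version with the space:

```python
import re

original = r"^([a-zA-Z_]+(\d+))$"
with_space = r"^([a-zA-Z_]+ (\d+))$"
dot_with_space = r"^(.+ (\d+))$"

# The class [a-zA-Z_] cannot match the space, so the original fails here.
no_match = re.match(original, "Jan 1987")

# Without a space in the text, the original pattern matches.
joined = re.match(original, "Jan1987")

# With the literal space added, both suggested patterns capture the full year.
fixed = re.match(with_space, "Jan 1987")
fixed_dot = re.match(dot_with_space, "Jan 1987")

print(no_match)            # None
print(joined.groups())     # ('Jan1987', '1987')
print(fixed.groups())      # ('Jan 1987', '1987')
print(fixed_dot.groups())  # ('Jan 1987', '1987')
```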

Upvotes: 1
