user3204597

Reputation: 83

Regex capture group nesting

I've just started learning regex and I've been stuck on this lesson for a while now.

I don't quite understand why the two patterns below won't work.

^(.+(\d+))$

^([a-zA-Z_]+(\d+))$

Upvotes: 2

Views: 3485

Answers (2)

user557597

Reputation:

For ^(.+(\d+))$, put a capture group around the .+ as well, giving ^((.+)(\d+))$,
and look at what each group holds.
From the output below we see that .+ initially matches
the entire string Jan 1987, then the engine backtracks 1 character at a time until
it can satisfy the next sub-expression (\d+).
It gives back the character 7 and sees that this satisfies \d+.
Its job is done: all sub-expressions are satisfied, so it stops.

That's typical behavior of greedy quantifiers. Some things that could disrupt this behavior:

  1. Add the non-greedy (lazy) modifier ? to the quantifier:
    It looks like this: (.+?). Instead of initially matching
    the entire string, it matches 1 character at a time.
    After each character it checks whether the text that follows
    would match the next sub-expression \d+. Once it does, the engine leaves the current sub-expression .+? and continues with the next one, \d+.
    The process continues on to the next sub-expression, and so on.
    Any time it fails along the way, the engine goes back to the previous successful sub-expression, at its previous successful match position,
    then adjusts that match position (a greedy quantifier gives back a character, a lazy one takes one more) and repeats the whole process.
    It could go all the way back to the very first sub-expression with this
    process. This is called backtracking.

  2. Keep the greedy sub-expression .+ but at certain points add
    known literal character(s) just before the next sub-expression.
    These are waypoints for the engine to anchor on and are called pseudo-anchors.
    They limit where backtracking can stop. For example, you could have added a space literal
    just before the \d+ sub-expression, like ^(.+ (\d+))$. This forces the
    engine to backtrack from the last digit until it finds a space, letting \d+ consume all the year digits.

  3. Stay greedy, but reduce the allowed characters in sub-expressions.
    Instead of using the Dot meta-character (matches any character), specify
    a class of limited characters.
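
The three options can be compared side by side. Below is a minimal sketch using Python's standard re module and the sample string Jan 1987 from this answer. The labels and the helper inner_group are illustrative names only, and \D+ (any non-digit) is just one possible restricted class, chosen here as an assumption.

```python
import re

TEXT = "Jan 1987"

# Labels are illustrative; \D+ is one possible restricted class (an assumption).
PATTERNS = {
    "greedy (original)": r"^(.+(\d+))$",
    "1. lazy": r"^(.+?(\d+))$",
    "2. literal space": r"^(.+ (\d+))$",
    "3. restricted class": r"^(\D+(\d+))$",
}

def inner_group(pattern: str, text: str = TEXT):
    """Return what the nested digit group captured, or None if there is no match."""
    m = re.match(pattern, text)
    return m.group(2) if m else None

for label, pattern in PATTERNS.items():
    print(f"{label:22} {pattern:18} -> {inner_group(pattern)!r}")
```

With the greedy original the inner group holds only '7'; each of the three variants leaves '1987' for it.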

In general, use the Dot meta-character where either you don't care what is
there, or don't know what's there. Testing a dot is cheap, but a greedy .+ can
cost extra backtracking, so it is not guaranteed to be faster than a class.
When you do use it, try to have a strategy of pseudo-anchors in the
following sub-expression. This lets the engine zero in on the right
position in your subject text.

 ^ 
 (                             # (1 start)
      ( .+ )                        # (2)
      ( \d+ )                       # (3)
 )                             # (1 end)
 $ 

Output:

  **  Grp 0 -  ( pos 0 , len 8 ) 
 Jan 1987  
  **  Grp 1 -  ( pos 0 , len 8 ) 
 Jan 1987  
  **  Grp 2 -  ( pos 0 , len 7 ) 
 Jan 198  
  **  Grp 3 -  ( pos 7 , len 1 ) 
 7  
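
The same group listing can be reproduced with Python's re module, which may be handy for experimenting with variations. This is only a sketch: describe_groups is a hypothetical helper, and the tool that produced the output above is not known.

```python
import re

# Same expanded pattern as above; re.VERBOSE ignores layout whitespace and # comments.
PATTERN = re.compile(r"""
    ^
    (               # (1 start)
        ( .+ )      # (2)
        ( \d+ )     # (3)
    )               # (1 end)
    $
""", re.VERBOSE)

def describe_groups(text: str):
    """Return (group number, start position, length, captured text) for each group."""
    m = PATTERN.match(text)
    if m is None:
        return []
    return [(n, m.start(n), m.end(n) - m.start(n), m.group(n))
            for n in range(PATTERN.groups + 1)]

for n, pos, length, captured in describe_groups("Jan 1987"):
    print(f"Grp {n} - (pos {pos}, len {length}) {captured!r}")
```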

Upvotes: 1

JosEduSol

Reputation: 5456

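Both are close, but you need a space before the nested group.

Your second regex would work if the text were, for example, Jan1987 (the first would match it too, but its inner group would still capture only the last digit). But the examples in the link you posted look like: Jan 1987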

Try:

^(.+ (\d+))$

^([a-zA-Z_]+ (\d+))$
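
A quick way to check this is to run the original and the adjusted patterns against both spellings. The sketch below uses Python's re module; captures is a hypothetical helper name, and the two test strings are the ones mentioned in this answer.

```python
import re

ORIGINAL = [r"^(.+(\d+))$", r"^([a-zA-Z_]+(\d+))$"]
WITH_SPACE = [r"^(.+ (\d+))$", r"^([a-zA-Z_]+ (\d+))$"]

def captures(pattern: str, text: str):
    """Return the tuple of captured groups, or None when the pattern does not match."""
    m = re.match(pattern, text)
    return m.groups() if m else None

for text in ("Jan 1987", "Jan1987"):
    for pattern in ORIGINAL + WITH_SPACE:
        print(f"{text!r:12} {pattern:24} -> {captures(pattern, text)}")
```

On Jan 1987 the letter-class original does not match at all, because the space is not in [a-zA-Z_], while both versions with the space capture the full year.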

Upvotes: 1
