Alex Reinking
Alex Reinking

Reputation: 19916

Regex recursion: match two-character opening

I'm trying to use recursive regex to match bash-like variable expansions. Basically, I should be able to match strings like the following:

${FOO=BAR}
${FOO=${BAR=BAZ}}

But also handle inputs like this:

${FOO=\${BAR=BAZ}}
${FOO={${BAR=BAZ}}

In the first case, it should match up to, but excluding the final }, and the second case should match completely. This is because I'm trying to match the two-character opening ${ with the one-character closing }. Both the opening and closing should be able to be escaped.

I've gotten as far as the PCRE regex \$\{(?:[^{}]|(?R))*\}. But this doesn't handle escaping correctly. If I amend it to be (?:^|[^\\])(?:\\\\)*(\$\{(?:[^{}]|(?R))*\}), then only outermost escapes match correctly.

Can this be done with regex, or am I better off just writing a pyparsing parser?

Upvotes: 1

Views: 182

Answers (1)

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89557

You can try this pattern:

(?s)\\.(*SKIP)(*F)|(?s)(\${(?>[^$}\\]+|\\.|(?1))*})

online example

details:

(?s)
\\.               # an escaped character
(*SKIP)           # skip the matched content if the pattern fails later
(*F)              # force the pattern to fail
|
(?s)
(
    \${
    (?>           # open a atomic group
        [^$}\\]+  # all that is not a backslash, a $ or a }
      |           # OR
        \\.       # an escaped character
      |           # OR
        (?1)      # recurse to group 1
    )*            # repeat the atomic group zero or more times
    }               
)

The main idea is to avoid that an escaped dollar followed by an opening curly bracket is considered as an opening tag.

Notes: instead of using inline modifiers (?s) for each branchs, you can remove them and use the global modifier s.

To be fully rigorous, you can allow a $ not followed by an opening curly bracket in the content by adding the alternative \$(?!{) in the atomic group. (before the recursion)

About (*SKIP) and (*FAIL):

(*SKIP) and (*FAIL) are called backtracking control verbs.

When a pattern fails the default behaviour of an NFA regex engine is to use the backtracking mechanism. These verbs allow a control of this mechanism.

To be more concrete, the goal of the combo subpattern(*SKIP)(*FAIL) is to exclude the content matched by the subpattern from the match result, and to forbid the regex engine to try anything else with the matched substring. The substring is skipped.

You can see a full explanation here.

About atomic grouping:

An atomic group is a non-capturing group. The only difference is once the closing parenthesis reached, the regex engine is no more allowed to backtrack inside characters matched between parenthesis. It can only go to the position before the group. The atomic group makes the matched substring indivisible (atomic).

Here, the atomic group prevents catastrophic backtracking that may occurs with this kind of constructions (?:A+|B)+ if the pattern fails later.

Upvotes: 3

Related Questions