How do I match multi-character wrapper in string?

Question

I'll try to make this short and easy since I am having a hard time trying to put into words what exactly it is I am trying to do.

Basically I am trying to match tokens inside tokens or the entire token. The regex I have below works except for when there is a random { that isn't part of a token.

Example:

Tokens start with "{:" and end with ":}"
{:MyTokenFunction({:MyTokenParameter:}):} 
^--- WORKS as it matches "{:MyTokenParameter:}"

{:MyTokenFunction(5):} 
^--- WORKS as it matches "{:MyTokenFunction(5):}"

{:MyTokenFunction(random{string}):} 
^--- The "{" causes no matches, but should match the entire string.

Here's a colored example of what my current regex matches. The first two examples are correct, but the third should match in its entirety and doesn't match at all.

Here's the regex I am currently using which is having issues with the third example:

\{\:[^\{]+?\:\}

For the life of me I cannot figure out how to get around the { causing 0 matches.

I tried to use lookbehinds/aheads, but I wasn't having much luck. Although I would of course love a quick answer; I would love an explanation of what the regex is actually doing more. I have done a lot of searching to try and figure this out, but was unable to find a good example due to the fact that my "tokens" are wrapped by multiple characters and start/end aren't the same.

Thanks

zx81 · Accepted Answer

This is a lovely question because it requires us to balance opening and closing tokens, a task for which .NET happens to have a ready-made feature: balancing groups.

Let's look at this in separate pieces.

Why doesn't your regex work?

[^\{]+ means "match one or more characters that are not a {"

Clearly, that is never going to match the { in {string}, so the attempt fails before the regex can reach the closing :}
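To see the failure concretely, here is a minimal sketch that runs the pattern from the question against the three sample strings. It uses Python's re module purely for illustration (the question is about .NET, but this particular pattern uses no engine-specific features); the helper name first_match is made up for this example.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"\{\:[^\{]+?\:\}")


def first_match(text):
    """Return the first substring the original pattern matches, or None."""
    m = ORIGINAL.search(text)
    return m.group(0) if m else None


samples = [
    "{:MyTokenFunction({:MyTokenParameter:}):}",
    "{:MyTokenFunction(5):}",
    "{:MyTokenFunction(random{string}):}",
]

for s in samples:
    print(s, "->", first_match(s))
```

The first sample yields only the inner {:MyTokenParameter:}, the second yields the whole string, and the third yields None, because the negated class refuses to step over the lone { before a :} is found.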

Simple solution (with caveats)

{:.*:}

This will greedily match everything from the first {: to the last :} on the line. This works if you have only one token per line (and if you are not in DOTALL mode, which .NET calls RegexOptions.Singleline).

However, if you have two tokens on the same line, the regex will eat them both, along with everything between them, as a single match. And if you are in DOTALL mode, one match will run from the first {: in the input to the very last :}. Keep those caveats in mind.
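The caveats are easy to reproduce. The sketch below again uses Python's re module for illustration only; the braces are escaped to be safe, which does not change the meaning of the pattern, and the helper name greedy_match is an assumption of this example.

```python
import re


def greedy_match(text, dotall=False):
    """Return the first match of the greedy pattern, or None.

    dotall=True lets '.' match newlines, like DOTALL / single-line mode.
    """
    flags = re.DOTALL if dotall else 0
    m = re.search(r"\{:.*:\}", text, flags)
    return m.group(0) if m else None


# One token per line: the stray { is no longer a problem.
print(greedy_match("{:MyTokenFunction(random{string}):}"))

# Two tokens on one line: both are swallowed as a single match.
print(greedy_match("{:A:} and {:B:}"))

# Two lines: without DOTALL the match stops at the end of the first line,
# with DOTALL it runs on to the last :} in the input.
two_lines = "{:A:}\n{:B:}"
print(repr(greedy_match(two_lines)))
print(repr(greedy_match(two_lines, dotall=True)))
```

The last two calls return '{:A:}' and '{:A:}\n{:B:}' respectively, which is the "eats all the tokens" behaviour described above.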


More complex (but far stronger) solution

To avoid the problem above, we need to balance the braces. In Perl or PCRE, we would use recursion. Since we're in .NET, we'll use balancing groups, which are a beautiful feature of the .NET engine.

Here is one way to do it. That's a mouthful, but I'll explain it below.

(?:{:(?<counter>)(?:(?!{:|:}).)*)+(?::}(?<-counter>)(?:(?!{:|:}).)*)+(?<=:})(?(counter)(?!))


How does this work?

Here is the same regex, but in free-spacing mode, with comments. I would suggest using this version in code, as it makes it easier to maintain.

(?x) # free-spacing mode
(?:{:(?<counter>)(?:(?!{:|:}).)*)+ # match all the opening {: and increment counter
(?::}(?<-counter>)(?:(?!{:|:}).)*)+ # match all the closing :} and decrement counter
(?<=:}) # lookbehind: we must close with a :} (backtrack if we went too far)
(?(counter)(?!)) # if the counter has not been decremented to zero, then fail (ensuring balance)
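If the counter idea is easier to follow outside regex syntax, here is a hand-written sketch of the same bookkeeping in Python. Python's re module has no balancing groups, so this is not the .NET feature itself and not a drop-in equivalent of the regex above (for example, it simply returns nothing for unbalanced input); the function name find_balanced_tokens and its parameters are invented for this illustration.

```python
def find_balanced_tokens(text, open_tok="{:", close_tok=":}"):
    """Return the outermost balanced open_tok ... close_tok spans in text."""
    tokens = []
    depth = 0      # plays the role of the regex's "counter"
    start = None   # index where the current outermost token began
    i = 0
    while i < len(text):
        if text.startswith(open_tok, i):
            if depth == 0:
                start = i          # remember where the outermost token opens
            depth += 1             # like pushing onto (?<counter>)
            i += len(open_tok)
        elif depth > 0 and text.startswith(close_tok, i):
            depth -= 1             # like popping with (?<-counter>)
            i += len(close_tok)
            if depth == 0:         # balanced: counter is back to zero
                tokens.append(text[start:i])
        else:
            i += 1                 # ordinary character, including a lone { or }
    return tokens


print(find_balanced_tokens("{:MyTokenFunction(random{string}):}"))
print(find_balanced_tokens("{:MyTokenFunction({:MyTokenParameter:}):}"))
print(find_balanced_tokens("{:A:} and {:B:}"))
```

Each of the first two calls returns the whole input as a single token, and the third returns the two tokens separately, which is the behaviour the greedy version could not provide.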

Potential Tweaks

Depending on your needs, there are potential tweaks: for instance, if you want tokens to be able to span several lines. Just let us know.
