Reputation: 87

Do not match opening and closing parenthesis when a character sequence appears in middle

Got an interesting problem here for everyone to consider:

I am trying to parse and tokenize strings delimited by a "/" character, but only when the "/" is not between parentheses.

For instance:

Root/Branch1/branch2/leaf

Should be tokenized as: "Root", "Branch1", "branch2", "leaf"

Root/Branch1(subbranch1/subbranch2)/leaf

Should be tokenized as: "Root", "Branch1(subbranch1/subbranch2)", "leaf"

Root(branch1/branch2) text (branch3/branch4) text/Root(branch1/branch2)/Leaf

Should be tokenized as: "Root(branch1/branch2) text (branch3/branch4) text", "Root(branch1/branch2)", "Leaf".

I came up with the following expression which works great for all cases except ONE!

([^/()]*\((?<=\().*(?=\))\)[^/()]*)|([^/()]+)

The only case where this does not work is the following test condition:

Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf

This should be tokenized as: "Root(branch1/branch2)", "SubRoot", "SubRoot(branch3/branch4)", "Leaf"

The result I get instead consists of only one group that matches the whole line so it is not tokenizing it at all:

"Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf"

What is happening here is that, because the .* in my expression is greedy, it matches from the leftmost opening parenthesis "(" all the way to the last closing parenthesis ")" instead of stopping at the nearest closing parenthesis.
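To make the greedy behaviour concrete, here is a minimal sketch that isolates just the parenthesised part of the expression and runs it against the failing input, first with a greedy .* and then with a lazy .*? for contrast. The variable names are only illustrative, and the lazy form is shown as a contrast, not as a complete fix for the tokenizer.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf";

// Greedy: .* runs to the end of the line, then backtracks only as far as the LAST ')'.
string greedy = Regex.Match(input, @"\(.*\)").Value;

// Lazy: .*? grows one character at a time and stops at the FIRST ')'.
string lazy = Regex.Match(input, @"\(.*?\)").Value;

Console.WriteLine(greedy); // (branch1/branch2)/SubRoot/SubRoot(branch3/branch4)
Console.WriteLine(lazy);   // (branch1/branch2)
```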

Can any of you Regex gurus out there help me figure out how to add a small Regex piece to my existing expression to handle this additional case?

Root(branch1/branch2) Test (branch3/branch4)/SubRoot/SubRoot(branch5/branch6)/Leaf

Should be tokenized into groups as:

"Root(branch1/branch2) Test (branch3/branch4)"
"SubRoot"
"SubRoot(branch5/branch6)"
"Leaf"

Upvotes: 1

Answers (3)

Rawling

Reputation: 50114

The following uses balanced groups to capture each matching item with Regex.Matches, ensuring the closing / isn't matched when the brackets before it haven't balanced:

(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)

Bizarrely, this seems to perform similarly to Billy Moon's much simpler answer, even though this is overengineered (supporting multiple, possibly nested sets of brackets per token).
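As a rough sketch of how the match version might be called, the snippet below feeds the pattern to Regex.Matches and collects each match as a token. The helper name TokenizeBalanced and the second, nested test string are assumptions made for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper name; the pattern is the balancing-group expression quoted above.
static List<string> TokenizeBalanced(string input)
{
    const string pattern = @"(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)";
    var tokens = new List<string>();
    // Each match is one token: a run between slashes in which every '(' pushed
    // onto the "br" stack has been popped again by a matching ')'.
    foreach (Match m in Regex.Matches(input, pattern))
        tokens.Add(m.Value);
    return tokens;
}

// The failing case from the question.
var flat = TokenizeBalanced("Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf");
// An illustrative nested case, not taken from the question.
var nested = TokenizeBalanced("a(b/(c/d))/e");

Console.WriteLine(string.Join(" | ", flat));
Console.WriteLine(string.Join(" | ", nested));
```

The conditional (?(br)(?!)) is what forces a failure while any opening bracket is still on the stack, so a "/" inside brackets can never end a token.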

The following does something similar, but splits the string with Regex.Split (linebreaks added for clarity):

(?<=^(?(brb)(?!))(?:(?<-brb>\()|(?<brb>\))|[^()])*)
/
(?=(?:(?<bra>\()|(?<-bra>\))|[^()])*(?(bra)(?!))$)

This matches "any / where any brackets between the start of the string and the / are balanced, and any brackets between the / and the end of the string are balanced".

Note that in the lookbehind, the brb captures appear in reverse order from before - this is because a lookbehind apparently works right-to-left. (Thanks to Kobi for the answer that taught me this.)

This is much slower than the match version, but I wanted to work out how to do it anyway.

Upvotes: 0

Billy Moon

Reputation: 58531

Different approach, trying to avoid costly look-around assertions...

/(\(.+?\)|[^\/(]+)+/

With some comments...

/
(           # group things to be captured
  \(.+?\)   # 1 or more of anything in (escaped) brackets, un-greedily
|           # or ...
  [^\/(]+   # 1 or more, not slash, and not open bracket characters
)+          # repeat until done...
/
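For anyone trying this from C#, here is a minimal sketch of the same pattern passed to Regex.Matches. It assumes the surrounding /.../ delimiters are JavaScript-style literal syntax and drops them; the Tokenize helper and the nested test string are hypothetical additions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The same pattern without the /.../ delimiters.
// In a C# verbatim string the forward slash needs no escaping.
const string pattern = @"(\(.+?\)|[^/(]+)+";

// Hypothetical helper: every match is one token.
string[] Tokenize(string input) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

var tokens = Tokenize("Root(branch1/branch2) Test (branch3/branch4)/SubRoot/SubRoot(branch5/branch6)/Leaf");
Console.WriteLine(string.Join(" | ", tokens));

// Nested parentheses are outside what this pattern handles:
// the lazy .+? stops at the first ')' it can reach.
var nested = Tokenize("a((b)/c)/d");
Console.WriteLine(string.Join(" | ", nested));
```

The first call yields the four tokens asked for in the question. The second shows the trade-off for the simpler pattern: with nested brackets the token is cut at the inner ")", giving "a((b)", "c)" and "d".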

Upvotes: 1

James Curran

Reputation: 103495

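// Requires: using System.Collections.Generic; using System.Text;
List<string> Tokenize(string strInput)
{
  var sb = new StringBuilder();
  var tokens = new List<string>();
  bool inParens = false;
  foreach (var c in strInput)
  {
      if (inParens)
      {
          // Inside parentheses: keep everything, including any '/'.
          if (c == ')')
              inParens = false;
          sb.Append(c);
      }
      else if (c == '(')
      {
          inParens = true;
          sb.Append(c);   // keep the parenthesis as part of the token
      }
      else if (c == '/')
      {
          // A slash outside parentheses ends the current token.
          tokens.Add(sb.ToString());
          sb.Length = 0;
      }
      else
          sb.Append(c);
  }
  // Flush the final token, which has no trailing slash.
  if (sb.Length > 0)
      tokens.Add(sb.ToString());

  return tokens;
}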

That's untested, but it should work (and will almost certainly be much faster than the regex).
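If nested parentheses ever matter, one possible variation on the same loop is to count nesting depth instead of keeping a single flag. This is only a sketch under that assumption: the name TokenizeNested and the nested test string are made up for illustration, and it is not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical variant: tracks nesting depth with a counter rather than a bool.
static List<string> TokenizeNested(string input)
{
    var sb = new StringBuilder();
    var tokens = new List<string>();
    int depth = 0;
    foreach (char c in input)
    {
        if (c == '/' && depth == 0)
        {
            // A slash outside all parentheses ends the current token.
            tokens.Add(sb.ToString());
            sb.Clear();
            continue;
        }
        if (c == '(') depth++;
        else if (c == ')' && depth > 0) depth--;   // ignore a stray ')'
        sb.Append(c);                               // parentheses stay in the token
    }
    if (sb.Length > 0)
        tokens.Add(sb.ToString());
    return tokens;
}

var tokens = TokenizeNested("Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf");
var nested = TokenizeNested("a(b/(c/d))/e");
Console.WriteLine(string.Join(" | ", tokens));
Console.WriteLine(string.Join(" | ", nested));
```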

Upvotes: 1
