Reputation: 87
Got an interesting problem here for everyone to consider:
I am trying to parse and tokenize strings delimited by a "/" character, but only when the "/" is not between parentheses.
For instance:
Root/Branch1/branch2/leaf
Should be tokenized as: "Root", "Branch1", "branch2", "leaf"
Root/Branch1(subbranch1/subbranch2)/leaf
Should be tokenized as: "Root", "Branch1(subbranch1/subbranch2)", "leaf"
Root(branch1/branch2) text (branch3/branch4) text/Root(branch1/branch2)/Leaf
Should be tokenized as: "Root(branch1/branch2) text (branch3/branch4) text", "Root(branch1/branch2)", "Leaf".
I came up with the following expression which works great for all cases except ONE!
([^/()]*\((?<=\().*(?=\))\)[^/()]*)|([^/()]+)
The only case where this does not work is the following test condition:
Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf
This should be tokenized as: "Root(branch1/branch2)", "SubRoot", "SubRoot(branch3/branch4)", "Leaf"
The result I get instead consists of only one group that matches the whole line so it is not tokenizing it at all:
"Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf"
What is happening here is that, because the .* in the middle of the expression is greedy, it matches the leftmost opening parenthesis "(" with the last closing parenthesis ")" instead of stopping at the parenthesis that actually closes it.
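To make the greedy behaviour concrete, here is a minimal sketch. It uses two simplified patterns, `\(.*\)` and `\(.*?\)`, rather than the full expression above, and assumes C# 9 or later so that top-level statements are available.

```csharp
using System;
using System.Text.RegularExpressions;

const string input =
    "Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf";

// Greedy: .* runs to the end of the line and backtracks to the LAST ")".
string greedy = Regex.Match(input, @"\(.*\)").Value;

// Lazy: .*? stops at the FIRST ")" it can reach.
string lazy = Regex.Match(input, @"\(.*?\)").Value;

Console.WriteLine(greedy); // (branch1/branch2)/SubRoot/SubRoot(branch3/branch4)
Console.WriteLine(lazy);   // (branch1/branch2)
```

The greedy match swallows the "/" delimiters between the two parenthesized groups, which is why the whole line comes back as a single token.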
Can any of you regex gurus help me figure out how to add a small piece to my existing expression to handle this additional case?
Root(branch1/branch2) Test (branch3/branch4)/SubRoot/SubRoot(branch5/branch6)/Leaf
Should be tokenized into groups as:
"Root(branch1/branch2) Test (branch3/branch4)" "SubRoot" "SubRoot(branch5/branch6)" "Leaf"
Upvotes: 1
Views: 11121
Reputation: 50114
The following uses balanced groups to capture each matching item with Regex.Matches, ensuring the closing / isn't matched when the brackets before it haven't balanced:
(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)
Bizarrely, this seems to perform similarly to Billy Moon's much simpler answer, even though this is overengineered (supporting multiple, possibly nested sets of brackets per token).
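As a rough usage sketch for the Regex.Matches pattern above: the helper name `TokenizeBalanced` is made up for illustration, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the value of every match of the pattern.
static List<string> TokenizeBalanced(string input)
{
    // Pattern copied from the answer. "br" acts as a stack:
    // pushed on "(", popped on ")", and required to be empty at the end.
    const string pattern =
        @"(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)";

    return Regex.Matches(input, pattern)
                .Cast<Match>()
                .Select(m => m.Value)
                .ToList();
}

var tokens = TokenizeBalanced(
    "Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf");

// Expected to print:
// Root(branch1/branch2) | SubRoot | SubRoot(branch3/branch4) | Leaf
Console.WriteLine(string.Join(" | ", tokens));
```

Because the match is only allowed to end where the `br` stack is empty, a "/" inside parentheses can never terminate a token.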
The following does something similar, but splits the string with Regex.Split (linebreaks added for clarity):
(?<=^(?(brb)(?!))(?:(?<-brb>\()|(?<brb>\))|[^()])*)
/
(?=(?:(?<bra>\()|(?<-bra>\))|[^()])*(?(bra)(?!))$)
This matches "any / where any brackets between the start of the string and the / are balanced, and any brackets between the / and the end of the string are balanced".
Note that in the lookbehind, the brb captures appear in reverse order from before - this is because a lookbehind apparently works right-to-left. (Thanks to Kobi for the answer that taught me this.)
This is much slower than the match version, but I wanted to work out how to do it anyway.
Upvotes: 0
Reputation: 58531
Different approach, trying to avoid costly look-around assertions...
/(\(.+?\)|[^\/(]+)+/
With some comments...
/
(                # group things to be captured
  \(.+?\)        # 1 or more of anything in (escaped) brackets, un-greedily
  |              # or ...
  [^\/(]+        # 1 or more, not slash, and not open bracket characters
)+               # repeat until done...
/
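Here is a minimal sketch of how this pattern might be used from C#. The surrounding /.../ delimiters are dropped because .NET does not use them, and the slash is left unescaped for the same reason. The helper name `TokenizeSimple` is made up for illustration, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: each match of the pattern is one token.
static List<string> TokenizeSimple(string input)
{
    // Either a lazily matched "(...)" group, or a run of characters
    // that are neither "/" nor "(", repeated one or more times.
    const string pattern = @"(\(.+?\)|[^/(]+)+";

    return Regex.Matches(input, pattern)
                .Cast<Match>()
                .Select(m => m.Value)
                .ToList();
}

var tokens = TokenizeSimple(
    "Root(branch1/branch2) Test (branch3/branch4)/SubRoot/SubRoot(branch5/branch6)/Leaf");

// Expected to print:
// Root(branch1/branch2) Test (branch3/branch4) | SubRoot | SubRoot(branch5/branch6) | Leaf
Console.WriteLine(string.Join(" | ", tokens));
```

Note that this pattern does not track nesting: the lazy `.+?` stops at the first ")", so a "/" after an inner closing parenthesis would still split the token.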
Upvotes: 1
Reputation: 103495
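// Needs: using System.Collections.Generic; using System.Text;
List<string> Tokenize(string strInput)
{
    var sb = new StringBuilder();
    var tokens = new List<string>();
    bool inParens = false;
    foreach (var c in strInput)
    {
        if (inParens)
        {
            // Inside parentheses every character, including '/', stays in the token.
            sb.Append(c);
            if (c == ')')
                inParens = false;
        }
        else if (c == '(')
        {
            // Keep the opening parenthesis as part of the token.
            sb.Append(c);
            inParens = true;
        }
        else if (c == '/')
        {
            // A slash outside parentheses ends the current token.
            tokens.Add(sb.ToString());
            sb.Length = 0;
        }
        else
            sb.Append(c);
    }
    // Flush the final token, if any.
    if (sb.Length > 0)
        tokens.Add(sb.ToString());
    return tokens;
}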
That's untested, but it should work (and will almost certainly be much faster than the regex).
Upvotes: 1