Reputation: 1480
I need to split text (sql query) by each comma which is not between parenthesis.
Example (I marked commas which should be included in split):
a."Id", //<- this comma
a."Description", //<- this comma
UJsonObject(
fepv."Id", //<- NOT this comma
fepv."SystemName", //<- NOT this comma
string_agg(
translations."Translations", ',' //<- NOT this comma (here can be some nested parenthesis also)
) as "Translations"
) as "Translations", //<- this comma
b."DataSource", //<- this comma
a."Name", //<- this comma
a."Value"
I found universal solution here: https://regex101.com/r/6lQKjP/2 but it appears that this solution is not working in dotnet.
I would like to use Regex.Split, but if this case can be satisfied by Regex.Matches I will be happy too. Also I know I can write my own parser, but I read that simple cases (which not extract nested parenthesis) can be handled via Regex.
Upvotes: 3
Views: 609
Reputation: 138037
You can match your tokens in a single pass using a .NET regular expression Balancing Groups:
(?>
(?<S>\() # if you see an open parentheses, push it to the stack
|
(?<-S>\)) # match a closing parentheses when the stack has a paired open parentheses
|
[^,()] # match any character except parentheses or commas
|
(?(S),|(?!)) # if we're already inside parentheses, we're allowed to match a comma
)+
(?(S)(?!)) # at the end, make sure there are no extra open parentheses we didn't close.
You can get the tokens as:
var matches = Regex.Matches(input, pattern, RegexOptions.IgnorePatternWhitespace)
.Select(m => m.Value).ToList();
This approach is a bit complicated, but the syntax it supports can be expanded without too much trouble. For example, we can add support for -- single line SQL comments
comments and 'SQL strings'
:
(?>
(?<S>\()
|
(?<-S>\))
|
--.* # match from "--" to the end of the line
|
'[^']*(?:''[^']*)*' # match SQL string, single quote, escaped by two single quotes
|
[^,()]
|
(?(S),|(?!))
)+
(?(S)(?!))
Upvotes: 1
Reputation: 110685
Please regard this as an extended comment. In pseudo-code, the commas not enclosed in parentheses can be identified as follows:
commas = []
n = 0
for each index i of string
c = char at index i of string
if c == '('
increase n by 1
elsif c == ')'
decrease n by 1 if n > 0, else raise unbalanced parens exception
elsif c == ','
add i to commas if n equals 0
end
end
raise unbalanced parens exception if n > 0
The array comma
will contain the indices of the commas on which the string is to be split. Splitting the string at given indices is straightforward.
The variable n
equals the number of left parentheses that are not yet matched by a right parentheses. The code also confirms that the parentheses are balanced.
Upvotes: 0
Reputation: 626926
This PCRE regex - (\((?:[^()]++|(?1))*\))(*SKIP)(*F)|,
- uses recursion, .NET does not support it, but there is a way to do the same thing using balancing construct. The From the PCRE verbs - (*SKIP)
and (*FAIL)
- only (*FAIL)
can be written as (?!)
(it causes an unconditional fail at the place where it stands), .NET does not support skipping a match at a specific position and resuming search from that failed position.
I suggest replacing all commas that are not inside nested parentheses with some temporary value, and then splitting the string with that value:
var s = Regex.Replace(text, @"\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|(,)", m =>
m.Groups[1].Success ? "___temp___" : m.Value);
var results = s.Split("___temp___");
Details
\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)
- a pattern that matches nested parentheses:
\(
- a (
char(?>[^()]+|(?<o>)\(|(?<-o>)\))*
- 0 or more occurrences of
[^()]+|
- 1+ chars other than (
and )
or(?<o>)\(|
- a (
and a value is pushed on to the Group "o" stack(?<-o>)\)
- a )
and a value is popped from the Group "o" stack(?(o)(?!))
- a conditional construct that fails the match if Group "o" stack is not empty\)
- a )
char|
- or(,)
- Group 1: a commaOnly the comma captured in Group 1 is replaced with a temp substring since the m.Groups[1].Success
check is performed in the match evaluator part.
Upvotes: 5