Adam Mrozek

Reputation: 1480

Regex split by comma not inside parentheses (.NET)

I need to split text (a SQL query) at each comma that is not inside parentheses.

Example (I marked the commas that should be used as split points):

a."Id",                //<- this comma
a."Description",       //<- this comma
UJsonObject(
   fepv."Id",          //<- NOT this comma
   fepv."SystemName",  //<- NOT this comma
   string_agg(
        translations."Translations", ',' //<- NOT this comma (nested parentheses can also appear here)
   ) as "Translations"
) as "Translations",   //<- this comma
b."DataSource",        //<- this comma
a."Name",              //<- this comma
a."Value"

I found a general solution here: https://regex101.com/r/6lQKjP/2 but it does not appear to work in .NET.

I would like to use Regex.Split, but a solution based on Regex.Matches would be fine too. I know I could write my own parser, but I have read that simple cases (which do not need to extract the nested parentheses) can be handled with a regex.

Upvotes: 3

Views: 609

Answers (3)

Kobi

Reputation: 138037

You can match your tokens in a single pass using .NET regular expression balancing groups:

(?>
    (?<S>\()      # if you see an open parenthesis, push it to the stack
    |
    (?<-S>\))     # match a closing parenthesis when the stack has a paired open parenthesis
    |
    [^,()]        # match any character except parentheses or commas
    |
    (?(S),|(?!))  # if we're already inside parentheses, we're allowed to match a comma
)+
(?(S)(?!))    # at the end, make sure there are no extra open parentheses we didn't close.

You can get the tokens as:

var matches = Regex.Matches(input, pattern, RegexOptions.IgnorePatternWhitespace)
                   .Select(m => m.Value).ToList();

Working example in Sharp Labs
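
For readers who cannot open the link, here is a minimal self-contained sketch of the snippet above. It is not the linked example: Tokenize is a hypothetical helper name, the sample input is made up, and the Cast<Match>() and Trim() calls are additions (the first so the code also compiles on older frameworks, the second because each match keeps the whitespace that follows the previous comma).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// The pattern from the answer above; only the comments are shortened.
const string Pattern = @"
(?>
    (?<S>\()      # push when an opening parenthesis is seen
    |
    (?<-S>\))     # pop when a closing parenthesis is seen
    |
    [^,()]        # any character other than a comma or a parenthesis
    |
    (?(S),|(?!))  # a comma is allowed only while the stack is non-empty
)+
(?(S)(?!))        # fail if an opening parenthesis was left unclosed
";

// Hypothetical helper: returns the top-level tokens, trimmed.
List<string> Tokenize(string input) =>
    Regex.Matches(input, Pattern, RegexOptions.IgnorePatternWhitespace)
         .Cast<Match>()
         .Select(m => m.Value.Trim())
         .Where(t => t.Length > 0)
         .ToList();

// Made-up sample input with two levels of nesting.
var tokens = Tokenize("a, f(b, g(c, d)) as x, e");
Console.WriteLine(string.Join(" | ", tokens));
```

Each match is one token, so no separate split step is needed.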

This approach is a bit complicated, but the syntax it supports can be expanded without too much trouble. For example, we can add support for -- single line SQL comments and 'SQL strings':

(?>
    (?<S>\()
    |
    (?<-S>\))
    |
    --.*                 # match from "--" to the end of the line
    |
    '[^']*(?:''[^']*)*'  # match SQL string, single quote, escaped by two single quotes
    |
    [^,()]
    |
    (?(S),|(?!))
)+
(?(S)(?!))

Working example
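
As a rough check of the extended pattern, the sketch below feeds it a made-up input in which a comma appears inside a top-level string literal and inside a comment. TokenizeSql is a hypothetical helper name, and Trim() and Cast<Match>() are additions for readability and compatibility.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// The extended pattern from the answer above; only the comments are shortened.
const string SqlAwarePattern = @"
(?>
    (?<S>\()
    |
    (?<-S>\))
    |
    --.*                 # SQL comment up to the end of the line
    |
    '[^']*(?:''[^']*)*'  # SQL string literal, two quotes escape a quote
    |
    [^,()]
    |
    (?(S),|(?!))
)+
(?(S)(?!))
";

// Hypothetical helper: returns the top-level tokens, trimmed.
List<string> TokenizeSql(string input) =>
    Regex.Matches(input, SqlAwarePattern, RegexOptions.IgnorePatternWhitespace)
         .Cast<Match>()
         .Select(m => m.Value.Trim())
         .Where(t => t.Length > 0)
         .ToList();

// Made-up input: the commas inside ',' and inside the comment must not split.
var sqlTokens = TokenizeSql("a, ',' as sep, -- note, with a comma\nb");
foreach (var token in sqlTokens)
    Console.WriteLine("[" + token + "]");
```

Note that the comment stays attached to the token that follows it, because the pattern consumes comments rather than discarding them.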

Upvotes: 1

Cary Swoveland

Reputation: 110685

Please regard this as an extended comment. In pseudo-code, the commas not enclosed in parentheses can be identified as follows:

commas = []
n = 0
for each index i of string
   c = char at index i of string
   if c == '('
     increase n by 1
   elsif c == ')'
     decrease n by 1 if n > 0, else raise unbalanced parens exception
   elsif c == ','
     add i to commas if n equals 0
   end
end
raise unbalanced parens exception if n > 0

The array commas will contain the indices of the commas at which the string is to be split. Splitting the string at given indices is straightforward.

The variable n equals the number of left parentheses that are not yet matched by a right parenthesis. The code also confirms that the parentheses are balanced.
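
The pseudo-code translates to C# fairly directly. The following is only a sketch under a few assumptions: SplitTopLevel is a hypothetical helper name, the parts are collected as the scan goes instead of storing comma indices first, FormatException is an arbitrary choice of exception type, and each part is trimmed. Like the pseudo-code, it does not treat quoted strings or comments specially.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits text at commas that are not inside parentheses.
List<string> SplitTopLevel(string text)
{
    var parts = new List<string>();
    int depth = 0;  // left parentheses not yet matched by a right parenthesis
    int start = 0;  // index where the current part begins

    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];
        if (c == '(')
        {
            depth++;
        }
        else if (c == ')')
        {
            if (depth == 0)
                throw new FormatException("Unbalanced ')' at index " + i + ".");
            depth--;
        }
        else if (c == ',' && depth == 0)
        {
            // Top-level comma: close the current part and start the next one.
            parts.Add(text.Substring(start, i - start).Trim());
            start = i + 1;
        }
    }

    if (depth > 0)
        throw new FormatException("Unbalanced '(': a closing parenthesis is missing.");

    parts.Add(text.Substring(start).Trim());
    return parts;
}

// Made-up sample input with two levels of nesting.
var columns = SplitTopLevel("a, f(b, g(c, d)) as x, e");
Console.WriteLine(string.Join(" | ", columns));
```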

Upvotes: 0

Wiktor Stribiżew

Reputation: 626926

This PCRE regex - (\((?:[^()]++|(?1))*\))(*SKIP)(*F)|, - uses recursion, which .NET does not support, but the same thing can be done with a balancing construct. Of the PCRE verbs - (*SKIP) and (*FAIL) - only (*FAIL) can be written as (?!) (it causes an unconditional failure at the place where it stands); .NET does not support skipping a match at a specific position and resuming the search from that failed position.

I suggest replacing all commas that are not inside nested parentheses with some temporary value, and then splitting the string with that value:

var s = Regex.Replace(text, @"\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|(,)", m =>
     m.Groups[1].Success ? "___temp___" : m.Value);
var results = s.Split("___temp___");

Details

  • \((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\) - a pattern that matches nested parentheses:
    • \( - a ( char
    • (?>[^()]+|(?<o>)\(|(?<-o>)\))* - 0 or more occurrences of
      • [^()]+| - 1+ chars other than ( and ) or
      • (?<o>)\(| - a ( and a value is pushed on to the Group "o" stack
      • (?<-o>)\) - a ) and a value is popped from the Group "o" stack
    • (?(o)(?!)) - a conditional construct that fails the match if Group "o" stack is not empty
    • \) - a ) char
  • | - or
  • (,) - Group 1: a comma

Only the comma captured in Group 1 is replaced with the temp substring, because the match evaluator checks m.Groups[1].Success; a match of a whole parenthesized block is put back unchanged.
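
Here is a self-contained sketch of the replace-then-split approach, run on a made-up input. SplitOutsideParens and Placeholder are hypothetical names, the placeholder is assumed never to occur in the input, and the Split overload taking a string array is used here only because it exists on both .NET Framework and newer runtimes. Trimming is also an addition.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed never to appear in the input text.
const string Placeholder = "___temp___";

// Hypothetical helper wrapping the two steps from the answer above.
string[] SplitOutsideParens(string text)
{
    // Step 1: replace only commas captured in Group 1, i.e. those matched
    // outside a balanced parenthesized block; blocks are put back as-is.
    string marked = Regex.Replace(
        text,
        @"\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|(,)",
        m => m.Groups[1].Success ? Placeholder : m.Value);

    // Step 2: split on the placeholder and trim each part.
    return marked.Split(new[] { Placeholder }, StringSplitOptions.None)
                 .Select(p => p.Trim())
                 .ToArray();
}

// Made-up sample input with two levels of nesting.
var results = SplitOutsideParens("a, f(b, g(c, d)) as x, e");
Console.WriteLine(string.Join(" | ", results));
```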

Upvotes: 5
