Miguel Noronha

Reputation: 31

Catching a pattern, but ignoring it within quotes

What I need to do with a C# regex is split a string wherever a certain pattern occurs, but ignore that pattern when it is enclosed in double quotes.

Example:

string text = "abc , def , a\" , \"d , oioi";
string pattern = "[ \t]*,[ \t]*";

string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (3 splits, 4 strings):

    {"abc",
     "def",
     "a\" , \"d",
     "oioi"}

Actual result (4 splits, 5 strings):

    {"abc",
     "def",
     "a\"",
     "\"d",
     "oioi"}

Another example:

string text = "a%2% 6y % \"ad%t6%&\" %(7y) %";
string pattern = "%";

string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (5 splits, 6 strings):

    {"a",
     "2",
     " 6y ",
     " \"ad%t6%&\" ",
     "(7y) ",
     ""}

Actual result (7 splits, 8 strings):

    {"a",
     "2",
     " 6y ",
     " \"ad",
     "t6",
     "&\" ",
     "(7y) ",
     ""}

A third example, showing a tricky split where only the first quoted occurrence should be ignored:

string text = "!!\"!!\"!!\"";
string pattern = "!!";

string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (2 splits, 3 strings):

    {"",
     "\"!!\"",
     "\""}

Actual result (3 splits, 4 strings):

    {"",
     "\"",
     "\"",
     "\""}

So, how do I derive from pattern a new pattern that achieves the desired result?

Sidenote: If you're going to mark someone's question as duplicate (and I have nothing against that), at least point them to the right answer, not to some random post (yes, I'm looking at you, Mr. Avinash Raj)...

Upvotes: 1

Views: 459

Answers (2)

ΩmegaMan

Reputation: 31576

I think this is a two-step process, and trying to do it with a single regex is overthinking it.


Steps

  1. Remove all quotes from the string.
  2. Split on the target character(s).

Example of Process

For step 2, I will split on the comma.

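// Requires: using System.Linq; using System.Text.RegularExpressions;
var data = string.Format("abc , def , a{0}, {0}d , oioi", "\"");

// `\x22` is the hex escape for a double quote ("), which is easier to read in a C# string.
var stage1 = Regex.Replace(data, @"\x22", string.Empty);

// abc , def , a", "d , oioi
// becomes
// abc , def , a, d , oioi

// Capture each run of characters that is neither whitespace nor a comma.
var result = Regex.Matches(stage1, @"([^\s,]+)[\s,]*")
                  .OfType<Match>()
                  .Select(mt => mt.Groups[1].Value)
                  .ToList();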

Result

Five strings: abc, def, a, d, oioi

Upvotes: 0

Casimir et Hippolyte

Reputation: 89547

The rules are more or less those of a CSV line, except that:

  • the delimiter can be a single character, but also a string or a pattern (in these last two cases, an item must be trimmed if it starts with the last possible tokens of the delimiter pattern or ends with its first possible tokens),
  • an orphan quote is allowed in the last item.

First, once you need to separate items with even slightly advanced rules, the split method is no longer a good choice. It is only handy for simple situations, not for your case. (Even without orphan quotes, splitting with ,(?=(?:[^"]*"[^"]*")*[^"]*$) is a very bad idea: the lookahead rescans the rest of the string at every comma, and the number of steps needed to parse the string can grow exponentially with its size.)

The other approach is to capture the items. That is simpler and faster. (Bonus: it checks the format of the whole string at the same time.)

Here is a general way to do it:

^
(?>
  (?:delimiter | start_of_the_string)
  (
      simple_part
      (?>
          (?: quotes | delim_first_letter_1 | delim_first_letter_2 | etc. )
          simple_part
      )*
  )
)+
$

Example with \s*,\s* as delimiter:

^
# atomic group for one delimiter and one item
(?>
    (?: \s*,\s* | ^ ) # delimiter or start of the string
                      # (optionally change "^" to "^ \s*" to trim the first item)

    # capture group 1 for the item 
    (   # simple part of the item (maybe empty):
        [^\s,"]* # anything that is neither a quote nor a possible first
                 # character of the delimiter
        # edge case followed by a simple part
        (?>
            (?: # edge cases
                " [^"]* (?:"|$) # a quoted part or an orphan quote in the last item (*)
              |   # OR
                (?> \s+ ) # start of the delimiter
                (?!,)     # but not the delimiter
            )

            [^\s,"]* # simple part
        )*
    )
)+
$

demo (click on the table link)

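The pattern is designed for the Regex.Match method, since it describes the whole string. All items are available in group 1 (through its Captures collection), since the .NET regex flavor stores every capture of a repeated group. Because the pattern is written in free-spacing form, it needs RegexOptions.IgnorePatternWhitespace.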
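As a rough illustration of reading the items back, here is a minimal sketch that feeds the \s*,\s* pattern above to Regex.Match and collects the captures of group 1. The class name QuotedSplit, the method name Split, and the choice to throw a FormatException on a failed match are assumptions made for this example, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedSplit   // hypothetical name, for illustration only
{
    // The \s*,\s* pattern from above, with the comments stripped.
    // In a verbatim string, "" stands for one double quote.
    private const string Pattern = @"
        ^
        (?>
            (?: \s*,\s* | ^ )
            (
                [^\s,""]*
                (?>
                    (?: "" [^""]* (?:""|$) | (?> \s+ ) (?!,) )
                    [^\s,""]*
                )*
            )
        )+
        $";

    public static string[] Split(string text)
    {
        // IgnorePatternWhitespace is required because the pattern is laid out in free-spacing form.
        Match m = Regex.Match(text, Pattern, RegexOptions.IgnorePatternWhitespace);
        if (!m.Success)
            throw new FormatException("Input does not fit the expected format."); // assumed error handling

        // .NET keeps every capture of a repeated group; each one is an item.
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }
}
```

With the first example from the question, QuotedSplit.Split("abc , def , a\" , \"d , oioi") should yield the four wanted strings, with the quoted comma left inside the third item.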

This example can be easily adapted to all cases.
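For instance, the second example of the question (delimiter %) might be handled as sketched below. With a one-character delimiter there is no "start of the delimiter that is not a delimiter" edge case, so only the quoted part remains. The names PercentSplit and Split are hypothetical, and this sketch does not try to cover input that begins with the delimiter.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PercentSplit   // hypothetical name, for illustration only
{
    // General template with % as delimiter.
    // In a verbatim string, "" stands for one double quote.
    private const string Pattern = @"
        ^
        (?>
            (?: % | ^ )                 # delimiter or start of the string
            (
                [^%""]*                 # simple part
                (?>
                    "" [^""]* (?:""|$)  # quoted part, or orphan quote in the last item
                    [^%""]*             # simple part
                )*
            )
        )+
        $";

    public static string[] Split(string text)
    {
        Match m = Regex.Match(text, Pattern, RegexOptions.IgnorePatternWhitespace);
        if (!m.Success)
            throw new FormatException("Input does not fit the expected format."); // assumed error handling

        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }
}
```

Called on "a%2% 6y % \"ad%t6%&\" %(7y) %", this should return the six wanted strings, including the empty last item.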

(*) If you want to allow escaped quotes inside quoted parts, you can apply the simple_part (?: edge_case simple_part)* structure once more in place of " [^"]* (?:"|$),
i.e.: "[^\\"]* (?: \\. [^\\"]*)* (?:"|$)
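A sketch of that variant, plugged into the \s*,\s* pattern, could look like this. It assumes a backslash escapes the next character only inside a quoted part. The names EscapedQuoteSplit and Split are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapedQuoteSplit   // hypothetical name, for illustration only
{
    // Same as the \s*,\s* pattern, except that the quoted part accepts
    // backslash-escaped characters (so \" does not close the quoted part).
    private const string Pattern = @"
        ^
        (?>
            (?: \s*,\s* | ^ )
            (
                [^\s,""]*
                (?>
                    (?:
                        "" [^\\""]* (?: \\. [^\\""]* )* (?:""|$)
                      |
                        (?> \s+ ) (?!,)
                    )
                    [^\s,""]*
                )*
            )
        )+
        $";

    public static string[] Split(string text)
    {
        Match m = Regex.Match(text, Pattern, RegexOptions.IgnorePatternWhitespace);
        if (!m.Success)
            throw new FormatException("Input does not fit the expected format."); // assumed error handling

        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }
}
```

For the input a , "b \" , c" , d the escaped quote stays inside the second item, so the comma after it is not treated as a delimiter.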

Upvotes: 2
