Reputation: 31
So, what I need to do in c# regex is basically split a string whenever I find a certain pattern, but ignore that pattern if it is surrounded by double quotes in the string.
Example:
string text = "abc , def , a\" , \"d , oioi";
string pattern = "[ \t]*,[ \t]*";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (3 splits, 4 strings):
{"abc",
"def",
"a\" , \"d",
"oioi"}
Actual result (4 splits, 5 strings):
{"abc",
"def",
"a\"",
"\"d",
"oioi"}
Another example:
string text = "a%2% 6y % \"ad%t6%&\" %(7y) %";
string pattern = "%";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (5 splits, 6 strings):
{"a",
"2",
" 6y ",
" \"ad%t6%&\" ",
"(7y) ",
""}
Actual result (7 splits, 8 strings):
{"a",
"2",
" 6y ",
"\"ad",
"t6",
"&\" ",
"(7y) ",
""}
A 3rd example, to exemplify a tricky split where only the first case should be ignored:
string text = "!!\"!!\"!!\"";
string pattern = "!!";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (2 splits, 3 strings):
{"",
"\"!!\"",
"\""}
Actual result (3 splits, 4 strings):
{"",
"\"",
"\"",
"\"",}
So, how do I move from pattern to a new pattern that achieves the desired result?
Sidenote: If you're going to mark someone's question as duplicate (and I have nothing against that), at least point them to the right answer, not to some random post (yes, I'm looking at you, Mr. Avinash Raj)...
Upvotes: 1
Views: 459
Reputation: 31576
I think this is a two step process and it has been overthought trying to make it a one step regex.
Steps
Example of Process
I will split on the ,
for step 2.
var data = string.Format("abc , def , a{0}, {0}d , oioi", "\"");
// `\x22` is hex for a quote (") which for easier reading in C# editing.
var stage1 = Regex.Replace(data, @"\x22", string.Empty);
// abc , def , a", "d , oioi
// becomes
// abc , def , a, d , oioi
Regex.Matches(stage1, @"([^\s,]+)[\s,]*")
.OfType<Match>()
.Select(mt => mt.Groups[1].Value )
Result
Upvotes: 0
Reputation: 89547
The rules are more or less like in a csv line except that:
First, when you want to separate items (to split) with a little advanced rules, the split method is no more a good choice. The split method is only handy for simple situations, not for your case. (even without orphan quotes, using split with ,(?=(?:[^"]*"[^"]*")*[^"]*$)
is a very bad idea since the number of steps needed to parse the string grows exponentially with the string size.)
The other approach consists to capture items. That is more simple and faster. (bonus: it checks the format of the whole string at the same time).
Here is a general way to do it:
^
(?>
(?:delimiter | start_of_the_string)
(
simple_part
(?>
(?: quotes | delim_first_letter_1 | delim_first_letter_2 | etc. )
simple_part
)*
)
)+
$
Example with \s*,\s*
as delimiter:
^
# non-capturing group for one delimiter and one item
(?>
(?: \s*,\s* | ^ ) # delimiter or start of the string
# (eventually change "^" to "^ \s*" to trim the first item)
# capture group 1 for the item
( # simple part of the item (maybe empty):
[^\s,"]* # all that is not the quote character or one of the possible first
# character of the delimiter
# edge case followed by a simple part
(?>
(?: # edge cases
" [^"]* (?:"|$) # a quoted part or an orphan quote in the last item (*)
| # OR
(?> \s+ ) # start of the delimiter
(?!,) # but not the delimiter
)
[^\s,"]* # simple part
)*
)
)+
$
demo (click on the table link)
The pattern is designed for the Regex.Match
method since it describes all the string. All items are available in group 1 since the .net regex flavor is able to store repeated capture groups.
This example can be easily adapted to all cases.
(*) if you want to allow escaped quotes inside quoted parts, you can use one more time simple_part (?: edge_case simple_part)*
instead of " [^"]* (?:"|$)
,
i.e: "[^\\"]* (?: \\. [^\\"]*)* (?:"|$)
Upvotes: 2