Baxter

Reputation: 5835

Optimize Line Break Regular Expression

I am building a string array of lines from comma-separated value (CSV) text.
I need to split the text on \n, but not on line breaks that are inside quotes.

Here is the code currently:

string[] lines = Regex.Split(value, @"\n(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");

It takes a really long time to execute.
Is there a better option I could be using?

Thanks for any help on this.

UPDATE: Here is an example of the \n characters I want to skip because they are inside quotes:

\"Address- 430 Building F\r\n\r\nNickname- Joe\"
I don't know why the data is all crazy like that but I don't want it splitting on those \n in the quotes.

Upvotes: 2

Views: 92

Answers (1)

BlackBear

Reputation: 22979

You can do this without regexes: split at every \n, then count the quotes in each resulting line and combine that count with the decision you made for the preceding line.

For the first line:

  • Even number of quotes: the split after this line is correct, keep it;
  • Odd number of quotes: the split is wrong, merge this line with the next one.

For the other lines:

  • Even number of quotes: do what you did for the previous line;
  • Odd number of quotes: do the opposite of what you did for the previous line.
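The two rule sets above can be sketched as a single pass that carries the previous decision forward. This is only an illustration: the class and method names (QuoteParity, KeepSplit) are made up for this example, and it assumes every quote character in the data is a plain " (an escaped quote written as "" adds two, so it does not change the parity).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class QuoteParity
{
    // For each piece produced by splitting on '\n', returns true when the
    // line break after that piece is a real record boundary (keep the split)
    // and false when it fell inside quotes (merge with the next piece).
    public static bool[] KeepSplit(IReadOnlyList<string> pieces)
    {
        var keep = new bool[pieces.Count];

        // Starting from "split" makes the first-line rule a special case of
        // the general one: even means split, odd means do not split.
        bool previous = true;

        for (int i = 0; i < pieces.Count; i++)
        {
            bool even = pieces[i].Count(c => c == '"') % 2 == 0;

            // Even: repeat the previous decision. Odd: do the opposite.
            keep[i] = even ? previous : !previous;
            previous = keep[i];
        }
        return keep;
    }
}
```

Treating the state before the first line as "split" is a design choice of this sketch; it lets one loop cover both the first-line rules and the rules for the other lines.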

The idea behind this is that quotes have to appear in pairs. If a line has an odd number of quotes, then either a quote opened on this line is not closed, or a quote opened on an earlier line is closed on this one, so the inside/outside state flips. Conversely, if the number is even, the line ends in the same state as the previous line.
Basically you split at every \n, then 'unsplit' (merge together) two lines if the \n between them was inside quotes.
The good thing about this approach is that it can be easily parallelized, because the quotes on each line can be counted independently of every other line!
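As a rough sketch of the split-then-merge idea, assuming the whole text fits in memory: the quote counting runs in parallel with PLINQ, and a cheap sequential pass does the merging. The names CsvLineSplitter and SplitRecords and the joiner parameter are hypothetical, not part of any library. With \r\n line endings the \r stays attached to the end of each piece, so you may want to trim it from finished records.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class CsvLineSplitter
{
    // Splits text on '\n', then re-joins pieces whose line break fell inside
    // a quoted field. 'joiner' is what replaces the embedded break; the
    // default "\n" restores the original text of the field.
    public static List<string> SplitRecords(string text, string joiner = "\n")
    {
        string[] pieces = text.Split('\n');

        // Counting quotes is independent per piece, so it can run in parallel.
        // AsOrdered keeps the results in the same order as the pieces.
        bool[] oddQuotes = pieces
            .AsParallel()
            .AsOrdered()
            .Select(p => p.Count(c => c == '"') % 2 == 1)
            .ToArray();

        var records = new List<string>();
        var current = new StringBuilder();
        bool insideQuotes = false;

        for (int i = 0; i < pieces.Length; i++)
        {
            current.Append(pieces[i]);

            // An odd quote count flips the state; an even count leaves it alone.
            if (oddQuotes[i]) insideQuotes = !insideQuotes;

            if (insideQuotes && i < pieces.Length - 1)
            {
                current.Append(joiner); // break was inside quotes: merge
            }
            else
            {
                // Real record boundary, or end of (possibly malformed) input.
                records.Add(current.ToString());
                current.Clear();
            }
        }
        return records;
    }
}
```

For example, CsvLineSplitter.SplitRecords("a,\"b\nc\"\nd,e") yields two records, the first still containing the embedded line break. Passing "" as the joiner drops the embedded break instead.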

For example, take the following text:

"aa", "bb", "cc",\n
"11", "22", "3\n
3",\n
"xx", "y\n
y", "z\n
z"

Here's how this would work:

"aa", "bb", "cc",

This is the first line and it has an even number of quotes (6), so the split after it is correct.

"11", "22", "3

Odd number of quotes (5), so you should do the opposite of what you did for the previous line. Since the previous line was split, this one should not have been, so merge it with the next one. Indeed, the last quote is closed on the next line.

3",

Odd number of quotes (1); do the opposite. Last time we merged so this split is correct.

"xx", "y

Odd number of quotes (3), so we should do the opposite. Previously we did not merge, so this time we merge this line with the following one.

y", "z

Even number of quotes (2), so do as you did for the previous line (i.e. merge with the following one).

z"

Odd number of quotes (1), so this time we should not merge. (If the last line had to be merged with a line following it, that would be a sign of malformed input data.)
The final output, with the line breaks inside quotes dropped when merging, is therefore:

"aa", "bb", "cc",\n
"11", "22", "33",\n
"xx", "yy", "zz"\n

Upvotes: 1
