QldRobbo

Reputation: 129

Splitting csv like string using regex

I have a regex pattern defined as

var pattern = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))";

and I am trying to split some CSV-like strings into fields.

Some example strings that WORK with this regex are

_input[0] = ""; // expected single blank field
_input[1] = "A,B,C"; // expected three individual fields
_input[2] = "\"A,B\",C"; // expected two fields 'A,B' and C
_input[3] = "\"ABC\"\",\"Text with,\""; // expected two fields, 'ABC"', 'Text with,'
_input[4] = "\"\",ABC\",\"next_field\""; // expected two fields, '",ABC', 'next_field'

However, this is not working

_input[5] = "\"\"\",ABC\",\"next_field\"";

I am expecting three fields

'"', 'ABC"', 'next_field'

But I am getting two fields

'"",ABC', 'next_field'

Can anybody help with this regex?

I think the strange part is that the second column doesn't have quotes at the start and end of the value, just at the end. So the first column's value is empty, and the second column is ABC"
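
For reference, a minimal sketch of how the behaviour can be reproduced. It assumes the pattern is applied with Regex.Split, which the question does not actually show; the Repro class and the SplitFields name are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class Repro
{
    // The pattern from the question, written as a verbatim string
    // (inside @"...", "" stands for one literal double quote).
    const string Pattern = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";

    // Assumption: the fields are obtained by splitting on every comma
    // that the lookahead accepts.
    public static string[] SplitFields(string input) => Regex.Split(input, Pattern);

    static void Main()
    {
        string working = "\"A,B\",C";                     // _input[2]
        string failing = "\"\"\",ABC\",\"next_field\"";  // _input[5]

        Console.WriteLine(string.Join(" | ", SplitFields(working)));
        Console.WriteLine(string.Join(" | ", SplitFields(failing)));
    }
}
```

The lookahead only accepts a comma when an even number of double quotes follows it. In _input[5] the first comma is followed by three quotes, so it is not treated as a separator and the string comes back as two pieces. The pieces returned by Regex.Split still carry their surrounding quotes; stripping them is a separate step.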

Thanks, Rob

Upvotes: 3

Views: 1784

Answers (1)

Johny Skovdal

Reputation: 2104

I think you need to be more specific about how the double quotes should be treated, because your requirements appear to conflict with each other.

My quick version, which I think comes closest to what you are trying to achieve, is below. Note two things: (1) the double quotes are not escaped here, because I used an external tool to validate the regex, and (2) I have changed how the matched values are retrieved; see the example at the bottom.

(?<Match>(?:"[^"]*"+|[^,])*)(?:,(?<Match>(?:"[^"]*"+|[^,])*))*

It has the following logic:

  • If there is a double quote, then include everything in it, until an end double quote is hit.
  • When reaching an end double quote, double quotes following immediately after will also be included.
  • If the next character is anything but a comma, it is included, and the above is tested again.
  • If it is a comma, the current match is concluded and a new one begins after the comma.
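
The four rules can also be sketched as a hand-written scanner, which may make the behaviour easier to step through than the regex. This is only an illustration of the logic as listed, not a replacement for the pattern; FieldScanner and Split are hypothetical names. It treats a double quote with no closing partner as an ordinary character, which is what the `[^,]` alternative does in the regex.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class FieldScanner
{
    // Hypothetical helper: splits on commas that lie outside a quoted run.
    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (c == '"')
            {
                int close = input.IndexOf('"', i + 1);
                if (close >= 0)
                {
                    // Rule 1: include everything up to the closing quote.
                    // Rule 2: include any quotes that follow it directly.
                    int end = close + 1;
                    while (end < input.Length && input[end] == '"') end++;
                    current.Append(input, i, end - i);
                    i = end;
                    continue;
                }
                // No closing quote: fall through and treat it as plain text.
            }
            if (c == ',')
            {
                // Rule 4: a comma ends the current field and starts a new one.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                // Rule 3: any other character belongs to the current field.
                current.Append(c);
            }
            i++;
        }
        fields.Add(current.ToString());
        return fields;
    }

    static void Main()
    {
        foreach (string field in Split("\"A,B\",C"))
            Console.WriteLine(field);
    }
}
```

Because a quote that follows ABC opens a new quoted run, the comma inside that run is swallowed rather than treated as a separator.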

However, this logic conflicts with what you expect for indexes 4 and 5, because I get:

[4] = '""' and 'ABC","next_field"'
[5] = '"""' and 'ABC","next_field"'

If you could point out why the above logic is wrong for your needs/expectations, I'll edit my answer with a fully working regex.

To retrieve your values, you could do it like this:

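using System;
using System.Linq;                      // needed for Cast<T>() and Select()
using System.Text.RegularExpressions;

string pattern = @"(?<Match>(?:""[^""]*""+|[^,])*)(?:,(?<Match>(?:""[^""]*""+|[^,])*))*";

// The six inputs from the question as verbatim strings
// (inside @"...", "" stands for one literal double quote).
string[] testCases = new[]{
  @"",
  @"A,B,C",
  @"""A,B"",C",
  @"""ABC"""",""Text with,""",
  @""""",ABC"",""next_field""",
  @""""""",ABC"",""next_field"""
};

foreach(string testCase in testCases){
  var match = Regex.Match(testCase, pattern);
  // Each repetition of the named group "Match" is kept as a separate Capture.
  string[] matchedValues = match.Groups["Match"].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();
  Console.WriteLine(string.Join(" | ", matchedValues));
}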

Upvotes: 3
