Tim Butterfield

Reputation: 587

RegEx - Parse Csv Text

There are loads of posts on here saying that, rather than rolling my own CSV parser, I ought to use the VB.NET TextFieldParser.

I tried it but, and please tell me if I'm wrong, it parses based on a single delimiter.

So if I have an address field "Flat 1, StackOverflow House, London" I get three fields. Unfortunately that's not what I want. I need everything in a given cell to remain as a single item in the array.

So I started to write my own regex, as follows:

var testString = @"""Test 1st string""" + "," + @"""Flat 1, StackOverflow House, London, England, The Earth""" + "," + "123456";

// Quoted fields, allowing backslash-escaped characters inside the quotes
var matches = Regex.Matches(testString, @"""([^""\\])*?(?:\\.[^""\\]*)*?""");
// A run of digits at the very end of the string
var numbers = Regex.Matches(testString, @"\d+$");
Assert.That(matches.Count, Is.EqualTo(3));
Assert.That(numbers.Count, Is.EqualTo(1));

The first assertion fails because the string "123456" is not returned. The expression only returns "Test 1st string" and "Flat 1, StackOverflow House, London, England, The Earth".

What I'd like is for the regex to return everything quoted\escaped, and numbers.

I don't control the data but figure strings will all be quoted\escaped and numbers won't.

I'd really appreciate some help as I'm going around in circles trying third party libraries without much success.

Needless to say, string.Split doesn't work in the case of addresses, and http://www.filehelpers.com/ doesn't seem to account for such examples.

Upvotes: 2

Views: 1663

Answers (2)

paulslater19

Reputation: 5917

A hacky way that I used to quickly get round it was to first Split by quotation marks, then, in every other element (the ones that were inside quotes), strip out the commas (or replace them with something). Then Split the string again on the commas.
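A minimal C# sketch of that idea, under one reading of it: the commas inside the quoted pieces are swapped for a placeholder before the final split, then swapped back. The class name, method name and placeholder character are made up for illustration, and the placeholder is assumed never to occur in the real data.

```csharp
using System;
using System.Linq;

public static class QuoteSplitSketch
{
    // Hypothetical placeholder; assumed never to appear in the input.
    private const char Placeholder = '\u0001';

    public static string[] SplitLine(string line)
    {
        // After splitting on the quote character, the odd-indexed
        // pieces are the ones that were inside quotes.
        string[] parts = line.Split('"');
        for (int i = 1; i < parts.Length; i += 2)
        {
            // Protect commas that belong to a quoted field.
            parts[i] = parts[i].Replace(',', Placeholder);
        }

        // Rejoin without the quotes, split on the remaining commas,
        // then put the protected commas back.
        return string.Concat(parts)
            .Split(',')
            .Select(field => field.Replace(Placeholder, ','))
            .ToArray();
    }
}
```

Calling QuoteSplitSketch.SplitLine on the test string from the question gives three fields, with the whole address as the second one. It does not cope with doubled quotes ("") inside a field or with quoted fields that span lines, so it really is only a quick workaround.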

Just found this: Javascript code to parse CSV data. I appreciate that it's JavaScript and not VB.NET, but you should be able to follow it.

See also: How can I parse a CSV string with Javascript, which contains comma in data?

Upvotes: 0

Tim Pietzcker

Reputation: 336098

Just to give you an idea of what you're up against, here's a regex that should work quite well. But you definitely need to test the heck out of it, since there are so many corner cases with CSV that I'm sure to have missed some. I'm assuming the comma as the separator and " as the quote character (which is escaped by doubling):

(?:           # Match either
 (?>[^",\n]*) #  0 or more characters except comma, quote or newline
|             # or
 "            #  an opening quote
 (?:          #  followed by either
  (?>[^"]*)   #   0 or more non-quote characters
 |            #  or
  ""          #   an escaped quote ("")
 )*           #  any number of times
 "            #  followed by a closing quote
)             # End of alternation
(?=,|$)       # Assert that the next character is a comma (or end of line)

In VB.NET:

Imports System.Collections.Specialized
Imports System.Text.RegularExpressions

Dim ResultList As StringCollection = New StringCollection()
Dim RegexObj As New Regex(
    "(?:            # Match either" & chr(10) & _
    " (?>[^"",\n]*) #  0 or more characters except comma, quote or newline" & chr(10) & _
    "|              # or" & chr(10) & _
    " ""            #  an opening quote" & chr(10) & _
    " (?:           #  followed by either" & chr(10) & _
    "  (?>[^""]*)   #   0 or more non-quote characters" & chr(10) & _
    " |             #  or" & chr(10) & _
    "  """"         #   an escaped quote ("""")" & chr(10) & _
    " )*            #  any number of times" & chr(10) & _
    " ""            #  followed by a closing quote" & chr(10) & _
    ")              # End of alternation" & chr(10) & _
    "(?=,|$)        # Assert that the next character is a comma (or end of line)", 
    RegexOptions.Multiline Or RegexOptions.IgnorePatternWhitespace)
Dim MatchResult As Match = RegexObj.Match(SubjectString)
While MatchResult.Success
    ResultList.Add(MatchResult.Value)
    MatchResult = MatchResult.NextMatch()
End While
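Since the question's own code is C#, here is a hedged C# sketch of the same pattern applied to a single line. Two things in it are additions, not part of the pattern above. First, a leading lookbehind (?<=^|,): as far as I can tell, the first alternative can match an empty string, so without an anchor the pattern may also report zero-length matches at the comma after each field. Second, the enclosing quotes are stripped and doubled quotes collapsed. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvRegexSketch
{
    // The pattern above on one line, prefixed with (?<=^|,) so that a
    // match can only begin at the start of a line or right after a comma.
    // The lookbehind is an assumption added for this sketch.
    private static readonly Regex FieldPattern = new Regex(
        @"(?<=^|,)(?:(?>[^"",\n]*)|""(?:(?>[^""]*)|"""")*"")(?=,|$)",
        RegexOptions.Multiline);

    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(line))
        {
            string value = m.Value;
            bool quoted = value.Length >= 2
                && value[0] == '"'
                && value[value.Length - 1] == '"';
            if (quoted)
            {
                // Drop the enclosing quotes and collapse "" to ".
                value = value.Substring(1, value.Length - 2)
                             .Replace("\"\"", "\"");
            }
            fields.Add(value);
        }
        return fields;
    }
}
```

On the test string from the question, ParseLine should give three fields: Test 1st string, the full address, and 123456. Input with Windows line endings (\r\n) would need the \r handled first, since $ only matches before \n here.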

Upvotes: 2
