SF Lee
SF Lee

Reputation: 1777

Regex for parsing CSV

I'm trying to write a Regex that that will extract individual fields from a CSV file.

For example, if given the following line in a CSV file:

123,    Bob    ,Bob, " Foo Bar ", "a, ""b"", c"

Should give the following results (without the single quotes):

'123'
'Bob'
'Bob'
' Foo Bar '
'a, "b", c'

Note that leading and trailing white spaces should be trimmed unless they are within quotes.

I'm not worried about invalid CSV lines such as open quotes without matching closing quotes. You can safely assume that the CSV file is perfectly valid according to the rules above.

I'm also fine with using multiple Regexes if a single one is difficult. But I like to avoid using standard C# operations unless they are simple and short. (I don't want to end up with writing lots of code.)

So, any suggestions?

Upvotes: 0

Views: 1363

Answers (4)

David Woodward
David Woodward

Reputation: 1273

I agree that regex is not the "right" answer, but it is what the question asked for and I like a good regex challenge.

The pattern below is a modified version of my standard CSV parsing regex that removes the spaces and assumes the CSV is perfect as you requested. The only part of your question not addressed by it is that it will not remove escaped/doubled quotes. Examples for unescaping the quotes are given after the patterns.


When one or more lines/records of a CSV file/stream are passed to the regular expression below it will return a match for each non-empty line/record. Each match will contain a capture group named Value that contains the captured values in that line/record.


Here's the commented pattern (test it on Regexstorm.net):

(?<=\r|\n|^)(?!\r|\n|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  [^\S\r\n]*                                  // Removes leading spaces
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>[^,\r\n]+)|                      // Unquoted/open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  [^\S\r\n]*                                  // Removes trailing spaces
  (?:,|(?=\r|\n|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$)                              // Records end at EOL


Here's the raw pattern without all the comments or whitespace.

(?<=\r|\n|^)(?!\r|\n|$)(?:[^\S\r\n]*(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>[^,\r\n]+)|(?<Value>))[^\S\r\n]*(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)


And, here's the C# escaped version.

String CSVPattern=
    @"(?<=\r|\n|^)(?!\r|\n|$)" +
    @"(?:" +
        @"[^\S\r\n]*" +
        @"(?:" +
            @"""(?<Value>(?:[^""]|"""")*)""|" +
            @"(?<Value>[^,\r\n]+)|" +
            @"(?<Value>)" +
        @")" +
        @"[^\S\r\n]*" +
        @"(?:,|(?=\r|\n|$))" +
    @")+?" +
    @"(?:(?<=,)(?<Value>))?" +
    @"(?:\r\n|\r|\n|$)";


Examples on how to use the regex pattern (well, the original pattern which can be replaced with this pattern) can be found on my answer to a similar question here, or on C# pad here, or here.

NOTE: The examples above contain the logic for unescaping/undoubling quotes as seen below:

if (Capture.Length == 0 || Capture.Index == Record.Index || Record.Value[Capture.Index - Record.Index - 1] != '\"')
{
    // No need to unescape/undouble quotes if the value is empty, the value starts
    // at the beginning of the record, or the character before the value is not a
    // quote (not a quoted value)
    Console.WriteLine(Capture.Value);
}
else
{
    // The character preceding this value is a quote
    // so we need to unescape/undouble any embedded quotes
    Console.WriteLine(Capture.Value.Replace("\"\"", "\""));
}

Upvotes: 1

Rajesh
Rajesh

Reputation: 7876

You can use the TextFieldParser class which is inbuilt in .NET framework.

In order to use that class in your C# application you would need to add reference of Microsoft.VisualBasic.dll at the following location(assuming you did a default settings installation)

C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.0\Microsoft.VisualBasic.dll

Now in your C# class have the below using statement:

using Microsoft.VisualBasic.FileIO

Upvotes: 0

Jigar Pandya
Jigar Pandya

Reputation: 5977

Well there are many gotchas and error possiable with Regexes... try following code it did trick for me and it is sweet and simple...

Using Reader As New Microsoft.VisualBasic.FileIO.TextFieldParser("C:\MyFile.csv")

Reader.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited

Dim MyDelimeters(0 To 0) As String
Reader.HasFieldsEnclosedInQuotes = False
Reader.SetDelimiters(","c)

Dim currentRow As String()
While Not Reader.EndOfData
    Try
        currentRow = Reader.ReadFields()
        Dim currentField As String
        For Each currentField In currentRow
            MsgBox(currentField)
        Next
    Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException
        MsgBox("Line " & ex.Message &
        "is not valid and will be skipped.")
    End Try
End While
End Using

Mark as answer if found handy ...;)

Please see the same implementation here,,,

Upvotes: 1

Peter
Peter

Reputation: 38455

i would not try to write my own csv parser there many out there that do the job for you.

Upvotes: 5

Related Questions