Laurence
Laurence

Reputation: 7823

Best way to ignore Comma (,) in CSV files if the Comma(,) is in quotation marks?

I have a small program to read CSV files to build datatable out of it. One requirement is to ignore commas (commas in names, etc) if the commas are between quotation marks. Example.

          Name, Age, Location
          "Henderson, David", 32, London
           John Smith, 19, Belfast

The program should ignore the comma after Henderson and read Henderson, David as one field. My current code can't do this job adding extra column at the end. So How can I achieve it? The solution should not replace the comma between the quotation marks. Thanks.

My current code.

 Public Function BuildDataTable() As DataTable

    Dim myTable As DataTable = New DataTable("MyTable")
    Dim i As Integer
    Dim myRow As DataRow
    Dim fieldValues As String()        
    Dim myReader As StreamReader = New StreamReader(_fileFullPath, Encoding.GetEncoding("iso-8859-1"))

    Try           
        fieldValues = myReader.ReadLine().Split(_seperator)
        'Create data columns accordingly
        If _hasheader = False Then
            For i = 0 To fieldValues.Length() - 1
                myTable.Columns.Add(New DataColumn("Column(" & i & ")"))
            Next
        Else
            'if the file has header, take the first row as header for datatable
            For i = 0 To fieldValues.Length() - 1
                myTable.Columns.Add(New DataColumn(fieldValues(i).Replace(" ", "")))
            Next
        End If

        myRow = myTable.NewRow

        If _hasheader = False Then
            For i = 0 To fieldValues.Length() - 1
                myRow.Item(i) = fieldValues(i).ToString
            Next
            myTable.Rows.Add(myRow)
        End If

        While myReader.Peek() <> -1
            fieldValues = myReader.ReadLine().Split(_seperator)
            myRow = myTable.NewRow
            For i = 0 To fieldValues.Length() - 1
                myRow.Item(i) = fieldValues(i).Trim.ToString
            Next

            If Not csv2xml.AreAllColumnsEmpty(myRow) = True Then
                myTable.Rows.Add(myRow)
            End If

        End While
    Catch ex As Exception                   
    End Try     
End Function

Upvotes: 1

Views: 6374

Answers (2)

charlee
charlee

Reputation: 1369

I'm not familiar with Visual Basic but I think you should not use a Split() function to split the line.

fieldValues = myReader.ReadLine().Split(_seperator)    ' DO NOT do this

Instead, write your own split function, which reads each characters one by one. Then have a flag to record whether you are between the double quotation marks.


UPDATE

I'm sorry I know too little about VB or C# to write a runnable code sniplet. Please read this pseudocode (in fact it is JavaScript)...hope it is useful.

function split_with_quote(string, delimiter, quotation) {
    if (delimiter == null) delimiter = ',';
    if (quotation == null) quotation = '"';
    var in_quotation = false;
    var result = [];
    var part = '';
    for (var i = 0; i < string.length; i++) {
        var ch = string[i];
        if (ch == quotation)  in_quotation = !in_quotation;
        if (ch == delimiter && !in_quotation) {
            result.push(part);
            part = '';
        } else {
            if (ch != quotation) part += ch;
        }
    }
    return result;
}

a = 'abc,def,"ghi,jkl",123';
split_with_quote(a);    // ["abc", "def", "ghi,jkl"]

Upvotes: 0

Jay Riggs
Jay Riggs

Reputation: 53593

You're looking to use the double quote character as a text qualifier in your CSV. Text qualifers allow you to use your field delimiter character(s) in a field value if the field is enclosed in the text qualifier character.

You can progam this yourself but that would be a mistake. There are plenty of free and capable CSV parsers that can do this for you. Since you're using Visual Basic you can take a look at the TextFieldParser class.

You'll still need to write code that will write a CSV's contents into a DataTable.

I found the following that seems to work:
http://www.vbcode.com/asp/showsn.asp?theID=13645

Another option is the GenericParser over at codeproject.com. Don't let the fact that the code in the article is written in C# bother you; you can still reference the DLL (GenericParsing.dll) in your project and use it in VB.

The nice thing about this parser is it includes a method you can use to return a DataTable for you from a CSV. Here's an example which works with your sample data:

Using parser As New GenericParsing.GenericParserAdapter(CSV_FILE_FULLNAME)
    parser.ColumnDelimiter = ","
    parser.TextQualifier = """"
    parser.FirstRowHasHeader = True
    Dim dt As DataTable = parser.GetDataTable()
End Using

Upvotes: 3

Related Questions