Reputation: 1189
I am attempting to parse numerous CSV files (comma-separated, UTF-8 encoding) and have encountered a recurring issue.
Consider a scenario where I have three fields with the values: A, "B", C.
According to RFC 4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Based on this, my understanding is that the correct CSV representation should be: A,"""B""",C
However, many files generated by various tools are formatted as: "A, ""B"", C"
This causes CSV parsers (CsvHelper in C#, for instance) to interpret such lines as a single field rather than three separate fields.
Am I missing something here? Why is this seemingly "incorrect" format so prevalently used across different tools?
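To make the difference concrete, here is a minimal sketch using CsvHelper's low-level CsvParser to parse both forms (the Demo and Parse names are mine, purely for illustration):

using System;
using System.Globalization;
using System.IO;
using CsvHelper;

class Demo
{
    static void Main()
    {
        // RFC 4180 form: each field quoted individually, inner quotes doubled.
        Parse("A,\"\"\"B\"\"\",C");   // 3 fields: A | "B" | C
        // The form seen in the wild: the whole line wrapped in one pair of quotes.
        Parse("\"A, \"\"B\"\", C\""); // 1 field: A, "B", C
    }

    static void Parse(string line)
    {
        using var parser = new CsvParser(new StringReader(line), CultureInfo.InvariantCulture);
        parser.Read();
        Console.WriteLine($"{parser.Count} field(s): {string.Join(" | ", parser.Record)}");
    }
}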
For discussion, here is a more realistic example:
"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023,""BAS"",1 111 000.27,""NRT"",""Test, ok"","""","""","""","""","""""
I need to read files like this one.
Upvotes: 0
Views: 450
Reputation: 9074
CsvMode.Escape comes close to what you are looking for. It works for your very simple example of "A, ""B"", C". However, for ""Test, ok"" it creates two fields, "Test and ok", which I suspect is supposed to be one field. But maybe I'm wrong and that does work for you, so I thought I would at least suggest it.
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    Mode = CsvMode.Escape
};
using (var reader = new StreamReader("path\\to\\file.csv"))
using (var csv = new CsvReader(reader, config))
{
    // read records as usual
}
I tend to agree with @PanagiotisKanavos that this file has been encoded twice, which is why I would suggest reading it twice: first read each line as if it were a single field, and then parse that field to get the record.
This removes those double quotes that you say are supposed to be in the field, but I'm not convinced they are supposed to be a part of the data.
void Main()
{
    // Build two copies of the sample line (the whole record wrapped in one
    // pair of quotes, with the inner quotes doubled).
    var sb = new StringBuilder();
    sb.Append("\"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023");
    sb.Append(",\"\"BAS\"\",1 111 000.27,\"\"NRT\"\",\"\"Test, ok\"\"");
    sb.Append(",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\"\"");
    sb.AppendLine();
    sb.Append("\"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023");
    sb.Append(",\"\"BAS\"\",1 111 000.27,\"\"NRT\"\",\"\"Test, ok\"\"");
    sb.Append(",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\"\"");

    var records = new List<Foo>();
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = false
    };
    using (var reader = new StringReader(sb.ToString()))
    using (var csv = new CsvReader(reader, config))
    {
        while (csv.Read())
        {
            // First pass: treat each physical line as one quoted field.
            var line = csv.GetRecord<SingleLine>().Line;
            // Second pass: parse the unwrapped line as a normal CSV record.
            using (var reader2 = new StringReader(line))
            using (var csv2 = new CsvReader(reader2, config))
            {
                if (csv2.Read())
                {
                    var options = new TypeConverterOptions { Formats = new[] { "dd/MM/yyyy" } };
                    csv2.Context.TypeConverterOptionsCache.AddOptions<DateTime>(options);
                    var record = csv2.GetRecord<Foo>();
                    records.Add(record);
                }
            }
        }
    }
    records.Dump(); // LINQPad
}
public class SingleLine
{
    public string Line { get; set; }
}

public class Foo
{
    [Index(0)]
    public string Field0 { get; set; }
    [Index(1)]
    public DateTime Field1 { get; set; }
    [Index(2)]
    public DateTime Field2 { get; set; }
    [Index(3)]
    public DateTime Field3 { get; set; }
    [Index(4)]
    public DateTime Field4 { get; set; }
    [Index(5)]
    public string Field5 { get; set; }
    [Index(6)]
    public string Field6 { get; set; }
    [Index(7)]
    public string Field7 { get; set; }
    [Index(8)]
    public string Field8 { get; set; }
    [Index(9)]
    public string Field9 { get; set; }
    [Index(10)]
    public string Field10 { get; set; }
    [Index(11)]
    public string Field11 { get; set; }
    [Index(12)]
    public string Field12 { get; set; }
    [Index(13)]
    public string Field13 { get; set; }
}
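For what it's worth, my reading of the sample line is that the second (inner) parse yields fourteen fields per record: 00AA12345, the four dates, BAS, 1 111 000.27, NRT, "Test, ok", and five empty strings, which is why Foo has fourteen indexed properties.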
Upvotes: 1
Reputation: 415820
Rather than a specific code issue, I'll just address this high-level question:
Why do common tools deviate from RFC 4180?
The answer is that the standard wasn't published until 2005, but CSV data has been in use for FAR LONGER. For much of the time we've been using CSV data, there was no standard! (And we got by just fine, thankyouverymuch).
This means there is not only a LOT of non-compliant CSV data still in the wild, but also a significant number of legacy systems still producing new non-compliant output today, and even new applications written to match non-compliant behavior so they function within a larger system (a complete Ship of Theseus of non-compliant CSV).
CSV parsers need to be able to handle this.
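For example, CsvHelper (used in the question) exposes a few knobs for tolerating non-compliant input. A minimal sketch, assuming a recent CsvHelper version; tune to your data:

var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    BadDataFound = null,      // don't throw when a field contains stray quotes
    MissingFieldFound = null  // don't throw when a record has fewer fields than expected
};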
Upvotes: 1