Reputation: 1189
I am attempting to parse numerous CSV files (comma-separated, UTF-8 encoding) and have encountered a recurring issue.
Consider a scenario where I have three fields with the values: A, "B", C.
According to RFC 4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Based on this, my understanding is that the correct CSV representation should be: A,"""B""",C
However, many files generated by various tools are formatted as: "A, ""B"", C"
This causes CSV parsers (CsvHelper in C#, for instance) to interpret such lines as a single field rather than three separate fields.
Am I missing something here? Why is this seemingly "incorrect" format so prevalently used across different tools?
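To make the difference concrete, here is a minimal sketch using CsvHelper's low-level CsvParser to parse both forms (the Demo and Parse names are mine, purely for illustration):

using System;
using System.Globalization;
using System.IO;
using CsvHelper;

class Demo
{
    static void Main()
    {
        // RFC 4180 form: each field quoted individually, inner quotes doubled.
        Parse("A,\"\"\"B\"\"\",C");   // 3 fields: A | "B" | C
        // The form seen in the wild: the whole line wrapped in one pair of quotes.
        Parse("\"A, \"\"B\"\", C\""); // 1 field: A, "B", C
    }

    static void Parse(string line)
    {
        using var parser = new CsvParser(new StringReader(line), CultureInfo.InvariantCulture);
        parser.Read();
        Console.WriteLine($"{parser.Count} field(s): {string.Join(" | ", parser.Record)}");
    }
}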
For discussion, here is a more realistic example:
"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023,""BAS"",1 111 000.27,""NRT"",""Test, ok"","""","""","""","""","""""
I need to read files like this one.
Upvotes: 0
Views: 450
Reputation: 9074
CsvMode.Escape comes close to what you are looking for. It works for your very simple example of "A, ""B"", C". However, for ""Test, ok"" it creates two fields, "Test and ok", which I suspect is supposed to be one field. But maybe I'm wrong and that does work for you, so I thought I would at least suggest it.
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    Mode = CsvMode.Escape
};
using (var reader = new StreamReader("path\\to\\file.csv"))
using (var csv = new CsvReader(reader, config))
{
    // read records as usual
}
I tend to agree with @PanagiotisKanavos that this file has been encoded twice, which is why I would suggest reading it twice: first read each line as if it were a single field, and then parse that field to get the record.
This removes those double quotes that you say are supposed to be in the field, but I'm not convinced they are supposed to be a part of the data.
void Main()
{
    // Build two copies of the sample line (the whole record wrapped in one
    // pair of quotes, with the inner quotes doubled).
    var sb = new StringBuilder();
    sb.Append("\"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023");
    sb.Append(",\"\"BAS\"\",1 111 000.27,\"\"NRT\"\",\"\"Test, ok\"\"");
    sb.Append(",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\"\"");
    sb.AppendLine();
    sb.Append("\"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023");
    sb.Append(",\"\"BAS\"\",1 111 000.27,\"\"NRT\"\",\"\"Test, ok\"\"");
    sb.Append(",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\"\"");

    var records = new List<Foo>();
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = false
    };
    using (var reader = new StringReader(sb.ToString()))
    using (var csv = new CsvReader(reader, config))
    {
        while (csv.Read())
        {
            // First pass: treat each physical line as one quoted field.
            var line = csv.GetRecord<SingleLine>().Line;
            // Second pass: parse the unwrapped line as a normal CSV record.
            using (var reader2 = new StringReader(line))
            using (var csv2 = new CsvReader(reader2, config))
            {
                if (csv2.Read())
                {
                    var options = new TypeConverterOptions { Formats = new[] { "dd/MM/yyyy" } };
                    csv2.Context.TypeConverterOptionsCache.AddOptions<DateTime>(options);
                    var record = csv2.GetRecord<Foo>();
                    records.Add(record);
                }
            }
        }
    }
    records.Dump(); // LINQPad
}
public class SingleLine
{
    public string Line { get; set; }
}

public class Foo
{
    [Index(0)]
    public string Field0 { get; set; }
    [Index(1)]
    public DateTime Field1 { get; set; }
    [Index(2)]
    public DateTime Field2 { get; set; }
    [Index(3)]
    public DateTime Field3 { get; set; }
    [Index(4)]
    public DateTime Field4 { get; set; }
    [Index(5)]
    public string Field5 { get; set; }
    [Index(6)]
    public string Field6 { get; set; }
    [Index(7)]
    public string Field7 { get; set; }
    [Index(8)]
    public string Field8 { get; set; }
    [Index(9)]
    public string Field9 { get; set; }
    [Index(10)]
    public string Field10 { get; set; }
    [Index(11)]
    public string Field11 { get; set; }
    [Index(12)]
    public string Field12 { get; set; }
    [Index(13)]
    public string Field13 { get; set; }
}
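For what it's worth, my reading of the sample line is that the second (inner) parse yields fourteen fields per record: 00AA12345, the four dates, BAS, 1 111 000.27, NRT, "Test, ok", and five empty strings, which is why Foo has fourteen indexed properties.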
Upvotes: 1
Reputation: 415820
Rather than a specific code issue, I'll just address this high-level question:
Why do common tools deviate from RFC 4180?
The answer is that the standard wasn't published until 2005, but CSV data has been in use for FAR LONGER. For much of the time we've been using CSV data, there was no standard! (And we got by just fine, thankyouverymuch).
This means there is not only a LOT of non-compliant CSV data still in the wild, but also a significant number of legacy systems still producing new non-compliant output today, and even new applications written to match non-compliant behavior so they function within a larger system (a complete Ship of Theseus of non-compliant CSV).
CSV parsers need to be able to handle this.
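For example, CsvHelper (used in the question) exposes a few knobs for tolerating non-compliant input. A minimal sketch, assuming a recent CsvHelper version; tune to your data:

var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    BadDataFound = null,      // don't throw when a field contains stray quotes
    MissingFieldFound = null  // don't throw when a record has fewer fields than expected
};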
Upvotes: 1