Reputation: 88092
We have an integration with another system that relies on passing CSV files back and forth (really old school).
The structure is generally:
ID, Name, PhoneNumber, comments, fathersname
1, tom, 555-1234, just some random text, bill
2, jill smith, 555-4234, other random text, richard
Every so often we see this:
3, jacked up, 999-1231, here
be dragons
amongst us, ted
The primary problem I care about is detecting that a line break (\n) occurs in the middle of a record when that is also the record terminator.
Is there any way I can preprocess this to reliably fix it?
Note that we have zero control over what the other system emits.
Upvotes: 3
Views: 314
Reputation: 12468
Firstly, I would recommend using a tool to manage reading and writing your CSV files. I use the FileHelpers library, which is great.
You can essentially type your records and it will do all the validation and such for you. Worth the effort.
To your question: perhaps you can do some preprocessing on the file and use a Regex to replace any line breaks with a space?
I do something similar (not with files, though). Try:
// Replace returns a new string, so assign the result
line = line.Replace(Environment.NewLine, " ");
With FileHelpers you could write a custom converter to do this during processing, or hook into the engine's BeforeReadRecord event.
Upvotes: 0
Reputation: 1371
Let me start by saying that the CSV file in your example is invalid. If a line break occurs inside a field, the field should be wrapped in double quotes.
Now for the answer: to parse this invalid CSV format you must make some assumptions. In this case I made two: 1) the ID column is always numeric; 2) the comment field cannot contain digits.
Based on these assumptions you can check the first character after each line break. If it is a digit, you assume it starts a new record. If not, you treat the line as a continuation of the comment field. (Strictly, the check only needs continuation lines not to start with a digit.)
I don't know if the second assumption is valid. If it isn't, you can enhance the logic so that it covers the business rules of the system.
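The check described above could be sketched roughly as follows. This is only a minimal illustration under the two assumptions stated in this answer; the class and method names (CsvRepair, MergeBrokenLines) are made up for the example, and the header line is kept as-is because it is the first line.

```csharp
using System;
using System.Collections.Generic;

static class CsvRepair
{
    // Hypothetical helper: merges physical lines into logical records.
    // Assumption: a line starts a new record only if its first character
    // is a digit (the numeric ID); any other line continues the previous one.
    public static List<string> MergeBrokenLines(IEnumerable<string> lines)
    {
        var records = new List<string>();
        foreach (var line in lines)
        {
            bool startsNewRecord = line.Length > 0 && char.IsDigit(line[0]);
            if (startsNewRecord || records.Count == 0)
            {
                // The very first line (the header) or the start of a new record.
                records.Add(line);
            }
            else
            {
                // Continuation: glue it back on, restoring the line break.
                records[records.Count - 1] += "\n" + line;
            }
        }
        return records;
    }
}
```

Fed the six physical lines from the question (header included), this should give four logical lines, with the "jacked up" record keeping its two embedded line breaks. A comment continuation that happens to begin with a digit would still be misread as a new record.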
Good Luck!
Upvotes: 0
Reputation: 203850
So you should be able to do something more or less like this:
for (int i = 0; i < lines.Count; i++)
{
    var fields = lines[i].Split(',').ToList();
    // Too few fields means a line break split this record ("here be dragons amongst us").
    while (fields.Count < numFields)
    {
        i++; // include the next line in this record
        // Make sure we haven't run out of lines.
        if (i >= lines.Count)
            throw new System.IO.InvalidDataException("File ended in the middle of a record.");
        // Combine the end of the previous field with the start of the next one,
        // and add the line break back in.
        var innerFields = lines[i].Split(',');
        fields[fields.Count - 1] += "\n" + innerFields[0];
        fields.AddRange(innerFields.Skip(1));
    }
    // We now know we have a "real" full line.
    processFields(fields);
}
(For simplicity I assumed all lines were read in at the start; I assume you could alter it to lazily fetch each line easily enough.)
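A lazy variant might look something like the sketch below. It is not a drop-in implementation: LazyCsvReader and ReadRecords are names invented for the example, and it shares the assumption of the loop above that fields never contain commas, so the field count alone tells you when a record is complete.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LazyCsvReader
{
    // Hypothetical helper: yields one complete record at a time from any
    // line source, for example File.ReadLines(path), which reads lazily.
    // Assumption: fields never contain commas.
    public static IEnumerable<List<string>> ReadRecords(IEnumerable<string> lines, int numFields)
    {
        List<string> fields = null;
        foreach (var line in lines)
        {
            var parts = line.Split(',');
            if (fields == null)
            {
                // Start of a new record.
                fields = parts.ToList();
            }
            else
            {
                // Continuation: rejoin the split field, restoring the line break.
                fields[fields.Count - 1] += "\n" + parts[0];
                fields.AddRange(parts.Skip(1));
            }

            if (fields.Count >= numFields)
            {
                yield return fields;
                fields = null;
            }
        }

        // Leftover fields mean the input stopped partway through a record.
        if (fields != null)
            throw new InvalidDataException("Input ended in the middle of a record.");
    }
}
```

Usage would be along the lines of foreach (var record in LazyCsvReader.ReadRecords(File.ReadLines("input.csv"), 5)) processFields(record); where "input.csv" is a placeholder path. Only the current record is held in memory, which helps if the files are large.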
Upvotes: 1