Reputation: 1046
I have inherited some code that uses regular expressions to parse CSV formatted data. Until now it didn't need to cope with empty string fields, but the requirements have changed and empty string fields are now a possibility.
I have changed the regular expression from this:
new Regex("((?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")+)\")(,|(?<rowbreak>\\r\\n|\\n|$))");
to this:
new Regex("((?<field>[^\",\\r\\n]*)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");
(i.e. I have changed each + to *)
The problem is that I now get an extra empty field at the end, e.g. "ID,Name,Description" returns four fields: "ID", "Name", "Description" and "".
Can anyone spot why?
Upvotes: 0
Views: 1733
Reputation: 93010
The problem with your regex is that it can now match the empty string.
$ works a little like a lookahead: it guarantees that the match is at the end of the string, but it is not part of the match and consumes no characters.
So when you have "ID,Name,Description", your first match is "ID," and the rest is "Name,Description".
Then the next match is "Name," and the rest is "Description".
The next match is "Description" and the rest is "".
The regex is then tried once more against that empty remainder: the unquoted field alternative matches zero characters and $ succeeds again, so the final match is the empty string.
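To see this for yourself, here is a minimal sketch that prints every match produced by the modified regex from the question. The class name is arbitrary and the console program around the regex is my own scaffolding, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

class EmptyMatchDemo
{
    static void Main()
    {
        // The modified pattern from the question (+ changed to *).
        var rx = new Regex("((?<field>[^\",\\r\\n]*)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

        // Print where each match starts, how long it is, and what "field" captured.
        foreach (Match m in rx.Matches("ID,Name,Description"))
        {
            Console.WriteLine($"index={m.Index} length={m.Length} field=[{m.Groups["field"].Value}]");
        }
        // Output:
        // index=0 length=3 field=[ID]
        // index=3 length=5 field=[Name]
        // index=8 length=11 field=[Description]
        // index=19 length=0 field=[]
    }
}
```

The last line is the zero-length match at the very end of the input, which is where the fourth, empty field comes from.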
Upvotes: 1
Reputation: 111820
Try this one:
var rx = new Regex("((?<=^|,)(?<field>)(?=,|$)|(?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");
I moved the handling of "blank" fields to a third "or" branch. Now, the handling of "" already worked (and you didn't need to modify it, it was the second (?<field>) block of your code), so what you need to handle are four cases:
,
,Id
Id,
Id,,Name
And this one should do it:
(?<=^|,)(?<field>)(?=,|$)
An empty field must be preceded by the beginning of the row (^) or by a comma, must be of length zero (there isn't anything in the (?<field>) capture), and must be followed by a comma or by the end of the line ($).
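A quick way to check this is to run the regex over the four cases above plus the input from the question. This is only a sketch: the Fields helper and the class name are made up for illustration, and it only exercises single-row input with no RegexOptions set.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class BlankFieldDemo
{
    // The regex from this answer, unchanged.
    static readonly Regex Rx = new Regex("((?<=^|,)(?<field>)(?=,|$)|(?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

    // Hypothetical helper: collects the "field" capture of every match in one row.
    static string[] Fields(string row)
    {
        return Rx.Matches(row).Cast<Match>().Select(m => m.Groups["field"].Value).ToArray();
    }

    static void Main()
    {
        foreach (var row in new[] { ",", ",Id", "Id,", "Id,,Name", "ID,Name,Description" })
        {
            var fields = Fields(row);
            Console.WriteLine($"{row} -> {fields.Length} field(s): [{string.Join("|", fields)}]");
        }
        // Output:
        // , -> 2 field(s): [|]
        // ,Id -> 2 field(s): [|Id]
        // Id, -> 2 field(s): [Id|]
        // Id,,Name -> 3 field(s): [Id||Name]
        // ID,Name,Description -> 3 field(s): [ID|Name|Description]
    }
}
```

The last row no longer produces a trailing empty field, because at the end of "Description" the lookbehind finds neither the start of the string nor a comma.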
Upvotes: 2
Reputation: 57172
I would suggest using the FileHelpers library. It is easy to use, does its job, and will make your code much easier to maintain.
Upvotes: 1