Reputation: 14270
I have lots of CSV data that I am trying to parse using regex. I am building on an existing code base that other people and projects depend on, and I don't want to risk breaking their data flows by refactoring the class too much. So I was wondering whether it is possible to parse this text with a single regex (which is how the class currently works):
f1,f2,f3,f4,f5,f6,f7
,"clean text","with,embedded,commas.","with""embedded""double""quotes",,"6.1",
The first row is the header. If I save this as xxx.csv and open it in Excel, it parses it properly to read (note that the spaces between the fields are the cell breaks):
f1 f2 f3 f4 f5 f6 f7
clean text with,embedded,commas. with"embedded"double"quotes 6.1
But when I try this in .NET, I get stuck on the regex. I have this:
string regExp = "(((?<x>(?=[,\\r\\n]+))|\"(?<x>([^\"]|\"\")+)\"|(?<x>[^,\\r\\n]+)),?)";
You can see it in action here:
Which results in this:
<start>
clean text
with,embedded,commas.
with""embedded""double""quotes
6.1
<end>
This is very close, but it does not replace each escaped pair of double quotes ("") with a single double quote (") like Excel does. I could not come up with a regex that worked better. Can it be done?
Upvotes: 1
Views: 193
Reputation: 1884
Maybe you can manage to match your string using regular expression conditionals, with the following constructs:
(?(?=regex)then|else)
(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
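To make the construct concrete, here is a minimal sketch of a .NET conditional applied to a tiny CSV-like string. The class name ConditionalSketch and the sample input are only illustrative assumptions, and the pattern is deliberately simpler than what your data needs (it does not handle doubled quotes):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalSketch
{
    // Condition: is the next character a double quote?
    //   then: match a quoted run, including its quotes
    //   else: match a run of non-comma characters
    public const string Pattern = "(?(?=\")\"[^\"]*\"|[^,]+)";

    public static void Main()
    {
        // Illustrative input: one quoted field with an embedded comma, one plain field.
        foreach (Match m in Regex.Matches("\"a,b\",c", Pattern))
        {
            Console.WriteLine(m.Value);
        }
        // Prints "a,b" (quotes included) on the first line and c on the second.
    }
}
```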
I came up with the following pattern to match the body of your text:
([^\,]+(?(?=[^\,])([^\"]+")|([^\,]+,)))
However, you will need to put in some extra effort to create a completely matching expression for your text, or end up using a file parser. If so, you can take a look at FileHelpers, a pretty neat library for parsing text files.
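One thing worth keeping in mind: a match can only return text that is actually present in the input, so turning "" into " has to happen after the match rather than inside the pattern. A minimal sketch of that approach, keeping the pattern from the question unchanged, could look like the following. The class name CsvRowSketch and the helper ParseRow are hypothetical names, not part of your existing class:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvRowSketch
{
    // The pattern from the question, unchanged.
    private const string Pattern =
        "(((?<x>(?=[,\\r\\n]+))|\"(?<x>([^\"]|\"\")+)\"|(?<x>[^,\\r\\n]+)),?)";

    // Hypothetical helper: runs the pattern over one row and unescapes quoted fields.
    public static List<string> ParseRow(string row)
    {
        var fields = new List<string>();
        foreach (Match m in Regex.Matches(row, Pattern))
        {
            string value = m.Groups["x"].Value;
            // Only fields matched by the quoted branch can contain escaped quotes;
            // those matches start with the opening quote character.
            if (m.Value.StartsWith("\"", StringComparison.Ordinal))
            {
                value = value.Replace("\"\"", "\"");
            }
            fields.Add(value);
        }
        return fields;
    }

    public static void Main()
    {
        // The data row from the question, as a bare string with no trailing newline.
        string row = ",\"clean text\",\"with,embedded,commas.\","
                   + "\"with\"\"embedded\"\"double\"\"quotes\",,\"6.1\",";
        foreach (string field in ParseRow(row))
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```

With the row passed as a bare string like this, the pattern produces six matches rather than seven: the empty-field branch relies on a lookahead for a following comma or line break, so the empty field after the final comma is not picked up at the end of the input. The fourth field comes out as with"embedded"double"quotes.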
Upvotes: 1