kikulikov

Reputation: 2582

Regex to match invalid CSV line with unescaped quotes

Let's say I have a file of strings like

11,"abc","def"
12,"ab "c"","def" // invalid
13,"ab,"c"","def" // invalid
14,""a" b,c","def" // invalid
15,""a", "b"c","def" // invalid

As you can see, some of the double quotes are unescaped. I'd like to filter out the invalid lines before I try to parse them.

I'm thinking of using something like \,\".+\"\, to find a token and then checking that it doesn't contain "," inside, but I can't figure out how to make it work.

I've searched SO but haven't found an answer that works for me.

Thank you.

Upvotes: 1

Views: 352

Answers (1)

m.cekiera

Reputation: 5385

If each string always starts and ends with ", you can try this Java regex:

(?<=,\s{0,99}"|(?!\A)\G)[^"]+|(?<=(?!\A)\G|")(")(?!\s*[,\n]|$)

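Group 1 captures the invalid quotes; you can get their indices with matcher.start(1) and matcher.end(1). The \s{0,99} inside the lookbehind relies on Java's support for bounded-length lookbehind, so the pattern will work only in Java.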
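To show how group 1 could be used for filtering, here is a minimal sketch. The class name InvalidQuoteFinder and the helpers findInvalidQuotes and isValid are made up for illustration, and it assumes each line is passed in without the trailing "// invalid" comment from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class InvalidQuoteFinder {
    // The regex from this answer, escaped as a Java string literal.
    private static final Pattern INVALID_QUOTE = Pattern.compile(
        "(?<=,\\s{0,99}\"|(?!\\A)\\G)[^\"]+|(?<=(?!\\A)\\G|\")(\")(?!\\s*[,\\n]|$)");

    // Returns the start index of every quote captured by group 1.
    static List<Integer> findInvalidQuotes(String line) {
        List<Integer> indices = new ArrayList<>();
        Matcher matcher = INVALID_QUOTE.matcher(line);
        while (matcher.find()) {
            // Group 1 only participates when the second alternative matched,
            // i.e. when an unescaped quote was found inside a field.
            if (matcher.group(1) != null) {
                indices.add(matcher.start(1));
            }
        }
        return indices;
    }

    // A line is treated as valid when group 1 never matches.
    static boolean isValid(String line) {
        return findInvalidQuotes(line).isEmpty();
    }

    public static void main(String[] args) {
        String[] lines = {
            "11,\"abc\",\"def\"",
            "12,\"ab \"c\"\",\"def\""
        };
        for (String line : lines) {
            System.out.println(line + " -> valid: " + isValid(line)
                + ", invalid quotes at: " + findInvalidQuotes(line));
        }
    }
}
```

You would call isValid on each line and only pass the ones that return true to the parser. This is only a filter built on the regex above; it is not a full CSV validator.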

Upvotes: 1
