shantanuo

Reputation: 32326

Find non-valid lines from a csv file

The quotation marks do not match in the following file.

# cat t123.txt
"first", "second", "and last
"second", "line", "ok"
"third", "line", "not, "ok"

Only the second line is OK. How do I find the first and third lines, which do not have consistent quotation marks?

I tried the following regex, based on an article that I found, but it does not return the expected results:

https://regex101.com/r/nhDKA2/4

Upvotes: 0

Views: 34

Answers (1)

Peter Thoeny

Reputation: 7616

Strictly speaking, even your second line is not standard CSV, because standard CSV does not support a space after the comma.

You can use this regex to test for valid lines based on your CSV spec:

^(?="[^"]*(", "[^"]*)*"$).*"$

  • ^(?= ... ) - positive lookahead at the beginning for:
    • "[^"]* - one quote, and anything non-quote
    • (", "[^"]*)* - zero or more patterns of ", "...
    • "$ - expect " at the end
  • .*"$ - whole pattern must end in "
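
To see how the pattern behaves on the file from the question, here is a minimal Python sketch. The function name invalid_lines and the in-memory sample list are assumptions made for illustration; only the regex itself comes from this answer. It reports every line that the pattern rejects, together with its line number.

```python
import re

# The regex from this answer, unchanged.
VALID_LINE = re.compile(r'^(?="[^"]*(", "[^"]*)*"$).*"$')


def invalid_lines(lines):
    """Return (line_number, text) pairs for lines the regex rejects.

    Hypothetical helper, not part of the original answer.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")  # drop the trailing newline, if any
        if not VALID_LINE.match(text):
            bad.append((number, text))
    return bad


# The three lines of t123.txt from the question.
sample = [
    '"first", "second", "and last',
    '"second", "line", "ok"',
    '"third", "line", "not, "ok"',
]

for number, text in invalid_lines(sample):
    print(number, text)
```

To check a real file, the same helper can be fed an open file object, for example invalid_lines(open("t123.txt")), since iterating over a file yields its lines.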

Notes on this regex:

  • it supports one to many cells
  • it does not handle escaped quotes within a cell, such as "this is a ""quote"" in a cell"
  • it does not support quote-less cells, such as the 99 in "foo",99,"bar", which is valid in CSV
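
The notes above can be checked with a short Python sketch. The helper name is_valid is an assumption for illustration; the test strings are the examples from the notes. It shows that a single-cell line is accepted, while the escaped-quote and quote-less examples are rejected even though they are valid CSV.

```python
import re

# The regex from this answer, unchanged.
VALID_LINE = re.compile(r'^(?="[^"]*(", "[^"]*)*"$).*"$')


def is_valid(line):
    """Return True if the line matches the regex (hypothetical helper)."""
    return VALID_LINE.match(line) is not None


examples = [
    '"a"',                               # a single cell
    '"this is a ""quote"" in a cell"',   # escaped quotes within a cell
    '"foo",99,"bar"',                    # quote-less cell, no spaces
]

for line in examples:
    print(is_valid(line), line)
```

If the input may contain such lines, a regex of this shape is probably not enough, and a real CSV parser would be the safer choice.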

Upvotes: 1
