Reputation: 1414
We are processing csv files which contain lines with non-closed double quoted entries. These blow up the csv parser, so I am trying to put together a regex which will identify these lines so we can delete them from the files before trying to process them.
In the following example, the csv parser reaches line 2 and reads everything up to the first double quote in line 3 as a single token. It then blows up when it tries to close the token, because there are non-whitespace characters between the "closing" double quote and the next comma.
Example Line 1,some data,"good line",processes fine,happy
Example Line 2,some data,"bad line,processes poorly,unhappy
Example Line 3,some data,"good line",dies before here,unhappy
I am trying to do something like:
.*,"[^(",)]*[\r\n]
The idea is finding a single line with anything followed by ," without an instance of ", which follows before the line ends.
The negation of the sequence is not working though. How is something like this done?
NOTE:
Since people keep suggesting essentially checking for an even number of double quotes, it's worth noting that a single double-quoted csv entry could contain a standalone double quote (e.g. ...,"Measurement: 1' 2"",...).
Upvotes: 1
Views: 3874
Reputation: 6289
Regex doesn't really work reliably for that as there are many edge cases. You should try univocity-parsers as it is the only CSV parser I know that handles unescaped quotes properly.
It gives you the following options:
STOP_AT_CLOSING_QUOTE - If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.
STOP_AT_DELIMITER - If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until a delimiter or line ending is found in the input.
SKIP_VALUE - If unescaped quotes are found in the input, the content parsed for that value is skipped (until the next delimiter is found), producing a null value instead.
RAISE_ERROR - Throws an exception if unescaped quotes are found in the input
Use it like this:
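import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.UnescapedQuoteHandling;
import java.util.Arrays;
// Treat a value containing an unescaped quote as unquoted, reading up to the next delimiter.
CsvParserSettings settings = new CsvParserSettings();
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER);
CsvParser parser = new CsvParser(settings);
// input is the CSV source, e.g. a File or a Reader.
for (String[] row : parser.iterate(input)) {
    System.out.println(Arrays.toString(row));
}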
Hope it helps. By default it runs with the STOP_AT_DELIMITER setting.
Disclaimer: I'm the author of this library. It's open-source and free (Apache 2.0 license)
Upvotes: 0
Reputation: 1414
Someone else gave me the framework for this regex, and it ended up working after a few modifications:
This will match anything followed by ," with or without spaces in between, not followed eventually by a ", (also with potential white space) and finally ending in a newline.
.*,[\s]*"(?!.*"[\s]*,).*\n
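As a rough illustration of how this pattern might be applied from Java, the sketch below tests each line separately. It assumes the file has already been split into lines, so the trailing \n is dropped and matches() is used instead; the class and method names are made up for the example. Like the pattern above, it relies on a closing quote being followed by a comma, so a line whose last field is quoted would also be flagged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UnclosedQuoteFilter {
    // Same pattern as above, minus the trailing \n because lines are tested one at a time.
    private static final Pattern UNCLOSED =
            Pattern.compile(".*,\\s*\"(?!.*\"\\s*,).*");

    // True when the line has a ," opener with no ", closer after it.
    static boolean hasUnclosedQuote(String line) {
        return UNCLOSED.matcher(line).matches();
    }

    // Returns only the lines that the pattern does not flag.
    static List<String> keepGoodLines(List<String> lines) {
        List<String> good = new ArrayList<>();
        for (String line : lines) {
            if (!hasUnclosedQuote(line)) {
                good.add(line);
            }
        }
        return good;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Example Line 1,some data,\"good line\",processes fine,happy",
                "Example Line 2,some data,\"bad line,processes poorly,unhappy",
                "Example Line 3,some data,\"good line\",dies before here,unhappy");
        // Prints lines 1 and 3; line 2 is dropped.
        keepGoodLines(lines).forEach(System.out::println);
    }
}
```

A line such as ...,"Measurement: 1' 2"",... is not flagged here, because a ", sequence still follows the opening quote.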
Upvotes: 0
Reputation: 41838
With your current requirements (including your concern about "Measurement: 1' 2""), this will select the bad lines:
^.*(?:^|,)[^",]*"(?:[^",]*(?:"[^",]*")?)+(?:$|,.*)
^ anchors at the top of the string
.*(?:^|,) eats up any characters up to the top of the string or a comma
[^",]*(?:"[^",]*")? matches characters that are neither a " nor a comma, and, optionally, a balanced set of quotes: "[^",]*"
A note about escaped double quotes
You may have, in your input, double-quoted strings that contain an escaped double quote, like this: "abc\"de"
If so, we need to replace our expression for double-quoted strings (?:"[^",]*") with something more solid: (?:"(?:\\"|[^"])*")
Hence the whole regex would become:
^.*(?:^|,)[^",]*"(?:[^",]*(?:"(?:\\"|[^"])*")?)+(?:$|,.*)
Upvotes: 1
Reputation: 30995
You can use:
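// Count the double quotes by comparing the length before and after removing them.
int count = str.length() - str.replace("\"", "").length();
if (count % 2 == 0) {
    // even number of quotes: do what you want
}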
Upvotes: 1
Reputation: 20899
Something like this should work:
^[^"]*("[^"]*"[^"]*)*[^"]*$
The [^"]* that you see repeated all over the place means "any number of non-quote characters".
The ("[^"]*"[^"]*)* will match paired quotes, while the [^"]*s will match the unquoted text before and after the final quotes.
The ^ and $ anchors ensure that we're matching the whole line, not just a portion of it.
Essentially: if there's an even number of quotes it will match. If there is an odd number of quotes, it will fail.
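For illustration, here is a minimal Java sketch of that check. The class and method names are invented for the example, and the third sample line is an assumed variation on the question's "Measurement" case, included to show how a standalone inner quote affects the result.

```java
import java.util.regex.Pattern;

public class BalancedQuoteCheck {
    // The regex from above, escaped for a Java string literal.
    private static final Pattern BALANCED =
            Pattern.compile("^[^\"]*(\"[^\"]*\"[^\"]*)*[^\"]*$");

    // True when the line contains an even number of double quotes.
    static boolean hasBalancedQuotes(String line) {
        return BALANCED.matcher(line).matches();
    }

    public static void main(String[] args) {
        String good = "Example Line 1,some data,\"good line\",processes fine,happy";
        String bad = "Example Line 2,some data,\"bad line,processes poorly,unhappy";
        String inch = "Example,\"Measurement: 1' 2\"\",more";

        System.out.println(hasBalancedQuotes(good)); // true
        System.out.println(hasBalancedQuotes(bad));  // false
        System.out.println(hasBalancedQuotes(inch)); // false: three quotes in total
    }
}
```

The last case shows the limitation raised in the question: a quoted field holding a standalone double quote has an odd count, so this check rejects it as well.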
If whatever solution you're working in has the option, there's a much simpler method that doesn't involve regular expressions. Simply count the number of double quotes in the CSV line. If it's odd, the line has a mismatched quote.
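A possible Java version of that count, as a sketch only (the class and method names are hypothetical):

```java
public class QuoteCounter {
    // Counts the double-quote characters in one CSV line.
    static long countQuotes(String line) {
        return line.chars().filter(c -> c == '"').count();
    }

    // True when the count is odd, i.e. some quote has no partner.
    static boolean hasOddQuoteCount(String line) {
        return countQuotes(line) % 2 != 0;
    }

    public static void main(String[] args) {
        String good = "Example Line 1,some data,\"good line\",processes fine,happy";
        String bad = "Example Line 2,some data,\"bad line,processes poorly,unhappy";

        System.out.println(hasOddQuoteCount(good)); // false
        System.out.println(hasOddQuoteCount(bad));  // true
    }
}
```

This gives the same answer as the regex for every line, so it shares the same behavior on fields that contain a standalone double quote.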
Upvotes: 0