Belizzle

Reputation: 1414

Regex to find missing double quote in csv

We are processing csv files which contain lines with non-closed double quoted entries. These blow up the csv parser, so I am trying to put together a regex which will identify these lines so we can delete them from the files before trying to process them.

In the following example, the csv parser gets to line 2 and includes everything up to the first double quote in line 3 before trying to close out the token and then blows up because there are non-whitespace characters after the "closing" double quote before the next comma.

Example Line 1,some data,"good line",processes fine,happy

Example Line 2,some data,"bad line,processes poorly,unhappy

Example Line 3,some data,"good line",dies before here,unhappy

I am trying to do something like:

.*,"[^(",)]*[\r\n]

The idea is to find a single line that contains ," and has no ", anywhere after it before the line ends.

The negation of the sequence is not working though. How is something like this done?

NOTE:

Since people keep suggesting essentially checking for an even number of double quotes, it's worth noting that a single double-quoted csv entry could contain a standalone double quote (e.g. ...,"Measurement: 1' 2"",...).

Upvotes: 1

Views: 3874

Answers (5)

Jeronimo Backes

Reputation: 6289

Regex doesn't really work reliably for that as there are many edge cases. You should try univocity-parsers as it is the only CSV parser I know that handles unescaped quotes properly.

It gives you the following options:

  • STOP_AT_CLOSING_QUOTE - If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.

  • STOP_AT_DELIMITER - If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until a delimiter or line ending is found in the input.

  • SKIP_VALUE - If unescaped quotes are found in the input, the content parsed for that value is skipped until the next delimiter is found, producing a null.

  • RAISE_ERROR - Throws an exception if unescaped quotes are found in the input

Use it like this:

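import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.UnescapedQuoteHandling;
import java.util.Arrays;

CsvParserSettings settings = new CsvParserSettings();
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER);

CsvParser parser = new CsvParser(settings);
// input is the CSV source, e.g. a java.io.File or java.io.Reader
for (String[] row : parser.iterate(input)) {
    System.out.println(Arrays.toString(row));
}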

Hope it helps. By default it runs with the STOP_AT_DELIMITER setting.

Disclaimer: I'm the author of this library. It's open-source and free (Apache 2.0 license)

Upvotes: 0

Belizzle

Reputation: 1414

This was a regex someone else gave me the framework for that ended up working with a few modifications:

This matches any line that contains ," (with or without whitespace between the comma and the quote) where no ", (again with optional whitespace) appears later on the line, and that ends in a newline.

.*,[\s]*"(?!.*"[\s]*,).*\n
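As a rough illustration, the pattern above could be applied to a whole file's text in Java like this. The class and method names are made up for the example, and the sketch assumes the text uses \n line endings, since Java's . does not match \r.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UnclosedQuoteLineRemover {
    // The regex from this answer, escaped for a Java string literal.
    static final Pattern BAD_LINE =
            Pattern.compile(".*,[\\s]*\"(?!.*\"[\\s]*,).*\\n");

    // Deletes every line (including its trailing \n) that the pattern matches.
    static String removeBadLines(String csvText) {
        return BAD_LINE.matcher(csvText).replaceAll("");
    }

    public static void main(String[] args) {
        String csv =
                "Example Line 1,some data,\"good line\",processes fine,happy\n"
              + "Example Line 2,some data,\"bad line,processes poorly,unhappy\n"
              + "Example Line 3,some data,\"good line\",dies before here,unhappy\n";
        // Prints the text with the lines the pattern flagged removed.
        System.out.print(removeBadLines(csv));
    }
}
```

One thing to watch for: a line whose last field is quoted has no ", after the opening quote, so this pattern would match that line as well.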

Upvotes: 0

zx81

Reputation: 41838

With your current requirements (including your concern about "Measurement: 1' 2""), this will select the bad lines:

^.*(?:^|,)[^",]*"(?:[^",]*(?:"[^",]*")?)+(?:$|,.*)
  1. The ^ anchors at the top of the string
  2. The .*(?:^|,) eats up any characters up to the top of the string or a comma
  3. We match a "...
  4. and, once or more times, [^",]*(?:"[^",]*")? matches characters that are neither a " nor a comma, and, optionally, a balanced set of quotes: "[^",]*"
  5. We either match the end of the string, or a comma and anything that follows
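The steps above can be tried out line by line in Java. This is only a sketch: the class and method names are invented for the example, it assumes each line has already been split off without its line terminator, and it only exercises the three sample lines from the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UnclosedQuoteDetector {
    // The first regex from this answer, escaped for a Java string literal.
    static final Pattern BAD_LINE = Pattern.compile(
            "^.*(?:^|,)[^\",]*\"(?:[^\",]*(?:\"[^\",]*\")?)+(?:$|,.*)");

    // True when the pattern finds a match in the line, i.e. selects it as bad.
    static boolean isBadLine(String line) {
        return BAD_LINE.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "Example Line 1,some data,\"good line\",processes fine,happy",
            "Example Line 2,some data,\"bad line,processes poorly,unhappy",
            "Example Line 3,some data,\"good line\",dies before here,unhappy"
        };
        for (String line : lines) {
            System.out.println(isBadLine(line) + " <- " + line);
        }
    }
}
```

The nested quantifiers in the pattern can backtrack heavily, so it may be worth timing it on long lines before running it over large files.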

A note about escaped double quotes

You may have, in your input, double-quoted strings that contain an escaped double quote, like this: "abc\"de". If so, we need to replace our expression for double-quoted strings (?:"[^",]*") with something more solid: (?:"(?:\\"|[^"])*")

Hence the whole regex would become:

^.*(?:^|,)[^",]*"(?:[^",]*(?:"(?:\\"|[^"])*")?)+(?:$|,.*)

Upvotes: 1

Federico Piazza

Reputation: 30995

You can use:

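// Count the double quotes by comparing the length before and after removing them
int count = str.length() - str.replaceAll("\"", "").length();

if (count % 2 == 0) {
    // even number of quotes: do what you want
}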

Upvotes: 1

Mr. Llama

Reputation: 20899

Something like this should work:

^[^"]*("[^"]*"[^"]*)*[^"]*$

The [^"]* that you see repeated all over the place means "any number of non-quote characters".
The ("[^"]*"[^"]*)* will match paired quotes while the [^"]*s will match the unquoted text before and after the final quotes.
The ^ and $ anchors ensure that we're matching the whole line, not just a portion of it.

Essentially: if there's an even number of quotes it will match. If there is an odd number of quotes, it will fail.
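For illustration, here is how that pattern might be used in Java. The class and method names are invented for this sketch, and it assumes each line is tested on its own, without its line terminator.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class BalancedQuoteCheck {
    // The regex from this answer, escaped for a Java string literal.
    static final Pattern EVEN_QUOTES =
            Pattern.compile("^[^\"]*(\"[^\"]*\"[^\"]*)*[^\"]*$");

    // True when the whole line matches, i.e. it has an even number of quotes.
    static boolean hasBalancedQuotes(String line) {
        return EVEN_QUOTES.matcher(line).matches();
    }

    public static void main(String[] args) {
        String good = "Example Line 1,some data,\"good line\",processes fine,happy";
        String bad = "Example Line 2,some data,\"bad line,processes poorly,unhappy";
        System.out.println(hasBalancedQuotes(good)); // true: two quotes
        System.out.println(hasBalancedQuotes(bad));  // false: one quote
    }
}
```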



If whatever solution you're working in has the option, there's a much simpler method that doesn't involve regular expressions. Simply count the number of double quotes in the CSV line. If it's odd, the line has a mismatched quote.
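A minimal sketch of that counting approach in Java, with class and method names made up for the example:

```java
// Hypothetical helper, not part of any library.
final class QuoteCounter {
    // Counts the double-quote characters in a single CSV line.
    static int countQuotes(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    // An odd count is treated as a mismatched quote.
    static boolean hasOddQuoteCount(String line) {
        return countQuotes(line) % 2 != 0;
    }

    public static void main(String[] args) {
        String bad = "Example Line 2,some data,\"bad line,processes poorly,unhappy";
        System.out.println(hasOddQuoteCount(bad)); // true: one quote
    }
}
```

Keep in mind the case raised in the question: a field such as "Measurement: 1' 2"" contains three quote characters, so a line holding it would also be reported as mismatched by this check.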

Upvotes: 0
