Emperor

Reputation: 59

Apache Commons CSV formatter: IOException: invalid char between encapsulated token and delimiter

I am trying to parse a CSV file using Apache Commons CSV (commons-csv).

Sample input file

Field1,Field2,Field3,Field4,Field5
"Ryan, R"u"bianes","  [email protected]","29445","626","South delhi, Rohini 122001"

Formatter (CSV_DELIMITER is ,):

    CSVFormat.newFormat(',').withIgnoreEmptyLines().withQuote('"')

Expected output

  1. Field1 value after CSV parsing should be : Ryan, R"u"bianes
  2. Field5 value after CSV parsing should be : South delhi, Rohini 122001

Exception: Caused by: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter

Upvotes: 3

Views: 5315

Answers (2)

Jeronimo Backes

Reputation: 6289

The problem here is that the quotes inside the quoted value are not properly escaped, and Commons CSV doesn't handle that. Try univocity-parsers, as it is the only parser for Java I know of that can handle unescaped quotes inside a quoted value. It is also 4 times faster than Commons CSV. For example:

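    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import com.univocity.parsers.csv.UnescapedQuoteHandling;
    import java.io.StringReader;
    import java.util.List;

    //configure the parser to handle your situation
    CsvParserSettings settings = new CsvParserSettings();
    settings.setHeaderExtractionEnabled(true); //uses first line as headers, so it is not returned as a row
    //an unescaped quote inside a quoted value is kept; parsing continues until the closing quote
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);
    settings.trimQuotedValues(true); //trim whitespace around values in quotes

    //create the parser
    CsvParser parser = new CsvParser(settings);

    String input = "" +
            "Field1,Field2,Field3,Field4,Field5\n" +
            "\"Ryan, R\"u\"bianes\",\"  [email protected]\",\"29445\",\"626\",\"South delhi, Rohini 122001\"";

    //parse your input
    List<String[]> rows = parser.parseAll(new StringReader(input));

    //print the parsed values, one per line, with a separator after each row
    for (String[] row : rows) {
        for (String value : row) {
            System.out.println('[' + value + ']');
        }
        System.out.println("-----");
    }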

This will print:

[Ryan, R"u"bianes]
[[email protected]]
[29445]
[626]
[South delhi, Rohini 122001]
-----

Hope it helps.

Disclosure: I'm the author of this library, it's open source and free (Apache 2.0 license)

Upvotes: 0

Stephen C

Reputation: 718916

The problem is that your file does not follow the accepted standard for quoting in CSV files. The correct way to represent a quote inside a quoted string is to repeat the quote. For example:

Field1,Field2,Field3,Field4,Field5
"Ryan, R""u""bianes","  [email protected]","29445","626","South delhi, Rohini 122001"

If you restrict yourself to the standard form of CSV quoting, the Apache Commons CSV parser should work.
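If you control the code that writes the file, producing the standard form is straightforward. The following is only a minimal stdlib-only sketch of the doubling rule; the class and method names (Rfc4180Writer, quote, toRecord) are made up for illustration and are not part of Commons CSV, and it quotes every field unconditionally for simplicity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class Rfc4180Writer {
    // Wrap a field in double quotes and double every embedded double quote.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join raw (unescaped) field values into one comma-separated record.
    static String toRecord(List<String> fields) {
        return fields.stream()
                .map(Rfc4180Writer::quote)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList(
                "Ryan, R\"u\"bianes", "29445", "South delhi, Rohini 122001");
        System.out.println(toRecord(fields));
    }
}
```

Run on the values from the question, this prints a record whose first field is "Ryan, R""u""bianes", which is the form a standard parser expects.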

Unfortunately, it is not feasible to write a consistent parser for your variant format, because there is no way to disambiguate an embedded comma from a field separator if you need to represent a field containing "Ryan R","baines".
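To make the ambiguity concrete, here is a rough sketch of the obvious heuristic: inside a quoted field, treat a quote as the closing quote only when a comma or the end of the line follows it. LenientQuoteSplitter is a hypothetical helper written for this answer, not an API of any library, and it assumes a single line with no embedded newlines.

```java
import java.util.ArrayList;
import java.util.List;

class LenientQuoteSplitter {
    // Split one line on commas. Inside a quoted field, a quote only closes the
    // field when it is followed by a comma or by the end of the line.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (i <= n) {
            StringBuilder field = new StringBuilder();
            if (i < n && line.charAt(i) == '"') {
                i++; // skip the opening quote
                while (i < n) {
                    char c = line.charAt(i);
                    boolean closing = c == '"'
                            && (i + 1 == n || line.charAt(i + 1) == ',');
                    i++;
                    if (closing) {
                        break;
                    }
                    field.append(c); // unescaped quotes are kept as data
                }
            } else {
                while (i < n && line.charAt(i) != ',') {
                    field.append(line.charAt(i));
                    i++;
                }
            }
            fields.add(field.toString());
            i++; // skip the delimiter, or step past the end of the line
        }
        return fields;
    }

    public static void main(String[] args) {
        // The row from the question: the first field keeps its inner quotes.
        System.out.println(split("\"Ryan, R\"u\"bianes\",\"29445\""));
        // One intended field with the value  Ryan R","baines  written unescaped.
        System.out.println(split("\"Ryan R\",\"baines\""));
    }
}
```

The row from the question comes out as intended, but the second input, meant as a single field, is returned as two fields (Ryan R and baines). No heuristic can tell those two readings apart, which is why the doubled-quote form is needed.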

The rules for quoting in CSV files are set out in various places including RFC 4180.

Upvotes: 3
