notacyborg

Reputation: 115

skip malformed csv row

I have been trying to read a CSV file and add its fields to a data structure. One of the rows is not formed properly, and I am aware of that; I just want to skip that row and move on to the next one. But even though I am catching the exception, it still breaks the loop. Any idea what I am missing here?

My csv:

"id","name","email"
121212,"Steve","[email protected]"
121212,"Steve","[email protected]",,
121212,"Steve","[email protected]"

My code:

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CSVTest {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("list2.csv");
        CsvMapper mapper = new CsvMapper();
        // Use the first row of the file as the column names.
        CsvSchema schema = CsvSchema.emptySchema().withHeader();
        MappingIterator<Object> it = mapper.reader(Object.class)
                .with(schema)
                .readValues(path.toFile());

        try {
            while (it.hasNext()) {
                Object row;
                try {
                    row = it.nextValue();
                } catch (IOException e) {
                    // Intended to skip the malformed row and keep going.
                    e.printStackTrace();
                    continue;
                }
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            e.printStackTrace();
        }
    }
}

Exception:

com.fasterxml.jackson.core.JsonParseException: Too many entries: expected at most 3 (value #3 (0 chars) "")
 at [Source: java.io.InputStreamReader@12b3519c; line: 3, column: 38]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1486)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
    at com.fasterxml.jackson.dataformat.csv.CsvParser._handleNextEntryExpectEOL(CsvParser.java:601)
    at com.fasterxml.jackson.dataformat.csv.CsvParser._handleNextEntry(CsvParser.java:587)
    at com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:474)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.mapObject(UntypedObjectDeserializer.java:592)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.deserialize(UntypedObjectDeserializer.java:440)
    at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:188)
    at CSVTest.main(CSVTest.java:24)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
java.lang.ArrayIndexOutOfBoundsException: 3
    at com.fasterxml.jackson.dataformat.csv.CsvSchema.column(CsvSchema.java:941)
    at com.fasterxml.jackson.dataformat.csv.CsvParser._handleNamedValue(CsvParser.java:614)
    at com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:476)
    at com.fasterxml.jackson.databind.MappingIterator.hasNextValue(MappingIterator.java:158)
    at CSVTest.main(CSVTest.java:21)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

Upvotes: 7

Views: 7598

Answers (4)

Jeronimo Backes

Reputation: 6289

Your CSV is not necessarily malformed; in fact, it's very common to have rows with a varying number of columns.

univocity-parsers handles this without any trouble.

The easiest way would be:

BeanListProcessor<TestBean> rowProcessor = new BeanListProcessor<TestBean>(TestBean.class);

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setRowProcessor(rowProcessor);
parserSettings.setHeaderExtractionEnabled(true);

CsvParser parser = new CsvParser(parserSettings);
parser.parse(new FileReader(Paths.get("list2.csv").toFile()));

// The BeanListProcessor provides a list of objects extracted from the input.
List<TestBean> beans = rowProcessor.getBeans();

If you want to discard the elements built from a row with an inconsistent number of columns, override the beanProcessed method and use the ParsingContext object to analyse your data and decide whether to keep or drop the row.

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Upvotes: 2

StaxMan

Reputation: 116522

As of Jackson 2.6, the handling of readValues() has been improved to try to recover from processing errors, so that in many cases you can simply try again to read the following valid rows. So make sure to use at least version 2.6.2.

Earlier versions did not recover as well, usually rendering the rest of the content unprocessable; this may be what happened in your case.

Another possibility, given that your problem is not invalid CSV but rather CSV that cannot be mapped to POJOs (at least the way the POJO is defined), is to read the content as a sequence of String[] and handle the mapping manually. Jackson's CSV parser itself does not mind any number of columns; it is the higher-level databinding that does not like finding "extra" content that it does not recognize.
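To illustrate the manual-mapping step, here is a minimal sketch that assumes the rows have already been read as String[] (how you obtain them from the parser is left out). ManualRowMapper and mapRows are hypothetical names, not part of Jackson. It treats the first row as the header and skips any row whose column count differs:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Jackson: maps raw String[] rows by hand.
final class ManualRowMapper {

    // Treats the first row as the header and keeps only the rows whose
    // column count matches it; every other row is reported and skipped.
    static List<Map<String, String>> mapRows(List<String[]> rows) {
        List<Map<String, String>> mapped = new ArrayList<>();
        if (rows.isEmpty()) {
            return mapped;
        }
        String[] header = rows.get(0);
        for (String[] row : rows.subList(1, rows.size())) {
            if (row.length != header.length) {
                System.err.println("Skipping row with " + row.length + " columns");
                continue;
            }
            // LinkedHashMap keeps the columns in header order.
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                record.put(header[i], row[i]);
            }
            mapped.add(record);
        }
        return mapped;
    }
}
```

With the file from the question, the five-column row would be dropped and the two three-column rows kept. Whether to drop, truncate, or pad such rows is a design choice; dropping is just the simplest to show.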

Upvotes: 2

dsh

Reputation: 12214

I can't tell for certain since some of the stack trace was omitted, however:

  • If ArrayIndexOutOfBoundsException is the exception that is thrown (as opposed to being a "cause") then the reason is that you catch it outside of your loop.
  • If the exception is a (subclass of) IOException, then as Chris Gerken wrote it may be thrown in it.hasNext(), in which case you don't catch it at all and so your program will exit.

The remainder of the stack trace would indicate which of these, or some other reason altogether, is the problem.



Update based on complete output and stack traces:

On line 24 of CSVTest.java, you call .nextValue(). Inside that method, a JsonParseException is thrown. Since that is a subclass of IOException, your catch block catches it, prints the stack trace, and continues with your loop. So far so good.

com.fasterxml.jackson.core.JsonParseException: Too many entries: expected at most 3 (value #3 (0 chars) "")
 at [Source: java.io.InputStreamReader@12b3519c; line: 3, column: 38]
   at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1486)
   at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
   at com.fasterxml.jackson.dataformat.csv.CsvParser._handleNextEntryExpectEOL(CsvParser.java:601)
   at com.fasterxml.jackson.dataformat.csv.CsvParser._handleNextEntry(CsvParser.java:587)
   at com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:474)
   at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.mapObject(UntypedObjectDeserializer.java:592)
   at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.deserialize(UntypedObjectDeserializer.java:440)
   at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:188)
   at CSVTest.main(CSVTest.java:24)

After that, on line 21 of CSVTest.java, you call .hasNextValue(). Inside that method, an ArrayIndexOutOfBoundsException is thrown. You catch it, and also print the stack trace. However, your catch block is outside of your loop, so by the time you catch the exception the loop has already been exited.

java.lang.ArrayIndexOutOfBoundsException: 3
    at com.fasterxml.jackson.dataformat.csv.CsvSchema.column(CsvSchema.java:941)
    at com.fasterxml.jackson.dataformat.csv.CsvParser._handleNamedValue(CsvParser.java:614)
    at com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:476)
    at com.fasterxml.jackson.databind.MappingIterator.hasNextValue(MappingIterator.java:158)
    at CSVTest.main(CSVTest.java:21)

If you really want to continue your loop here, then you will need to move that try-catch construct inside the loop. Perhaps like this:

while (true)
    {
    try
        {
        if (!it.hasNextValue())
            { break; }
        }
    catch (final ArrayIndexOutOfBoundsException err)
        {
        err.printStackTrace();
        continue;
        }

    Object row;
    try
        { row = it.nextValue(); }
    catch (final IOException err)
        {
        err.printStackTrace();
        continue;
        }
    }

However, this code is an infinite loop. When hasNextValue() throws an ArrayIndexOutOfBoundsException, the state has not changed, so the next call throws again and the loop never ends. I include it only to illustrate the principle of moving the catch block inside the loop, not as a workable resolution.
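For contrast, here is a self-contained toy sketch of the case where catching inside the loop does work. ToySource and SkipBadRecords are hypothetical stand-ins, not Jackson classes. The key assumption is that the source moves past the bad record before it throws, which is exactly what does not happen in the hasNextValue() case above:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a record reader; hypothetical, not Jackson's MappingIterator.
final class ToySource {
    private final String[] records;
    private int position = 0;

    ToySource(String... records) {
        this.records = records;
    }

    boolean hasNext() {
        return position < records.length;
    }

    // Advances BEFORE validating, so a failed read still consumes the bad record.
    String next() {
        String record = records[position++];
        if (record.startsWith("BAD")) {
            throw new IllegalStateException("malformed record: " + record);
        }
        return record;
    }
}

final class SkipBadRecords {
    // The try-catch sits inside the loop, so one failure skips exactly one record.
    static List<String> readAll(ToySource source) {
        List<String> good = new ArrayList<>();
        while (source.hasNext()) {
            try {
                good.add(source.next());
            } catch (IllegalStateException err) {
                System.err.println("Skipping: " + err.getMessage());
            }
        }
        return good;
    }
}
```

If next() threw without incrementing position, this loop would spin forever on the same record, which is the situation described above.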

You added a comment to the question referencing discussion of error handling in jackson-dataformat-csv. It appears that you encountered a limitation (or bug) in the library when it comes to skipping malformed rows.

Upvotes: 0

Chris Gerken

Reputation: 16392

com.fasterxml.jackson.core.JsonParseException is an IOException, so that exception should be caught in the try-catch block. The fact that it is not being caught leads me to believe that it's happening in the hasNext() method. That's a common pattern: in order to know whether there is another element, you actually have to try to read the next one.
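As a rough illustration of that pattern (a hypothetical look-ahead iterator over lines, not Jackson's actual implementation), note that all the reading happens in hasNext(), so that is also where a read failure would surface:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical look-ahead iterator over lines; not Jackson's implementation.
final class LookAheadLines implements Iterator<String> {
    private final BufferedReader reader;
    private String buffered; // line already read by hasNext(), if any

    LookAheadLines(BufferedReader reader) {
        this.reader = reader;
    }

    // To answer "is there another element?" the next line must actually be read,
    // so any I/O error is thrown from here rather than from next().
    @Override
    public boolean hasNext() {
        if (buffered == null) {
            try {
                buffered = reader.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return buffered != null;
    }

    // Hands out the line that hasNext() buffered and clears the buffer.
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String line = buffered;
        buffered = null;
        return line;
    }
}
```

A try-catch placed only around next() would never see an error raised while hasNext() is reading ahead.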

Upvotes: 1
