Alex Averbuch

Reputation: 3325

Can Jackson parse CSV files in which different rows have different schemas?

Is it possible to efficiently parse a CSV file using the Jackson jackson-dataformat-csv library (CsvSchema, CsvMapper, etc.) when different rows of the file have different schemas?

I emphasize efficiently because I have very large files (>100,000,000 rows) to parse and the application is performance sensitive. If a new Object/String is instantiated for every column in every row, the GC will disown me. I want primitives wherever possible, e.g., for 31 to be returned as an int.

If so, what is the recommended approach?

FYI, the file schema is like this: ROW_TYPE|.... That is, the first column of every row denotes the row type, and for a given row type the schema is always the same. The rest of the columns differ between rows, depending on the row type. E.g.:

1|"text1a"|2|3|4|true|"text2a"
2|3|"text"
1|"text1b"|5|6|7|false|"text2b"

At the moment I use the neo4j-csv library.

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-csv</artifactId>
    <version>2.2-SNAPSHOT</version>
</dependency>

It is extremely performant and creates very little garbage. It also supports reading entries column by column, with the type specified at every read; this is more involved, but more flexible. For reference, the usage is something like this:

// do once per file
CharSeeker charSeeker = new BufferedCharSeeker(..., bufferSize);
int[] columnDelimiters = new int[] { '|' };
Extractors extractors = new Extractors();
Mark mark = new Mark();

// do repeatedly while parsing
charSeeker.seek(mark, columnDelimiters);
int eventType = charSeeker.extract(mark, extractors.int_()).intValue();

switch (eventType) {
    case 1: // parse row type 1
            break;
    case 2: // parse row type 2
            break;
    // ... other row types
}

The reason I'm considering switching is that I'd like to cut down on project dependencies, and as I already use Jackson for JSON it makes sense to use it for CSV too (performance/features permitting).

Upvotes: 3

Views: 2593

Answers (1)

StaxMan

Reputation: 116522

Although Jackson does not have automated support for switching between CsvSchemas on a per-row basis (which suggests you'd need two-phase processing: first read or bind each row as String[], then use ObjectMapper.convertValue()), it might be possible to use the existing support for polymorphic deserialization. This would rely on some commonality in column naming, so I don't know whether it is realistic or not.

Assuming it'd work, you would need a base class with a property that matches the logical name of the first column; and then subtypes with similarly matching property names. You would use @JsonTypeInfo on the base class, with 'name' as the type id; and either use @JsonTypeName on the sub-classes, or refer to them from the base class with the @JsonSubTypes annotation. That is, the usual Jackson configuration, as in the sketch below.
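
For what it's worth, the wiring might look like the following. The names (BaseRow, Type1Row, Type2Row, rowType) are hypothetical, chosen to match the sample rows in the question; whether CsvMapper can actually resolve per-row schemas this way is exactly the caveat above.

import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// type id comes from a named property ("rowType"), i.e. the first column
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME,
              include = JsonTypeInfo.As.PROPERTY,
              property = "rowType")
@JsonSubTypes({
    @JsonSubTypes.Type(value = Type1Row.class, name = "1"),
    @JsonSubTypes.Type(value = Type2Row.class, name = "2")
})
abstract class BaseRow { }

// shape of: 1|"text1a"|2|3|4|true|"text2a"
class Type1Row extends BaseRow {
    public String text1;
    public int a, b, c;
    public boolean flag;
    public String text2;
}

// shape of: 2|3|"text"
class Type2Row extends BaseRow {
    public int x;
    public String text;
}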

If this does not work, two-phase processing may not be a bad choice; a sketch follows below. It would result in all cell values being read as distinct objects, but as long as they are not retained (that is, you only keep one row's worth of data in memory), short-term garbage is usually not that problematic for the GC (long-term garbage is the expensive kind).
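
A rough sketch of that two-phase flow, assuming a reasonably recent jackson-dataformat-csv. The file name rows.csv and the row classes Row1/Row2 are made up; note they are plain POJOs here (no type-info annotations), marked with class-level @JsonFormat(shape = JsonFormat.Shape.ARRAY) plus @JsonPropertyOrder so convertValue can bind the cells positionally:

import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvParser;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

import java.io.File;
import java.util.Arrays;

// binds positionally from an array of cells: "text1a"|2|3|4|true|"text2a"
@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "text1", "a", "b", "c", "flag", "text2" })
class Row1 {
    public String text1;
    public int a, b, c;
    public boolean flag;
    public String text2;
}

// binds positionally from: 3|"text"
@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "x", "text" })
class Row2 {
    public int x;
    public String text;
}

public class TwoPhase {
    public static void main(String[] args) throws Exception {
        CsvMapper csvMapper = new CsvMapper();
        // phase 1: expose every line as a raw String[] instead of a POJO
        csvMapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
        CsvSchema schema = CsvSchema.emptySchema().withColumnSeparator('|');

        ObjectMapper mapper = new ObjectMapper();
        try (MappingIterator<String[]> rows = csvMapper
                .readerFor(String[].class)
                .with(schema)
                .readValues(new File("rows.csv"))) {
            while (rows.hasNext()) {
                String[] row = rows.next();
                // phase 2: dispatch on the first cell, convert the rest
                String[] rest = Arrays.copyOfRange(row, 1, row.length);
                switch (row[0]) {
                    case "1":
                        Row1 r1 = mapper.convertValue(rest, Row1.class);
                        // ... handle row type 1
                        break;
                    case "2":
                        Row2 r2 = mapper.convertValue(rest, Row2.class);
                        // ... handle row type 2
                        break;
                }
            }
        }
    }
}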

Upvotes: 1
