Reputation: 3325
Is it possible to efficiently parse a CSV file using the Jackson jackson-dataformat-csv library (CsvSchema, CsvMapper, etc.) when different rows of that file have different schemas?
I emphasize efficiently because I have very large files (>100,000,000 rows) to parse and the application is performance-sensitive. If a new Object/String is instantiated for every column in every row, the GC will disown me. I want primitives whenever possible, e.g., 31 to be returned as an int.
If so, what is the recommended approach?
FYI, the file schema is like this: ROW_TYPE|... . That is, the first column of every row denotes the row type, and for a given row type the schema is always the same. The remaining columns differ between rows, depending on the row type. E.g.:
1|"text1a"|2|3|4|true|"text2a"
2|3|"text"
1|"text1b"|5|6|7|false|"text2b"
At the moment I use the neo4j-csv library.
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-csv</artifactId>
    <version>2.2-SNAPSHOT</version>
</dependency>
It is extremely performant and creates very little garbage. It also supports reading entries column by column, with the type specified at each read, which is more involved but more flexible. For reference, the usage is something like this:
// do once per file
CharSeeker charSeeker = new BufferedCharSeeker(..., bufferSize);
int columnDelimiter = '|';
Extractors extractors = new Extractors();
Mark mark = new Mark();

// do repeatedly while parsing
charSeeker.seek(mark, columnDelimiter);
int rowType = charSeeker.extract(mark, extractors.int_()).intValue();
switch (rowType) {
    case 1: // parse row type 1
        break;
    case 2: // parse row type 2
        break;
    ...
}
The reason I'm considering switching is that I'd like to cut down on project dependencies, and since I already use Jackson for JSON it makes sense to use it for CSV as well (performance/features permitting).
Upvotes: 3
Views: 2593
Reputation: 116522
Although Jackson does not have automated support for switching between CsvSchemas on a per-row basis (which suggests you'd need two-phase processing: first read or bind each row as String[], then use ObjectMapper.convertValue()), it might be possible to use the existing support for polymorphic deserialization. This would rely on some commonality in column naming, so I don't know whether it is realistic.
Assuming it'd work, you would need a base class with a property that matches the logical name of the first column, and then subtypes with similarly matching property names.
You would use @JsonTypeInfo on the base class, with 'name' as the type id, and either use @JsonTypeName on the subclasses or refer to them from the base class with @JsonSubTypes annotations.
That is, use the usual Jackson configuration.
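If the naming lines up, a minimal sketch of that setup could look like the following. All class and property names here (BaseRow, Row1, Row2, rowType) are hypothetical, and whether the CSV module can actually feed the first column in as the rowType type id for every row shape is exactly the open question above.

import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// Base class: the logical 'rowType' property doubles as the polymorphic type id.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME,
              include = JsonTypeInfo.As.PROPERTY,
              property = "rowType")
// Alternatively, omit the names here and put @JsonTypeName("1") etc. on each subclass.
@JsonSubTypes({
    @JsonSubTypes.Type(value = Row1.class, name = "1"),
    @JsonSubTypes.Type(value = Row2.class, name = "2")
})
abstract class BaseRow { }

// Field layouts match the example rows from the question.
class Row1 extends BaseRow {
    public String text1;
    public int a, b, c;
    public boolean flag;
    public String text2;
}

class Row2 extends BaseRow {
    public int value;
    public String text;
}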
If this does not work, two-phase processing may not be a bad choice. It would result in all cell values being read as distinct objects, but as long as they are not retained (that is, you only keep one row's worth of data in memory at a time), short-term garbage usually is not that problematic for the GC (long-term garbage is the expensive kind).
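For the two-phase route, a sketch along these lines might work, assuming pipe-separated input with no header row. Binding each row to String[] via WRAP_AS_ARRAY is standard CSV-module usage; using @JsonFormat array shape so convertValue() can bind the remaining cells positionally is my own addition, not something the answer specifies. The file name and row classes are hypothetical.

import java.io.File;
import java.util.Arrays;
import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvParser;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

// Array shape lets convertValue() bind a String[] to fields positionally,
// in the order given by @JsonPropertyOrder.
@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "text1", "a", "b", "c", "flag", "text2" })
class Row1 {
    public String text1;
    public int a, b, c;
    public boolean flag;
    public String text2;
}

@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "value", "text" })
class Row2 {
    public int value;
    public String text;
}

public class TwoPhase {
    public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        // Expose each CSV row as an array so it can bind to String[].
        mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
        CsvSchema schema = CsvSchema.emptySchema().withColumnSeparator('|');

        // Phase 1: read every row as a raw String[].
        MappingIterator<String[]> rows = mapper
                .readerFor(String[].class)
                .with(schema)
                .readValues(new File("data.csv")); // hypothetical file name

        while (rows.hasNextValue()) {
            String[] row = rows.nextValue();
            // Phase 2: dispatch on the row-type column, convert the rest.
            String[] rest = Arrays.copyOfRange(row, 1, row.length);
            switch (row[0]) {
                case "1":
                    Row1 r1 = mapper.convertValue(rest, Row1.class);
                    break;
                case "2":
                    Row2 r2 = mapper.convertValue(rest, Row2.class);
                    break;
            }
        }
    }
}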
Upvotes: 1