Reputation: 3325
Is it possible to efficiently parse a CSV file using the Jackson jackson-dataformat-csv library (CsvSchema, CsvMapper, etc.) when different rows of that file have different schemas?
I emphasize efficiently because I have very large files (>100,000,000 rows) to parse and the application is performance-sensitive. If a new Object/String is instantiated for every column in every row, the GC will disown me. I want primitives whenever possible, e.g., 31 to be returned as an int.
If so, what is the recommended approach?
FYI, the file schema is like this: ROW_TYPE|... . That is, the first column of every row denotes the row type, and for a given row type the schema is always the same. The remaining columns differ between rows, depending on the row type. E.g.:
1|"text1a"|2|3|4|true|"text2a"
2|3|"text"
1|"text1b"|5|6|7|false|"text2b"
At the moment I use the neo4j-csv library.
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-csv</artifactId>
    <version>2.2-SNAPSHOT</version>
</dependency>
It is extremely performant and creates very little garbage. It also supports reading entries column by column, with the type specified at each read, which is more involved but more flexible. For reference, the usage is something like this:
// do once per file
CharSeeker charSeeker = new BufferedCharSeeker(..., bufferSize);
int columnDelimiter = '|';
Extractors extractors = new Extractors();
Mark mark = new Mark();

// do repeatedly while parsing
charSeeker.seek(mark, columnDelimiter);
int rowType = charSeeker.extract(mark, extractors.int_()).intValue();
switch (rowType) {
    case 1: // parse row type 1
        break;
    case 2: // parse row type 2
        break;
    ...
}
The reason I'm considering switching is that I'd like to cut down on project dependencies, and since I already use Jackson for JSON it makes sense to use it for CSV as well (performance/features permitting).
Upvotes: 3
Views: 2593
Reputation: 116522
Although Jackson does not have automated support for switching between CsvSchemas on a per-row basis (which suggests you'd need two-phase processing: first read or bind each row as String[], then use ObjectMapper.convertValue()), it might be possible to use the existing support for polymorphic deserialization. This would rely on some commonality in column naming, so I don't know whether it is realistic.
Assuming it'd work, you would need a base class with a property that matches the logical name of the first column, and then subtypes with similarly matching property names.
You would use @JsonTypeInfo on the base class, with 'name' as the type id, and either use @JsonTypeName on the subclasses or refer to them from the base class with @JsonSubTypes annotations.
That is, use the usual Jackson configuration.
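If the naming lines up, a minimal sketch of that setup could look like the following. All class and property names here (BaseRow, Row1, Row2, rowType) are hypothetical, and whether the CSV module can actually feed the first column in as the rowType type id for every row shape is exactly the open question above.

import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// Base class: the logical 'rowType' property doubles as the polymorphic type id.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME,
              include = JsonTypeInfo.As.PROPERTY,
              property = "rowType")
// Alternatively, omit the names here and put @JsonTypeName("1") etc. on each subclass.
@JsonSubTypes({
    @JsonSubTypes.Type(value = Row1.class, name = "1"),
    @JsonSubTypes.Type(value = Row2.class, name = "2")
})
abstract class BaseRow { }

// Field layouts match the example rows from the question.
class Row1 extends BaseRow {
    public String text1;
    public int a, b, c;
    public boolean flag;
    public String text2;
}

class Row2 extends BaseRow {
    public int value;
    public String text;
}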
If this does not work, two-phase processing may not be a bad choice. It would result in all cell values being read as distinct objects, but as long as they are not retained (that is, you only keep one row's worth of data in memory at a time), short-term garbage usually is not that problematic for the GC (long-term garbage is the expensive kind).
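For the two-phase route, a sketch along these lines might work, assuming pipe-separated input with no header row. Binding each row to String[] via WRAP_AS_ARRAY is standard CSV-module usage; using @JsonFormat array shape so convertValue() can bind the remaining cells positionally is my own addition, not something the answer specifies. The file name and row classes are hypothetical.

import java.io.File;
import java.util.Arrays;
import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvParser;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

// Array shape lets convertValue() bind a String[] to fields positionally,
// in the order given by @JsonPropertyOrder.
@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "text1", "a", "b", "c", "flag", "text2" })
class Row1 {
    public String text1;
    public int a, b, c;
    public boolean flag;
    public String text2;
}

@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "value", "text" })
class Row2 {
    public int value;
    public String text;
}

public class TwoPhase {
    public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        // Expose each CSV row as an array so it can bind to String[].
        mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
        CsvSchema schema = CsvSchema.emptySchema().withColumnSeparator('|');

        // Phase 1: read every row as a raw String[].
        MappingIterator<String[]> rows = mapper
                .readerFor(String[].class)
                .with(schema)
                .readValues(new File("data.csv")); // hypothetical file name

        while (rows.hasNextValue()) {
            String[] row = rows.nextValue();
            // Phase 2: dispatch on the row-type column, convert the rest.
            String[] rest = Arrays.copyOfRange(row, 1, row.length);
            switch (row[0]) {
                case "1":
                    Row1 r1 = mapper.convertValue(rest, Row1.class);
                    break;
                case "2":
                    Row2 r2 = mapper.convertValue(rest, Row2.class);
                    break;
            }
        }
    }
}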
Upvotes: 1