DJ180

Reputation: 19854

Flexible schema possible with ORC or Parquet format?

My Java application consumes real-time data and then publishes it to an ORC file on S3.

The problem is that we don't know the schema of the file until we have processed all the records, rather than knowing it from the first record.

For example (illustrative field names): the first message might contain only `id` and `name`, while a message arriving later introduces an additional `email` field.

Because this is a real-time application, I don't want to process all the messages up front just to work out the schema, as that would be quite slow.

Is it possible to add to the schema as we process the data?

I've had a look at the Java examples here, but I don't see a way to do this.
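For reference, this is how writer creation looks in the core Java API as far as I can tell: the full schema (the `id`/`name` fields and the bucket path below are made up for illustration) has to be supplied before the first row is written.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcSchemaUpFront {
    public static void main(String[] args) throws Exception {
        // The complete schema is fixed here, before any rows are written;
        // I can't find any API to add a column to an already-open writer.
        TypeDescription schema =
                TypeDescription.fromString("struct<id:bigint,name:string>");
        Writer writer = OrcFile.createWriter(
                new Path("s3a://my-bucket/events.orc"), // bucket name made up
                OrcFile.writerOptions(new Configuration()).setSchema(schema));
        // ... fill and add VectorizedRowBatch rows here ...
        writer.close();
    }
}
```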

Would Parquet be any better here?

Upvotes: 1

Views: 851

Answers (1)

Jens Roland

Reputation: 27770

I think you may be trying to fit a round peg in a square hole. It sounds like you are ingesting a stream of events with an unknown schema, and you would like to store it in a format that optimizes for a known schema.

I suppose you could buffer a set number of events (say, 1 million) while keeping track of the schema, then flush the buffer to a file once that number is reached and start buffering again. The drawback is that each file will end up with a different schema, making it impractical to process data across multiple files.
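As a rough sketch of that buffering approach (the threshold, field names, and type inference below are made up, and the actual ORC write is left out):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: buffer events, grow the union of fields seen so far, and flush
// each full buffer to its own ORC file using that buffer's union schema.
public class BufferingOrcSink {
    private static final int FLUSH_THRESHOLD = 1_000_000; // e.g. 1M events

    private final List<Map<String, Object>> buffer = new ArrayList<>();
    private final Map<String, String> unionSchema = new LinkedHashMap<>(); // field -> ORC type

    public void onEvent(Map<String, Object> event) {
        // Record any fields we haven't seen before (type inference simplified)
        event.forEach((field, value) ->
                unionSchema.putIfAbsent(field, inferOrcType(value)));
        buffer.add(event);
        if (buffer.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    private void flush() {
        // Build an ORC schema string such as "struct<id:bigint,name:string>"
        // from the union of all fields seen in this buffer.
        StringBuilder sb = new StringBuilder("struct<");
        String sep = "";
        for (Map.Entry<String, String> e : unionSchema.entrySet()) {
            sb.append(sep).append(e.getKey()).append(':').append(e.getValue());
            sep = ",";
        }
        String schema = sb.append('>').toString();

        // Open a new ORC writer with this schema, write the buffered rows,
        // and close the file (the write itself works as in the examples
        // linked from the question). Then reset for the next batch; note
        // that the next file may well end up with a different schema.
        System.out.println("Flushing " + buffer.size() + " events as " + schema);
        buffer.clear();
        unionSchema.clear();
    }

    private String inferOrcType(Object value) {
        if (value instanceof Long || value instanceof Integer) return "bigint";
        if (value instanceof Double) return "double";
        if (value instanceof Boolean) return "boolean";
        return "string";
    }
}
```

In practice you would also want a time-based flush, so a slow stream doesn't sit in memory indefinitely.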

A different solution would be to look into schemaless data stores, although you don't get the same price-performance benefits as with ORC or Parquet on S3.

There are other strategies as well, but your best bet for a long-term solution is to talk to whoever manages the source of the events you are ingesting and find a way to determine the schema up front.

Upvotes: 1
