Reputation: 11
We are evaluating Avro vs. Thrift for storage. At this point Avro seems to be our choice, but the documentation states that the schema is stored alongside the data when serialized. Is there a way to avoid this? Since we are in charge of both producing and consuming the data, we want to see whether we can skip serializing the schema. Also, is the serialized data with the schema much larger than the data without it?
Upvotes: 1
Views: 3209
Reputation: 11
A little late to the party, but you don't actually need to store the actual schema with each and every record. You do, however, need a way to get back to the original schema from each record's serialized format.
Thus, you could use a schema store + custom serializer that writes the avro record content and the schema id. On read, you can read back in that schema ID, retrieve it from the schema store and then use that schema to rehydrate the record content. Bonus points for using a local cache if your schema store is remote.
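For illustration, here is a minimal sketch of that pattern using Avro's generic API. The SchemaStore interface, the 4-byte id prefix, and the class and method names are assumptions made up for the example; swap in whatever registry and framing you actually use.

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Hypothetical registry: maps schemas to small ids and back. Back it with a
// database table, a REST service, or a local file -- whatever you control.
interface SchemaStore {
    int idFor(Schema schema);
    Schema schemaFor(int id);
}

public class SchemaIdSerializer {
    private final SchemaStore store;

    public SchemaIdSerializer(SchemaStore store) {
        this.store = store;
    }

    // Writes [4-byte schema id][Avro binary body]; the schema text itself is never stored per record.
    public byte[] serialize(GenericRecord record) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ByteBuffer.allocate(4).putInt(store.idFor(record.getSchema())).array());
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // Reads the id back, fetches the schema from the store, and rehydrates the record.
    public GenericRecord deserialize(byte[] bytes) throws Exception {
        Schema schema = store.schemaFor(ByteBuffer.wrap(bytes, 0, 4).getInt());
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, 4, bytes.length - 4, null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }
}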
This is exactly the approach that Oracle's NoSQL DB takes to managing schema in a storage-efficient manner (it's also available under the AGPL license).
Full disclosure: I am not currently, and have never been, employed by Oracle or Sun, nor have I worked on the above store. I just came across it recently :)
Upvotes: 1
Reputation: 11
Thanks JGibel. Our data will eventually end up in HDFS, and the object container file format does ensure that the schema is only written once, as a header on the file.
For uses other than HDFS, I was under the wrong assumption that the schema would be attached to every encoded datum, but that's not the case: you need the schema to deserialize the data, but the serialized data does not have to carry the schema string with it.
E.g.
import java.io.ByteArrayOutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

DatumWriter<TransactionInfo> eventDatumWriter = new SpecificDatumWriter<TransactionInfo>(TransactionInfo.class);
TransactionInfo t1 = getTransaction();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(baos, null); // raw binary encoding, no schema written
eventDatumWriter.setSchema(t1.getSchema());
eventDatumWriter.write(t1, encoder);
encoder.flush();
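On the read side you supply the same schema (here via the generated TransactionInfo class) to a SpecificDatumReader. A sketch along the same lines, reusing baos from the write path above:

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

DatumReader<TransactionInfo> eventDatumReader = new SpecificDatumReader<TransactionInfo>(TransactionInfo.class);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(baos.toByteArray(), null);
TransactionInfo t2 = eventDatumReader.read(null, decoder);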
Upvotes: 0
Reputation: 257
I'm pretty sure you will always need the schema to be stored with the data, because Avro uses it when reading from and writing to the .avro file.
According to http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/avroschemas.html:
You apply a schema to the value portion of an Oracle NoSQL Database record using Avro bindings. These bindings are used to serialize values before writing them, and to deserialize values after reading them. The usage of these bindings requires your applications to use the Avro data format, which means that each stored value is associated with a schema.
As for the size difference: you only have to store the schema once, so in the grand scheme of things it doesn't make much of a difference. My schema takes up 105.5 KB (and that is a really large schema; yours shouldn't be that big) and each serialized value takes up 3.3 KB. I'm not sure what the difference would be for just the raw JSON of the data, but according to the link I posted:
Each value is stored without any metadata other than a small internal schema identifier, between 1 and 4 bytes in size.
But I believe that may just be for single, simple values.
This is on HDFS for me btw.
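For completeness, a rough sketch of how the object container file keeps the schema to a single copy in the file header (the file names and record.avsc are placeholders, not anything from my setup):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

Schema schema = new Schema.Parser().parse(new File("record.avsc")); // your schema definition
File avroFile = new File("records.avro");

DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
writer.create(schema, avroFile);   // schema written once, into the file header
// writer.append(record);          // appended values carry no schema of their own
writer.close();

// On read, the schema comes back out of the header; you don't have to supply it.
DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(avroFile, new GenericDatumReader<GenericRecord>());
Schema storedSchema = reader.getSchema();
reader.close();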
Upvotes: 0