The Avro specification states that extending a schema with an optional field is a backward-compatible change. Unfortunately, this does not work for us with plain binary streams, and I do not know how to fix it. I have written a small demo app to reproduce the problem: the producer creates an instance, serializes it, and writes it to a file; the consumer reads the file and deserializes the stream. If an optional field is added to the schema and the schema is recompiled (avro-maven-plugin), then instances serialized with the previous schema version can no longer be deserialized. The behavior is different with DataFileWriter/DataFileReader: there it works, but we need plain binary streams because we use Kafka and the messages carry the serialized data.
How can we achieve backward-compatibility?
schema version 1 (test.avsc):
{
  "name": "Test1",
  "type": "record",
  "namespace": "model",
  "fields": [
    { "name": "userid", "type": "string" }
  ]
}
schema version 2 (test.avsc):
{
  "name": "Test1",
  "type": "record",
  "namespace": "model",
  "fields": [
    { "name": "userid", "type": "string" },
    { "name": "department", "type": ["null", "string"], "default": null }
  ]
}
pom.xml
[...]
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
</dependency>
[...]
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
[...]
ProducerToOutfile.java
package producer;

import jakarta.xml.bind.DatatypeConverter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import model.Test1;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class ProducerToOutfile {

    private static final Logger LOGGER = LogManager.getLogger();
    private static final String FILE_NAME_SERIALISED_NOTICES = "minimalVersion.avro";

    public static void main(String[] args) {
        ProducerToOutfile producer = new ProducerToOutfile();
        Test1 test1 = new Test1();
        test1.setUserid("myuser");
        try {
            producer.serialiseToByteStream(test1);
            LOGGER.info("### Test-Object has been serialised.");
        } catch (IOException ex) {
            LOGGER.error("### Failed to serialise");
            throw new RuntimeException(ex);
        }
    }

    private void serialiseToByteStream(Test1 test1) throws IOException {
        byte[] bytes = null;
        try {
            if (test1 != null) {
                ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
                // plain binary encoding: no schema or header information is written to the stream
                BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null);
                DatumWriter<Test1> datumWriter = new SpecificDatumWriter<>(Test1.class);
                datumWriter.write(test1, binaryEncoder);
                binaryEncoder.flush();
                byteArrayOutputStream.close();
                bytes = byteArrayOutputStream.toByteArray();
                LOGGER.info("serialized Test1 object ='{}'", DatatypeConverter.printHexBinary(bytes));
                Files.write(Paths.get(FILE_NAME_SERIALISED_NOTICES), bytes);
            }
        } catch (Exception e) {
            LOGGER.error("Unable to serialize Test1 ", e);
        }
    }
}
ConsumerFromFile.java
package consumer;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import model.Test1;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class ConsumerFromFile {

    private static final Logger LOGGER = LogManager.getLogger();
    private static final String FILE_NAME_SERIALISED_NOTICES = "minimalVersion.avro";
    //private static final String FILE_NAME_SERIALISED_NOTICES = "NotizByteStream.avro";

    public static void main(String[] args) throws IOException {
        ConsumerFromFile consumer = new ConsumerFromFile();
        consumer.deserializeByteStream(new File(FILE_NAME_SERIALISED_NOTICES));
    }

    public void deserializeByteStream(File file) {
        Test1 test1 = null;
        LOGGER.info("### Deserialising ByteStream Test1-object");
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(file.getPath()));
            // reads with the compiled (reader) schema only - the writer schema is unknown here
            DatumReader<Test1> datumReader = new SpecificDatumReader<>(Test1.class);
            Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
            test1 = datumReader.read(null, decoder);
            LOGGER.info("Deserialized byte stream data='{}'", test1.toString());
        } catch (Exception e) {
            LOGGER.error("### Unable to Deserialize bytes[] ", e);
        }
    }
}
Exception when bytes serialized with the V1 schema are read by the consumer compiled against the V2 schema (the reader expects a union index for the new department field, but the stream has already ended):
2024-05-06 12:01:41.400 [main] ERROR consumer.ConsumerFromFile - ### Unable to Deserialize bytes[]
java.io.EOFException: null
at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:514) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:155) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:465) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:282) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.specific.SpecificDatumReader.readField(SpecificDatumReader.java:136) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.specific.SpecificDatumReader.readRecord(SpecificDatumReader.java:123) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161) ~[avro-1.11.3.jar:1.11.3]
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) ~[avro-1.11.3.jar:1.11.3]
at consumer.ConsumerFromFile.deserializeByteStream(ConsumerFromFile.java:34) [classes/:?]
at consumer.ConsumerFromFile.main(ConsumerFromFile.java:22) [classes/:?]
Solution:
The key to taking advantage of Avro's schema evolution is to have both the writer's schema (the schema used for serialization) and the reader's schema (the schema used for deserialization) available at the same time.
Then you can tell the reader which schema the data was written with. Using a GenericDatumReader makes it possible to access the fields in the byte stream and construct the model class manually; for complex model classes this can be tiresome work.
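For illustration, a minimal sketch of that manual route (the method name is made up for this example; it assumes the serialized bytes have already been read from the file or Kafka message, that the V1 writer schema shown above is available as a string, and it uses GenericDatumReader/GenericRecord from org.apache.avro.generic):
private Test1 deserializeWithGenericReader(byte[] bytes, String writerSchemaString) throws IOException {
    Schema writerSchema = new Schema.Parser().parse(writerSchemaString);  // schema the data was written with (V1)
    Schema readerSchema = Test1.getClassSchema();                         // schema this consumer was compiled against (V2)
    // the generic reader resolves the writer schema against the reader schema and yields a GenericRecord
    DatumReader<GenericRecord> genericReader = new GenericDatumReader<>(writerSchema, readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    GenericRecord record = genericReader.read(null, decoder);
    // construct the model class manually from the generic record
    Test1 test1 = new Test1();
    test1.setUserid(record.get("userid").toString());
    // "department" was not written by V1, so it resolves to its default (null) and needs no handling here
    return test1;
}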
Avro also provides a ResolvingDecoder, which can be used to construct the model class automatically. However, you have to append any new fields at the end of the schema definition and must not change the order of the existing fields between the two schema versions, because Avro fills in the fields in their order of occurrence. I think that is an acceptable compromise. Removing fields between V1 and V2 is no problem; only the remaining fields are used.
The method used for deserialization:
public void deserializeBytes(File file) {
    Test1 test1;
    String writerSchemaString = "schema as string ...";
    Schema writerSchema = new Schema.Parser().parse(writerSchemaString);  // schema the data was written with
    Schema readerSchema = Test1.getClassSchema();                         // schema this consumer was compiled against
    try {
        byte[] bytes = Files.readAllBytes(Paths.get(file.getPath()));
        Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        // resolves the writer schema against the reader schema while decoding
        ResolvingDecoder resolvingDecoder = DecoderFactory.get().resolvingDecoder(writerSchema, readerSchema, decoder);
        // null means the field order is unchanged, i.e. new fields were only appended
        Schema.Field[] fieldOrder = resolvingDecoder.readFieldOrderIfDiff();
        if (fieldOrder == null) {
            DatumReader<Test1> datumReader = new SpecificDatumReader<>(Test1.class);
            test1 = datumReader.read(null, resolvingDecoder);
            LOGGER.info("Deserialized byte stream data='{}'", test1);
        } else {
            LOGGER.info("Object has to be created by means of GenericDatumReader to deal with the changed field order");
        }
    } catch (Exception e) {
        LOGGER.error("### Unable to Deserialize bytes[] ", e);
    }
}
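For completeness: if the field-order check is not needed, the writer/reader resolution can also be left entirely to the SpecificDatumReader by passing it both schemas; it performs the same resolving step internally while reading. Again a minimal sketch under the same assumptions (illustrative method name, writer schema available as a string):
private Test1 deserializeWithBothSchemas(byte[] bytes, String writerSchemaString) throws IOException {
    Schema writerSchema = new Schema.Parser().parse(writerSchemaString);  // schema the data was written with (V1)
    Schema readerSchema = Test1.getClassSchema();                         // schema this consumer was compiled against (V2)
    // the specific reader resolves the writer schema against the reader schema while reading
    DatumReader<Test1> datumReader = new SpecificDatumReader<>(writerSchema, readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    return datumReader.read(null, decoder);
}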