user1950349

Reputation: 5146

JsonParseException: Invalid UTF-8 middle byte 0x2d

I have a JSON string from which I am creating an InputStream object, as shown below, and then a GenericRecord object, since I am trying to serialize my JSON object against an Avro schema.

InputStream input = new ByteArrayInputStream(jsonString.getBytes());
DataInputStream din = new DataInputStream(input);

Decoder decoder = DecoderFactory.get().jsonDecoder(schema, din);

DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
// below line is throwing exception
GenericRecord datum = reader.read(null, decoder);   

Below is the exception I am getting:

org.codehaus.jackson.JsonParseException: Invalid UTF-8 middle byte 0x2d at [Source: java.io.DataInputStream@562aee31; line: 1, column: 74]

And here is the actual JSON string on which this exception is happening:

{"name":"car_test","attr_value":"2006|Renault|Megane II Coupé-Cabriolet|null|null|null|null|0|Wed Feb 03 10:00:59 GMT-07:00 2016|1|77|null|null|null|null","data_id":900}

I did some research and found out that I need to create the ByteArrayInputStream with an explicit UTF-8 encoding, as shown below:

InputStream input = new ByteArrayInputStream(jsonString.getBytes(StandardCharsets.UTF_8));

But my question is: what is the reason for this exception, and why does it happen on the JSON string above? And is using UTF-8 the right fix for this?
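To see why specifying the charset matters, here is a minimal standalone sketch (the class name is hypothetical, not part of the original code): String.getBytes() with no argument uses the JVM's platform default charset, so the same string can yield different bytes on different machines, while the StandardCharsets overloads are deterministic.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // getBytes() with no argument uses the platform default charset,
        // which varies with the OS and locale settings
        System.out.println("platform default: " + Charset.defaultCharset());

        String s = "Coup\u00e9"; // "Coupé"
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // é -> 0xC3 0xA9 (two bytes)
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1); // é -> 0xE9 (one byte)

        System.out.println("UTF-8 length:      " + utf8.length);   // 6
        System.out.println("ISO-8859-1 length: " + latin1.length); // 5
    }
}
```

If the bytes were produced with one charset and decoded with another, non-ASCII characters like é are exactly where the two encodings disagree.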

What does this error mean: Invalid UTF-8 middle byte 0x2d?

Upvotes: 0

Views: 6110

Answers (2)

StaxMan

Reputation: 116472

Instead of using the low-level Avro decoder functionality, you might be interested in a more convenient approach with the Jackson Avro module:

https://github.com/FasterXML/jackson-dataformat-avro/issues

which lets you provide an Avro schema but still work with POJOs. You can also use regular Jackson JSON databinding (a different ObjectMapper) for the JSON part: read the JSON into a POJO, then write the POJO out as Avro (or other combinations).

Upvotes: 0

wero

Reputation: 32980

You start with a Java Unicode String jsonString.

You then convert it into bytes using String.getBytes(). Since you didn't specify the encoding, the platform default is used, which in your case is most likely ISO-8859-1.

Now you parse the JSON from the (Data)InputStream, and Avro's JSON decoder (Jackson underneath) decodes the bytes as UTF-8. In ISO-8859-1 the é is the single byte 0xE9, which UTF-8 interprets as the start of a multi-byte sequence. The decoder therefore expects a continuation ("middle") byte in the range 0x80–0xBF next, but instead finds the '-' of "Coupé-Cabriolet", which is 0x2D. That is not a valid UTF-8 byte sequence, hence "Invalid UTF-8 middle byte 0x2d".

So in the end it is a mismatch between the actual encoding (ISO 8859-1) and the expected encoding (UTF-8).
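The mismatch can be reproduced in plain Java, independent of Avro (a small standalone sketch; the class name is hypothetical). Note that java.lang.String substitutes U+FFFD for malformed input where Jackson throws an exception instead, but the underlying decoding failure is the same.

```java
import java.nio.charset.StandardCharsets;

public class MiddleByteDemo {
    public static void main(String[] args) {
        // In ISO-8859-1, é is the single byte 0xE9, and '-' is 0x2D
        String eDash = "\u00e9-"; // "é-"
        byte[] latin1 = eDash.getBytes(StandardCharsets.ISO_8859_1); // {0xE9, 0x2D}
        System.out.printf("bytes: 0x%02X 0x%02X%n", latin1[0] & 0xFF, latin1[1] & 0xFF);

        // A UTF-8 decoder treats 0xE9 as the start of a three-byte sequence
        // and expects a continuation ("middle") byte in 0x80-0xBF next;
        // 0x2D ('-') is outside that range, so the sequence is malformed.
        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(decoded.codePointAt(0) == 0xFFFD); // true: é was lost

        // Encoding and decoding with the same charset round-trips fine
        byte[] utf8 = eDash.getBytes(StandardCharsets.UTF_8); // {0xC3, 0xA9, 0x2D}
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(eDash)); // true
    }
}
```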

You can solve it as you did, or simply avoid going from String to bytes at all, since DecoderFactory.jsonDecoder also accepts the JSON as a String:

Decoder decoder = DecoderFactory.get().jsonDecoder(schema, jsonString);

Upvotes: 1
