Reputation: 5146
I have a JSON string from which I am creating an InputStream,
as shown below, and then a GenericRecord object,
because I am trying to serialize my JSON to an Avro schema.
InputStream input = new ByteArrayInputStream(jsonString.getBytes());
DataInputStream din = new DataInputStream(input);
Decoder decoder = DecoderFactory.get().jsonDecoder(schema, din);
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
// the following line throws the exception
GenericRecord datum = reader.read(null, decoder);
Below is the exception I am getting:
org.codehaus.jackson.JsonParseException: Invalid UTF-8 middle byte 0x2d at [Source: java.io.DataInputStream@562aee31; line: 1, column: 74]
And here is the actual JSON string on which this exception is happening:
{"name":"car_test","attr_value":"2006|Renault|Megane II Coupé-Cabriolet|null|null|null|null|0|Wed Feb 03 10:00:59 GMT-07:00 2016|1|77|null|null|null|null","data_id":900}
I did some research and found out that I need to create the ByteArrayInputStream
with an explicit UTF-8 encoding, as shown below:
InputStream input = new ByteArrayInputStream(jsonString.getBytes(StandardCharsets.UTF_8));
But my question is: what is the reason for this exception, and why does it happen on the JSON string above? Is using UTF-8 the right fix?
And what exactly does the error Invalid UTF-8 middle byte 0x2d
mean?
Upvotes: 0
Views: 6110
Reputation: 116472
Instead of using low-level Avro decoder functionality, you might be interested in a more convenient approach, the Jackson Avro module:
https://github.com/FasterXML/jackson-dataformat-avro/issues
with which you can provide an Avro schema but still work with POJOs.
You can also use regular Jackson JSON databinding (a separate ObjectMapper
) for the JSON part: read the JSON into a POJO, then write the POJO out as Avro (or other combinations).
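A minimal sketch of that combination, assuming a hypothetical CarRecord POJO whose fields mirror the question's JSON (the class name and the shortened attr_value are illustrative, not from the question):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;

public class JsonToAvroViaJackson {
    // hypothetical POJO mirroring the question's fields
    public static class CarRecord {
        public String name;
        public String attr_value;
        public long data_id;
    }

    public static void main(String[] args) throws Exception {
        // attr_value shortened for readability
        String json = "{\"name\":\"car_test\",\"attr_value\":\"2006|Renault\",\"data_id\":900}";

        // plain JSON ObjectMapper: JSON text -> POJO
        CarRecord rec = new ObjectMapper().readValue(json, CarRecord.class);

        // AvroMapper: POJO -> Avro binary, guided by a schema
        AvroMapper avroMapper = new AvroMapper();
        AvroSchema avroSchema = avroMapper.schemaFor(CarRecord.class); // or wrap a parsed .avsc
        byte[] avroBytes = avroMapper.writer(avroSchema).writeValueAsBytes(rec);

        // and back again
        CarRecord back = avroMapper.readerFor(CarRecord.class)
                                   .with(avroSchema)
                                   .readValue(avroBytes);
        System.out.println(back.name); // car_test
    }
}
```

Note that no manual String-to-byte conversion happens here, so the encoding pitfall from the question never arises.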
Upvotes: 0
Reputation: 32980
You start with a Java Unicode String, jsonString.
You then convert it into bytes using String.getBytes()
. Since you didn't specify an encoding, the platform default is used, which in your case is most likely ISO 8859-1.
You then parse the JSON from the (Data)InputStream
, and Avro (via Jackson) decodes those bytes as UTF-8. In ISO 8859-1 the é of Coupé-Cabriolet is the single byte 0xE9. A UTF-8 decoder treats 0xE9 as the lead byte of a multi-byte sequence and expects the following bytes to be continuation ("middle") bytes in the range 0x80–0xBF. The next byte, however, is the - after the é, which is 0x2D — not a valid continuation byte, and exactly the invalid middle byte the exception reports.
So in the end it is a mismatch between the actual encoding (ISO 8859-1) and the expected encoding (UTF-8).
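The mismatch is easy to reproduce with plain JDK classes; here ISO 8859-1 is forced explicitly to stand in for the platform default:

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        String text = "\u00E9-"; // "é-", as in Coupé-Cabriolet

        // what getBytes() produces when the platform default is ISO 8859-1
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.printf("%02X %02X%n", bytes[0], bytes[1]); // E9 2D

        // a UTF-8 decoder treats 0xE9 as a multi-byte lead byte, rejects
        // 0x2D as its continuation byte, and substitutes U+FFFD for the é
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.charAt(0) == '\uFFFD'); // true

        // with UTF-8 on both sides the text round-trips unharmed
        String roundTrip = new String(text.getBytes(StandardCharsets.UTF_8),
                                      StandardCharsets.UTF_8);
        System.out.println(text.equals(roundTrip)); // true
    }
}
```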
You can solve this the way you did, or avoid going from String to bytes altogether, since jsonDecoder also accepts the JSON as a String:
Decoder decoder = DecoderFactory.get().jsonDecoder(schema, jsonString);
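Put together, a sketch of the corrected pipeline (the inline schema here is an assumption matching the fields of the question's JSON; the real schema may differ):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

public class JsonToAvro {
    public static void main(String[] args) throws Exception {
        // assumed schema; the question's real schema may differ
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"test\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"attr_value\",\"type\":\"string\"},"
            + "{\"name\":\"data_id\",\"type\":\"long\"}]}");

        String jsonString = "{\"name\":\"car_test\","
            + "\"attr_value\":\"Megane II Coup\u00E9-Cabriolet\","
            + "\"data_id\":900}";

        // option 1: hand the String straight to the decoder, no bytes involved
        Decoder decoder = DecoderFactory.get().jsonDecoder(schema, jsonString);

        // option 2: if an InputStream is required, encode explicitly as UTF-8
        InputStream input = new ByteArrayInputStream(
            jsonString.getBytes(StandardCharsets.UTF_8));
        Decoder streamDecoder = DecoderFactory.get().jsonDecoder(schema, input);

        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        GenericRecord datum = reader.read(null, decoder);
        System.out.println(datum.get("name")); // car_test
    }
}
```

Either variant keeps the byte encoding and the decoder's expectation (UTF-8) in agreement, which is all the original fix does.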
Upvotes: 1