Reputation: 1093
I have this scenario: we have to provide functionality that takes any type of RDD, in generics notation RDD[T], and serializes and saves it to HDFS as an Avro Data File.
Beware that the RDD could hold anything, so the functionality should be generic over the given RDD type, for example RDD[(String, AnyBusinessObject)] or RDD[(String, Date, OtherBusinessObject)].
The question is: how can we infer the Avro schema and provide Avro serialization for an arbitrary class type, so that we can save it as an Avro Data File?
The functionality is actually already built, but it uses Java serialization, which obviously carries a space and time penalty, so we would like to refactor it. We can't use DataFrames.
Upvotes: 0
Views: 787
Reputation: 7056
You can write Avro files using the GenericRecord API (see the "Serializing and deserializing without code generation" section of the Avro documentation). However, you still need to have the Avro schema.
If you have a DataFrame, Spark handles all of this for you because Spark knows how to do the conversion from Spark SQL types to Avro types.
Since you say you can't use DataFrames, you'll have to do this schema generation yourself. One option is to use Avro's ReflectData API.
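For illustration, a minimal sketch of schema inference with ReflectData, assuming a flat, Java-bean-like class (the class name here is hypothetical; ReflectData relies on Java reflection, so Scala-specific types such as Option or Scala collections may not map cleanly):

```scala
import org.apache.avro.reflect.ReflectData

// Hypothetical flat business class, used only to illustrate schema inference
class AnyBusinessObject(var id: String = "", var amount: Double = 0.0)

// ReflectData walks the class's fields via reflection and builds an Avro schema
val schema = ReflectData.get().getSchema(classOf[AnyBusinessObject])
println(schema.toString(true)) // pretty-printed Avro schema JSON
```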
Then, once you have the schema, you'll do a map to transform all of the elements in the RDD into GenericRecords and use a GenericDatumWriter to write them to file.
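A rough sketch of that pipeline is below. The helper name, the reflective field copy, and the per-partition file naming are all assumptions of mine, not an established API: the naive field mapping only handles flat objects whose field names and types line up with the reflected schema, and it assumes the executors can reach HDFS through the default Hadoop configuration.

```scala
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.reflect.ReflectData
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

import scala.collection.JavaConverters._

// Hypothetical helper: writes any RDD[T] as Avro data files under outputDir
def saveAsAvro[T](rdd: RDD[T], clazz: Class[T], outputDir: String): Unit = {
  // Infer the schema once on the driver and ship it as a JSON string
  // (the Schema object itself is not reliably serializable)
  val schemaJson = ReflectData.get().getSchema(clazz).toString

  rdd.foreachPartition { records =>
    val schema = new Schema.Parser().parse(schemaJson)
    val fs     = FileSystem.get(new Configuration())
    val part   = new Path(s"$outputDir/part-${java.util.UUID.randomUUID()}.avro")
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, fs.create(part))
    try {
      records.foreach { obj =>
        // Naive reflective copy: one GenericRecord field per schema field
        val record = new GenericData.Record(schema)
        schema.getFields.asScala.foreach { f =>
          val field = obj.getClass.getDeclaredField(f.name())
          field.setAccessible(true)
          record.put(f.name(), field.get(obj))
        }
        writer.append(record)
      }
    } finally {
      writer.close()
    }
  }
}
```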
I'd seriously reconsider these requirements, though. IMO, a better design would be to convert the RDD to a DataFrame so that you can let Spark do the heavy lifting of writing Avro. Or... why even bother with Avro? Just use a file format that allows a generic schema, like JSON.
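For comparison, the DataFrame route could look roughly like this. It assumes Spark 2.4+ with the spark-avro module on the classpath (so the "avro" data source name resolves; older versions used "com.databricks.spark.avro"), an RDD of case classes so toDF() can derive the schema, and a placeholder output path:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical flat case class standing in for the real business object
case class AnyBusinessObject(id: String, amount: Double)

val spark = SparkSession.builder().appName("avro-example").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(AnyBusinessObject("a", 1.0)))

// Spark derives the schema from the case class and handles the Avro conversion
rdd.toDF().write.format("avro").save("hdfs:///path/to/output")
```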
Upvotes: 1