Reputation: 75
I have some custom Java objects (which are internally composed of other custom objects). I wish to write these to HDFS in Parquet format.
Even after a lot of searching, most suggestions seem to be around using an Avro format and the internal AvroConverter from Parquet to store the objects.
Seeing this here and here, it seems I will have to write a custom WriteSupport to accomplish this.
Is there a better way to do this? Which is more optimal: writing custom objects directly, or using something like Avro as an intermediate schema definition?
Upvotes: 4
Views: 6722
Reputation: 334
You can use Avro reflection to get the schema. The code for that looks like ReflectData.AllowNull.get().getSchema(CustomClass.class). I have an example Parquet demo code snippet.
Essentially the custom Java object writer is this:
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import static org.apache.parquet.hadoop.ParquetFileWriter.Mode.OVERWRITE;
import static org.apache.parquet.hadoop.metadata.CompressionCodecName.SNAPPY;

Path dataFile = new Path("/tmp/demo.snappy.parquet");

// Write as a Parquet file; the schema is derived from the Team class via Avro reflection.
try (ParquetWriter<Team> writer = AvroParquetWriter.<Team>builder(dataFile)
        .withSchema(ReflectData.AllowNull.get().getSchema(Team.class))
        .withDataModel(ReflectData.get())
        .withConf(new Configuration())
        .withCompressionCodec(SNAPPY)
        .withWriteMode(OVERWRITE)
        .build()) {
    for (Team team : teams) {
        writer.write(team);
    }
}
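As a quick sanity check, you can read the objects back with the same reflect data model. This read-back part is my own sketch rather than part of the demo, and the exact reader configuration can vary between Parquet versions:

import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

// Read the Team objects back using the same reflect-based data model.
try (ParquetReader<Team> reader = AvroParquetReader.<Team>builder(dataFile)
        .withDataModel(ReflectData.get())
        .withConf(new Configuration())
        .build()) {
    Team team;
    while ((team = reader.read()) != null) {
        System.out.println(team);
    }
}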
You can replace Team with your custom Java class. And you can see that the Team class includes a list of Person objects, which is similar to your requirement, and Avro can get the schema without any problem.
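For illustration, the nested POJOs could look something like the following. The field names here are hypothetical, not the actual fields of the demo's Team and Person classes; the main thing Avro reflection needs is a no-arg constructor:

import java.util.List;

// Hypothetical nested POJOs; Avro reflection walks the fields to build the schema.
public class Person {
    private String name;
    private int age;

    public Person() {}  // no-arg constructor for Avro reflection
}

public class Team {
    private String name;
    private List<Person> members;  // nested custom objects are handled too

    public Team() {}
}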
And if you want to write to HDFS, you may need to replace the path with an HDFS URI, but I haven't tried that personally.
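A minimal sketch of what that might look like, assuming a namenode at hdfs://namenode:8020 (adjust to your cluster's fs.defaultFS) and then passing conf to withConf(conf) in the builder above:

// Point the writer at HDFS instead of the local file system (host/port are assumptions).
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode:8020");

// Either a fully qualified URI or a path resolved against fs.defaultFS works.
Path dataFile = new Path("hdfs://namenode:8020/data/demo.snappy.parquet");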
BTW, my code is inspired by this parquet-example code.
Upvotes: 7