learningTheRopes

Reputation: 75

Writing custom Java objects to Parquet

I have some custom Java objects (which are internally composed of other custom objects). I wish to write these to HDFS in Parquet format.

Even after a lot of searching, most suggestions seem to revolve around using an Avro schema and Parquet's internal AvroConverter to store the objects.

Seeing this here and here, it seems like I will have to write a custom WriteSupport to accomplish this.

Is there a better way to do this? Which is more efficient: writing custom objects directly, or using something like Avro as an intermediate schema definition?

Upvotes: 4

Views: 6722

Answers (1)

Haojin

Reputation: 334

You can use Avro reflection to get the schema, e.g. ReflectData.AllowNull.get().getSchema(CustomClass.class). I have an example Parquet demo code snippet.

Essentially, the writer for custom Java objects looks like this:

    import org.apache.avro.reflect.ReflectData;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    import static org.apache.parquet.hadoop.ParquetFileWriter.Mode.OVERWRITE;
    import static org.apache.parquet.hadoop.metadata.CompressionCodecName.SNAPPY;

    Path dataFile = new Path("/tmp/demo.snappy.parquet");

    // Write the objects as a Snappy-compressed Parquet file, deriving the
    // (nullable-friendly) schema from the class via Avro reflection.
    try (ParquetWriter<Team> writer = AvroParquetWriter.<Team>builder(dataFile)
            .withSchema(ReflectData.AllowNull.get().getSchema(Team.class))
            .withDataModel(ReflectData.get())
            .withConf(new Configuration())
            .withCompressionCodec(SNAPPY)
            .withWriteMode(OVERWRITE)
            .build()) {
        for (Team team : teams) {
            writer.write(team);
        }
    }
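
To sanity-check the output, you can read the objects back with AvroParquetReader from the same parquet-avro library. This read-back snippet is my own addition, not part of the original demo, and assumes the same dataFile and Team class as above:

    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;

    // Read the teams back using the same reflect data model.
    try (ParquetReader<Team> reader = AvroParquetReader.<Team>builder(dataFile)
            .withDataModel(ReflectData.get())
            .withConf(new Configuration())
            .build()) {
        Team team;
        while ((team = reader.read()) != null) {
            System.out.println(team);
        }
    }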

You can replace Team with your own custom Java class. Note that the Team class itself includes a list of Person objects, which matches your requirement of nested custom objects, and Avro can derive the schema without any problem.
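
For illustration, the two classes could be plain POJOs along these lines (the field names here are my guesses at the demo's shape, not its actual fields):

    import java.util.List;

    // Hypothetical shape of the demo classes: a Team nests a list of Person,
    // and Avro reflection derives the schema directly from these fields.
    public class Person {
        private String name;
        private int age;
    }

    public class Team {
        private String name;
        private List<Person> members;
    }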

And if you want to write to HDFS, you may need to replace the path with an HDFS URI. But I didn't try it personally.
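
For example, an untested sketch of pointing the writer at HDFS (the NameNode host and port below are placeholders, not values from the original demo):

    // Untested: use an hdfs:// URI instead of a local path.
    // Replace namenode:8020 with your cluster's actual NameNode address.
    Path dataFile = new Path("hdfs://namenode:8020/tmp/demo.snappy.parquet");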

BTW, my code is inspired by this parquet-example code.

Upvotes: 7
