Viacheslav Shalamov

Reputation: 4427

How to define a Parquet schema for ParquetOutputFormat for a Hadoop job in Java?

I have a Hadoop job in Java which uses the sequence file output format:

job.setOutputFormatClass(SequenceFileOutputFormat.class);

I want to use Parquet format instead. I tried to set it in the naive way:

job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setOutputPath(job, output);
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
ParquetOutputFormat.setCompressOutput(job, true);

But when it comes to writing the job's result to disk, the job fails:

Error: java.lang.NullPointerException: writeSupportClass should not be null
    at parquet.Preconditions.checkNotNull(Preconditions.java:38)
    at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:326)

It seems that Parquet needs a schema to be set, but I couldn't find any manual or guide on how to do that in my case. My Reducer class tries to write three long values on each line, using org.apache.hadoop.io.LongWritable as the key and org.apache.mahout.cf.taste.hadoop.EntityEntityWritable as the value.

How can I define a schema for that?

Upvotes: 3

Views: 1808

Answers (1)

m.semnani

Reputation: 797

You have to specify a parquet.hadoop.api.WriteSupport implementation for your job (e.g. parquet.proto.ProtoWriteSupport for Protocol Buffers or parquet.avro.AvroWriteSupport for Avro):

ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);

When using Protocol Buffers, also specify the protobuf class:

ProtoParquetOutputFormat.setProtobufClass(job, your-protobuf-class.class);

And when using Avro, introduce the schema like this:

AvroParquetOutputFormat.setSchema(job, your-avro-object.SCHEMA);
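For the question's concrete case (three long values per output row), the Avro route is probably the simplest. Below is a minimal sketch, assuming the pre-Apache parquet.avro namespace seen in the stack trace; the record name Triple and the field names userId, itemId and value are hypothetical placeholders for whatever your data actually means:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.mapreduce.Job;
import parquet.avro.AvroParquetOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;

// Hypothetical schema: one record per output row, holding three longs.
Schema schema = SchemaBuilder.record("Triple")
        .fields()
        .requiredLong("userId")
        .requiredLong("itemId")
        .requiredLong("value")
        .endRecord();

// Driver setup: AvroParquetOutputFormat registers AvroWriteSupport itself,
// so no separate setWriteSupportClass call is needed.
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, schema);
AvroParquetOutputFormat.setOutputPath(job, output);
AvroParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);

// In the reducer (now a Reducer<..., ..., Void, GenericRecord>), emit one
// GenericRecord per row; ParquetOutputFormat expects a null (Void) key.
GenericRecord row = new GenericData.Record(schema);
row.put("userId", 1L);
row.put("itemId", 2L);
row.put("value", 3L);
context.write(null, row);

Note that the reducer's output key and value classes then become Void and GenericRecord rather than LongWritable and EntityEntityWritable.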

Upvotes: 3
