Viacheslav Shalamov

Reputation: 4427

How to define a Parquet schema for ParquetOutputFormat for a Hadoop job in Java?

I have a Hadoop job in Java which uses the sequence file output format:

job.setOutputFormatClass(SequenceFileOutputFormat.class);

I want to use Parquet format instead. I tried to set it in the naive way:

job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setOutputPath(job, output);
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
ParquetOutputFormat.setCompressOutput(job, true);

But when it comes to writing the job's result to disk, the job fails:

Error: java.lang.NullPointerException: writeSupportClass should not be null
    at parquet.Preconditions.checkNotNull(Preconditions.java:38)
    at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:326)

It seems that Parquet needs a schema to be set, but I couldn't find any manual or guide on how to do that in my case. My Reducer class tries to write three long values on each line, using org.apache.hadoop.io.LongWritable as the key and org.apache.mahout.cf.taste.hadoop.EntityEntityWritable as the value.

How can I define a schema for that?

Upvotes: 3

Views: 1808

Answers (1)

m.semnani

Reputation: 797

You have to specify a parquet.hadoop.api.WriteSupport implementation for your job (e.g. parquet.proto.ProtoWriteSupport for Protocol Buffers or parquet.avro.AvroWriteSupport for Avro):

ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);

When using Protocol Buffers, also specify the protobuf class:

ProtoParquetOutputFormat.setProtobufClass(job, your-protobuf-class.class);

And when using Avro, introduce the schema like this:

AvroParquetOutputFormat.setSchema(job, your-avro-object.SCHEMA);
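For the question's concrete case (three long values per output row), the Avro route is probably the simplest. Below is a minimal sketch, assuming the pre-Apache parquet.avro namespace seen in the stack trace; the record name Triple and the field names userId, itemId and value are hypothetical placeholders for whatever your data actually means:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.mapreduce.Job;
import parquet.avro.AvroParquetOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;

// Hypothetical schema: one record per output row, holding three longs.
Schema schema = SchemaBuilder.record("Triple")
        .fields()
        .requiredLong("userId")
        .requiredLong("itemId")
        .requiredLong("value")
        .endRecord();

// Driver setup: AvroParquetOutputFormat registers AvroWriteSupport itself,
// so no separate setWriteSupportClass call is needed.
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, schema);
AvroParquetOutputFormat.setOutputPath(job, output);
AvroParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);

// In the reducer (now a Reducer<..., ..., Void, GenericRecord>), emit one
// GenericRecord per row; ParquetOutputFormat expects a null (Void) key.
GenericRecord row = new GenericData.Record(schema);
row.put("userId", 1L);
row.put("itemId", 2L);
row.put("value", 3L);
context.write(null, row);

Note that the reducer's output key and value classes then become Void and GenericRecord rather than LongWritable and EntityEntityWritable.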

Upvotes: 3
