Reputation: 4427
I have a Hadoop job in Java that uses the sequence file output format:
job.setOutputFormatClass(SequenceFileOutputFormat.class);
I want to use Parquet format instead. I tried to set it in the naive way:
job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setOutputPath(job, output);
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
ParquetOutputFormat.setCompressOutput(job, true);
But when it comes to writing the job's result to disk, the job fails:
Error: java.lang.NullPointerException: writeSupportClass should not be null
at parquet.Preconditions.checkNotNull(Preconditions.java:38)
at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:326)
It seems that Parquet needs a schema to be set, but I couldn't find any manual or guide on how to do that in my case.
My Reducer class tries to write three long values on each line, using org.apache.hadoop.io.LongWritable as the key and org.apache.mahout.cf.taste.hadoop.EntityEntityWritable as the value.
How can I define a schema for that?
Upvotes: 3
Views: 1808
Reputation: 797
You have to specify a parquet.hadoop.api.WriteSupport implementation for your job (e.g. parquet.proto.ProtoWriteSupport for Protocol Buffers or parquet.avro.AvroWriteSupport for Avro).
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
When using Protocol Buffers, also specify the protobuf class:
ProtoParquetOutputFormat.setProtobufClass(job, your-protobuf-class.class);
and when using Avro, set the schema like this:
AvroParquetOutputFormat.setSchema(job, your-avro-object.SCHEMA);
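For the case in your question (a LongWritable key and a Mahout EntityEntityWritable value, i.e. three longs per output record), the Avro route is usually the simplest. Below is only a minimal sketch, assuming the parquet-avro module is on the classpath; the record and field names (Triple, key, itemA, itemB) and the EntityEntityWritable accessors getAID()/getBID() are my assumptions for illustration, not something taken from your job:
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.mapreduce.Job;
import parquet.avro.AvroParquetOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;

// One Avro record per output line, holding the three long values.
// (Record and field names are illustrative only.)
Schema schema = SchemaBuilder
    .record("Triple").namespace("example")
    .fields()
      .requiredLong("key")
      .requiredLong("itemA")
      .requiredLong("itemB")
    .endRecord();

// Job setup: AvroParquetOutputFormat supplies AvroWriteSupport itself;
// setSchema registers the schema that the write support will use.
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setOutputPath(job, output);
AvroParquetOutputFormat.setSchema(job, schema);
AvroParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);

// In the reducer, build a GenericRecord and emit it with a null key:
GenericRecord record = new GenericData.Record(schema);
record.put("key", key.get());            // the LongWritable key
record.put("itemA", value.getAID());     // assumed EntityEntityWritable accessors
record.put("itemB", value.getBID());
context.write(null, record);
The reducer's output key type effectively becomes Void (Parquet ignores the key) and the output value type becomes GenericRecord, so the reducer signature and job.setOutputKeyClass/setOutputValueClass have to be adjusted accordingly.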
Upvotes: 3