Reputation: 860
I am trying to convert an HDFS text file to Parquet format using MapReduce in Java. I am a beginner at this and have been unable to find any direct references.
Should the conversion be text file -> Avro -> Parquet?
Upvotes: 3
Views: 3056
Reputation: 178263
A text file (whether in HDFS or not) has no inherent structure or schema. When using MapReduce, you will need an Avro Schema and a mapper function that parses each input line so that you can create an Avro GenericRecord.
Your mapper class will look something like this.
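import java.io.IOException;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TextToAvroParquetMapper
        extends Mapper<LongWritable, Text, Void, GenericRecord> {
    // mySchema is your Avro Schema, defined elsewhere (for example, as a static constant).
    private GenericRecord myGenericRecord = new GenericData.Record(mySchema);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the value yourself here,
        // calling "put" on the Avro GenericRecord,
        // once for each field. The GenericRecord
        // object is reused for every map call.
        context.write(null, myGenericRecord);
    }
}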
The input key/value pair types are Hadoop's LongWritable and Text, and the output key/value pair types are Void (null keys) and the Avro GenericRecord itself.
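The "parse the value yourself" step depends entirely on your file layout. As a minimal sketch, assuming a hypothetical comma-delimited layout of id, name, age (the class name LineParser and these field names are illustrative, not part of any library), the parsing could look like this using only the Java standard library:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineParser {
    // Hypothetical field layout: id (long), name (string), age (int).
    private static final String[] FIELD_NAMES = {"id", "name", "age"};

    /** Splits one comma-delimited line into typed, named field values. */
    public static Map<String, Object> parse(String line) {
        // The -1 limit keeps trailing empty fields so the count check is reliable.
        String[] parts = line.split(",", -1);
        if (parts.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_NAMES.length + " fields but found " + parts.length);
        }
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put(FIELD_NAMES[0], Long.parseLong(parts[0].trim()));
        fields.put(FIELD_NAMES[1], parts[1].trim());
        fields.put(FIELD_NAMES[2], Integer.parseInt(parts[2].trim()));
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> fields = parse("42,alice,30");
        // In the mapper, each entry would become myGenericRecord.put(name, value).
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Inside map, you would call something like parse(value.toString()) and then put each entry on the GenericRecord. Malformed lines throw here; in a real job you may prefer to count and skip them instead.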
In the run method, set the Job configuration as usual, including the input path, output path, and the mapper class. Set the number of reduce tasks to 0, because this is a map-only job.
job.setNumReduceTasks(0);
Set the output format class to Parquet's AvroParquetOutputFormat class, which converts the Avro GenericRecords you create into the Parquet columnar format. It needs to know your Avro Schema.
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, myAvroSchema);
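For reference, an Avro schema is usually written as JSON. The following is only an illustrative sketch for a hypothetical three-field record; the record name, namespace, and field names are assumptions and should be replaced with whatever your text file actually contains:

```json
{
  "type": "record",
  "name": "MyRecord",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
```

A schema like this can be turned into a Schema object with Avro's new Schema.Parser().parse(...), which accepts either the JSON string or a File.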
Because AvroParquetOutputFormat translates an Avro GenericRecord into a Parquet Group object, you'll need to set the output value class to Group (and the output key class to Void, as the keys will all be null).
job.setOutputKeyClass(Void.class);
job.setOutputValueClass(Group.class);
Yes, the conversion is text file -> Avro -> Parquet. Your map method controls the conversion from a text file to Avro, and AvroParquetOutputFormat handles the conversion from Avro to Parquet.
Upvotes: 11