Pradeep

Reputation: 860

HDFS text file to Parquet format using a MapReduce job

I am trying to convert an HDFS text file to Parquet format using MapReduce in Java. I am new to this and have been unable to find any direct references.

Should the conversion be text file -> Avro -> Parquet?

Upvotes: 3

Views: 3056

Answers (1)

rgettman

Reputation: 178263

A text file (whether in HDFS or not) has no inherent record structure. When using MapReduce, you will need an Avro Schema and a mapper function that parses each input line so that you can create an Avro GenericRecord.

Your mapper class will look something like this.

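import java.io.IOException;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TextToAvroParquetMapper
        extends Mapper<LongWritable, Text, Void, GenericRecord> {
    // mySchema is your Avro Schema; define it before this field is initialized.
    private GenericRecord myGenericRecord = new GenericData.Record(mySchema);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Parse the value yourself here,
          // calling "put" on the Avro GenericRecord,
          // once for each field.  The GenericRecord
          // object is reused for every map call.
          context.write(null, myGenericRecord);
    }
}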

The input key/value pair types are Hadoop's LongWritable and Text, and the output key/value pair types are Void (null keys) and the Avro GenericRecord itself.
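The "parse the value yourself" step depends entirely on your file layout. As a minimal sketch, assuming each line is comma-delimited with three fields named id, name and score (the layout, the field names and the DelimitedLineParser class are all hypothetical), the parsing could be kept in a small helper that uses only the standard library:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: assumes each input line looks like "id,name,score".
final class DelimitedLineParser {
    // Assumed field names; replace them with the names in your own Avro schema.
    private static final String[] FIELD_NAMES = {"id", "name", "score"};

    private DelimitedLineParser() {}

    // Splits one line into typed values keyed by field name.
    // Throws IllegalArgumentException on a wrong field count or a malformed number.
    static Map<String, Object> parse(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] parts = line.split(",", -1);
        if (parts.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_NAMES.length + " fields but found "
                + parts.length + ": " + line);
        }
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put(FIELD_NAMES[0], Long.parseLong(parts[0].trim()));
        fields.put(FIELD_NAMES[1], parts[1].trim());
        fields.put(FIELD_NAMES[2], Double.parseDouble(parts[2].trim()));
        return fields;
    }
}
```

Inside map, you could then loop over DelimitedLineParser.parse(value.toString()).entrySet() and call myGenericRecord.put(entry.getKey(), entry.getValue()) for each entry before writing the record. How to treat malformed lines (skip, count, or fail the task) is a design choice this sketch leaves open.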

In your driver's run method, configure the Job as usual, including the input path, output path, and the mapper class. Set the number of reduce tasks to 0, because this is a map-only job.

job.setNumReduceTasks(0);

Set the output format class to Parquet's AvroParquetOutputFormat class, which converts the Avro GenericRecords you create into the Parquet columnar format. It needs to know your Avro Schema.

job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, myAvroSchema);
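An Avro schema is usually written as JSON. As an illustration only, assuming records with three hypothetical fields (id, name, score) and a made-up record name, the schema text could be held in a constant like this:

```java
// Hypothetical schema holder; the record name and fields are assumptions.
final class ExampleSchema {
    private ExampleSchema() {}

    // Avro record schema in JSON form: one long, one string, one double field.
    static final String JSON =
        "{"
        + "\"type\":\"record\","
        + "\"name\":\"ExampleRecord\","
        + "\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"score\",\"type\":\"double\"}"
        + "]}";
}
```

With the Avro library on the classpath, new Schema.Parser().parse(ExampleSchema.JSON) turns this text into the Schema object that setSchema expects. Many projects keep the JSON in a separate .avsc file instead of a string constant.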

Because AvroParquetOutputFormat translates an Avro GenericRecord into a Parquet Group object, you'll need to set the output value class to Group (and the output key class to Void, as the keys will all be null).

job.setOutputKeyClass(Void.class);
job.setOutputValueClass(Group.class);

Yes, the conversion is text file -> Avro -> Parquet. Your map method controls the conversion from text to Avro, and AvroParquetOutputFormat handles the conversion from Avro to Parquet.

Upvotes: 11
