user300313

Reputation: 173

How to use CombineFileInputFormat on gzip files?

What is the best way to use CombineFileInputFormat on gzip files?

Upvotes: 1

Views: 266

Answers (1)

Vignesh I

Reputation: 2221

This article walks through building your own InputFormat on top of CombineFileInputFormat to read and process gzip files. The parts below give an idea of what needs to be done.

Custom InputFormat:

Build your own custom input format, modeled closely on CombineFileInputFormat. The key has to be your own Writable class holding the file name and offset; the value is the actual file content. Make isSplitable return false (we do not want to split the file, and a gzip file cannot be split anyway). Set the maximum split size (setMaxSplitSize) to a value that suits your requirements. Based on that value, the input format groups the files into combined splits, and CombineFileRecordReader creates an instance of your record reader for each file in a split. You have to build your own custom RecordReader and add your decompression logic to it.
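
To make the effect of the maximum split size concrete, here is a minimal, Hadoop-free sketch of how whole files might be grouped into combined splits. It is a simplified model, not the library's code: the class `CombinedSplitSketch` and the method `pack` are hypothetical names, the file sizes are made up, and the node and rack locality that the real CombineFileInputFormat takes into account is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (assumption): non-splittable files are packed, in order,
// into combined splits. Locality handling is deliberately left out.
class CombinedSplitSketch {

    // Adds whole files to the current group and closes the group once its
    // total size reaches maxSplitSize. Files are never cut, so a group may
    // end up larger than maxSplitSize.
    static List<List<Long>> pack(long[] fileLengths, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : fileLengths) {
            current.add(length);
            currentSize += length;
            if (maxSplitSize > 0 && currentSize >= maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        // Whatever is left over forms the final, possibly smaller, group.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical gzip file sizes.
        long[] gzFiles = {40 * mb, 50 * mb, 60 * mb, 10 * mb, 90 * mb};
        List<List<Long>> splits = pack(gzFiles, 128 * mb);
        System.out.println(splits.size() + " combined splits");
    }
}
```

With these sample sizes the first three files land in one split of 150 MB, which exceeds the 128 MB limit because files are never cut, and the last two files form a second split.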

Custom RecordReader:

The custom RecordReader uses LineReader and sets the key to the file name and offset and the value to the actual file content. If the file is compressed, it decompresses the file first and then reads it. Here is the extract for that:

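// Needs: java.io.{IOException, InputStream, OutputStream},
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path},
// org.apache.hadoop.io.IOUtils,
// org.apache.hadoop.io.compress.{CompressionCodec, CompressionCodecFactory}.
// path, dPath, fs and rlength are fields of the record reader.
private void codecWiseDecompress(Configuration conf) throws IOException {

    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path);

    if (codec == null) {
        // Fail the task instead of calling System.exit(), which would kill the JVM.
        throw new IOException("No codec found for " + path);
    }

    // Strip the codec extension (for example ".gz") to name the decompressed file.
    String outputUri = CompressionCodecFactory.removeSuffix(
            path.toString(), codec.getDefaultExtension());
    dPath = new Path(outputUri);

    InputStream in = null;
    OutputStream out = null;
    fs = this.path.getFileSystem(conf);

    try {
        in = codec.createInputStream(fs.open(path));
        out = fs.create(dPath);
        IOUtils.copyBytes(in, out, conf);
    } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }
    // Query the length only after the copy has succeeded and the streams are closed.
    rlength = fs.getFileStatus(dPath).getLen();
}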
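
As an alternative to writing a decompressed copy back to the file system, the compressed stream can also be read line by line while it is being decompressed. The sketch below shows that idea with `java.util.zip` from the JDK instead of Hadoop's codec classes, so that it runs without Hadoop on the classpath. `GzipLineSketch`, `readLines` and `gzip` are hypothetical names, and the offset is counted in decompressed bytes under the assumption of `\n` line endings. In a record reader the same pattern would presumably wrap `codec.createInputStream(fs.open(path))`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipLineSketch {

    // Reads a gzip stream line by line and returns one "offset<TAB>line"
    // record per line, where offset is the position in the decompressed data.
    static List<String> readLines(InputStream gzipped) throws IOException {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            long offset = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(offset + "\t" + line);
                // +1 for the '\n' terminator (assumes Unix line endings).
                offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            }
        }
        return records;
    }

    // Helper for the demo: compresses a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("alpha\nbeta\ngamma\n");
        for (String record : readLines(new ByteArrayInputStream(data))) {
            System.out.println(record);
        }
    }
}
```

Streaming avoids the extra write and the temporary file, at the cost of not knowing the decompressed length up front, which the extract above uses for `rlength`.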

Custom Writable Class:

A Writable class that holds a (file name, offset) pair and serves as the key.
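
A minimal sketch of such a key might look like the following. The name `FileOffsetKey` is hypothetical, and so that the example compiles without Hadoop on the classpath it declares only `Comparable`; in a real job the class would presumably declare `implements WritableComparable<FileOffsetKey>` from `org.apache.hadoop.io`, whose `write` and `readFields` methods have the same signatures as below.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical key class: a (file name, offset) pair.
class FileOffsetKey implements Comparable<FileOffsetKey> {
    private String fileName = "";
    private long offset;

    // No-argument constructor, needed so the key can be created empty
    // and then filled in by readFields().
    FileOffsetKey() { }

    FileOffsetKey(String fileName, long offset) {
        this.fileName = fileName;
        this.offset = offset;
    }

    // Serializes the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeLong(offset);
    }

    // Reads the fields back in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        offset = in.readLong();
    }

    // Orders by file name first, then by offset within the file.
    @Override
    public int compareTo(FileOffsetKey other) {
        int byName = fileName.compareTo(other.fileName);
        return byName != 0 ? byName : Long.compare(offset, other.offset);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FileOffsetKey)) return false;
        FileOffsetKey other = (FileOffsetKey) o;
        return fileName.equals(other.fileName) && offset == other.offset;
    }

    @Override
    public int hashCode() {
        return 31 * fileName.hashCode() + Long.hashCode(offset);
    }

    @Override
    public String toString() {
        return fileName + ":" + offset;
    }

    public static void main(String[] args) throws IOException {
        // Round trip through the serialized form.
        FileOffsetKey original = new FileOffsetKey("part-0001.gz", 42L);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));
        FileOffsetKey copy = new FileOffsetKey();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(copy + " equals original: " + copy.equals(original));
    }
}
```

Overriding `equals` and `hashCode` consistently matters for keys, because records with equal keys are expected to be grouped together.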

Upvotes: 2
