John Chrysostom

Reputation: 3973

Use gzip input codec on files without .gz extension in hadoop

I'm running a Hadoop job on a bunch of gzipped input files. Hadoop should handle this easily, as discussed in mapreduce in java - gzip input files.

Unfortunately, in my case, the input files don't have a .gz extension. I'm using CombineTextInputFormat, which runs my job fine if I point it at non-gzipped files, but I basically just get a bunch of garbage if I point it at the gzipped ones.

I've searched for quite some time, but the only thing I've turned up is somebody else asking the same question, with no answer: How to force Hadoop to unzip inputs regardless of their extension?

Anybody got anything?

Upvotes: 2

Views: 1006

Answers (2)

John Chrysostom

Reputation: 3973

Went digging in the source and built a solution for this...

You need to modify the source of the LineRecordReader class to change how it chooses a compression codec. The default version creates a Hadoop CompressionCodecFactory and calls getCodec, which parses the file path for its extension. You can instead call getCodecByClassName to obtain whatever codec you want.

You'll then need to override your input format class to make it use your new record reader. Details here: http://daynebatten.com/2015/11/override-hadoop-compression-codec-file-extension/
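For illustration, here's a minimal sketch of the idea, not the code from the post: the names ForcedGzipTextInputFormat and ForcedGzipLineRecordReader are made up, and the reader is a stripped-down stand-in for a modified LineRecordReader (it keys records by line number rather than byte offset, and extends TextInputFormat rather than CombineTextInputFormat for brevity). The key line is the getCodecByClassName call.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.LineReader;

public class ForcedGzipTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // gzip streams can't be split mid-file
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ForcedGzipLineRecordReader();
    }

    public static class ForcedGzipLineRecordReader
            extends RecordReader<LongWritable, Text> {

        private LineReader in;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();
        private long lineNo = 0;

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            Path file = split.getPath();
            FSDataInputStream raw = file.getFileSystem(conf).open(file);

            // The key change: instead of factory.getCodec(file), which sniffs
            // the (missing) file extension, ask for GzipCodec by class name.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodecByClassName(
                    "org.apache.hadoop.io.compress.GzipCodec");
            in = new LineReader(codec.createInputStream(raw), conf);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            key.set(lineNo++); // line number, not byte offset, in this sketch
            return in.readLine(value) > 0;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return 0f; } // not tracked here
        @Override public void close() throws IOException {
            if (in != null) in.close();
        }
    }
}
```

Point your job at it with job.setInputFormatClass(ForcedGzipTextInputFormat.class); adapting the same change to a CombineTextInputFormat-based reader is what the linked post walks through.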

Upvotes: 2

Ramzy

Reputation: 7138

First, gzip files are not splittable, so your map reduce job cannot divide the input along block boundaries; each file becomes a single split.

Map reduce only knows to handle a file this way when it sees the compressed file extension. Sadly, in your case, the extension is not .gz, so I am afraid map reduce cannot tell that the data is compressed at all.

So even with an option to force the codec despite the extension, you would not get good performance. Maybe it is better to uncompress the data first and then provide it to map reduce, rather than forcing map reduce to read a compressed format with reduced performance.
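If you go the decompress-first route, a one-off pass could look roughly like this (a sketch only; the class name DecompressToHdfs and the argument handling are made up):

```java
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);   // gzipped input (no .gz extension)
        Path out = new Path(args[1]);  // plain-text output

        // Look the codec up by class name, since the extension is missing
        CompressionCodec codec = new CompressionCodecFactory(conf)
                .getCodecByClassName("org.apache.hadoop.io.compress.GzipCodec");

        try (InputStream is = codec.createInputStream(fs.open(in));
             OutputStream os = fs.create(out)) {
            IOUtils.copyBytes(is, os, conf);
        }
    }
}
```

The plain-text output can then be fed to the job unchanged, and it will split normally.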

Upvotes: -1
