Reputation: 3973
I'm running a Hadoop job on a bunch of gzipped input files. Hadoop should handle this easily (see: mapreduce in java - gzip input files).
Unfortunately, in my case, the input files don't have a .gz extension. I'm using CombineTextInputFormat, which runs my job fine if I point it at non-gzipped files, but I just get a bunch of garbage if I point it at the gzipped ones.
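For reference, the relevant part of my driver looks roughly like this (paths and class names simplified, identity mapper just for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzippedInputJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "gzipped-input-job");
    job.setJarByClass(GzippedInputJob.class);

    // Combine many small input files into fewer splits
    job.setInputFormatClass(CombineTextInputFormat.class);

    job.setMapperClass(Mapper.class); // identity mapper, for illustration
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // The files in the input directory are gzipped but carry no .gz suffix
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```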
I've tried searching for quite some time, but the only thing I've turned up is somebody else asking the same question as I have, with no answer: "How to force Hadoop to unzip inputs regardless of their extension?"
Anybody got anything?
Upvotes: 2
Views: 1006
Reputation: 3973
Went digging in the source and built a solution for this...
You need to modify the source of the LineRecordReader class to change how it chooses a compression codec. The default version creates a Hadoop CompressionCodecFactory and calls getCodec, which parses the file path for its extension. You can instead use getCodecByClassName to obtain any codec you want.
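To give the idea, here's a minimal sketch of such a reader. The class name ForcedGzipLineRecordReader is mine, and the real LineRecordReader also tracks progress and handles split boundaries, which I've left out:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Record reader that always decompresses with gzip, no matter
// what extension (if any) the input file carries.
public class ForcedGzipLineRecordReader extends RecordReader<LongWritable, Text> {
  private LineReader in;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();
  private long pos = 0;

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream fileIn = fs.open(file);

    // Instead of factory.getCodec(file), which inspects the extension,
    // ask for the gzip codec explicitly by class name.
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec =
        factory.getCodecByClassName("org.apache.hadoop.io.compress.GzipCodec");

    in = new LineReader(codec.createInputStream(fileIn), conf);
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    key.set(pos);
    int bytesRead = in.readLine(value); // 0 bytes read means end of stream
    if (bytesRead == 0) {
      return false;
    }
    pos += bytesRead;
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() { return 0.0f; } // not meaningful for gzip
  @Override public void close() throws IOException { if (in != null) in.close(); }
}
```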
You'll then need to override your input format class to make it use your new record reader. Details here: http://daynebatten.com/2015/11/override-hadoop-compression-codec-file-extension/
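Something along these lines (again just a sketch; since gzip streams can't be split, it also marks files as unsplittable):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Input format that hands out the forced-gzip reader above.
public class ForcedGzipTextInputFormat extends TextInputFormat {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new ForcedGzipLineRecordReader();
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Gzip streams can't be split, so each file must be read whole.
    return false;
  }
}
```

Then point your job at it with job.setInputFormatClass(ForcedGzipTextInputFormat.class). Note this sketch is based on plain TextInputFormat; to keep the CombineTextInputFormat behavior from the question, you'd wrap the reader in a CombineFileRecordReader instead.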
Upvotes: 2
Reputation: 7138
First, gzip files are not splittable, so MapReduce cannot split them along block boundaries; each file is processed whole by a single mapper.

MapReduce decides whether a file is compressed by looking at its extension. Sadly, in your case the extension is not .gz, so the framework has no way to know it needs to decompress the data.

So even if you force the framework to recognize the files as gzip, you would not get good performance, because each file still goes to one mapper. Why not uncompress the data first and then provide plain text to MapReduce, rather than forcing it to read an unsplittable compressed format with reduced performance?
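For example, a one-off pass like this (a sketch; the paths are placeholders) could rewrite the files as plain text on HDFS before the job runs:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// One-off decompression pass: read each gzipped input file from HDFS
// and write a plain-text copy that MapReduce can split normally.
public class DecompressInputs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // e.g. /data/gzipped
    Path out = new Path(args[1]);  // e.g. /data/plain
    fs.mkdirs(out);

    for (FileStatus status : fs.listStatus(in)) {
      if (status.isFile()) {
        try (InputStream gz = new GZIPInputStream(fs.open(status.getPath()));
             OutputStream plain = fs.create(new Path(out, status.getPath().getName()))) {
          IOUtils.copyBytes(gz, plain, conf, false);
        }
      }
    }
  }
}
```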
Upvotes: -1