Reputation: 53
I will copy a lot of big weblog files compressed as gzip into Hadoop, and I will need to run many map/reduce jobs on these files.
To my understanding, only one map/reduce task will run per file, since a gzip file cannot be split. In my case that's not acceptable, because we need these jobs to complete as quickly as possible.
Is it common practice to split gzip files into smaller chunks (before or after copying them into Hadoop) so that as many map/reduce tasks as possible can run in parallel?
Thanks for your help.
Upvotes: 1
Views: 2443
Reputation: 686
You can use lzop to generate LZO-compressed copies of your files. Although the compression ratio is lower than gzip's, LZO decompresses very fast.
Something like:
gunzip --stdout file.gz | lzop -o file.lzo
should work.
Copy the .lzo file into HDFS, then install hadoop-lzo and use it to generate an index for the LZO file:
hadoop jar (path to hadoop-lzo jar) com.hadoop.compression.lzo.LzoIndexer file.lzo
(you can also use com.hadoop.compression.lzo.DistributedLzoIndexer if you like)
This will create an index for the LZO file.
Hadoop will then (given the right input format) use the index when generating splits for MapReduce jobs, so a single .lzo-compressed file can be distributed across multiple mappers/reducers.
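As a rough sketch of the job side (a minimal, map-only driver against Hadoop's newer MapReduce API; LzoTextInputFormat comes from the hadoop-lzo jar, and the pass-through mapper, paths, and class names here are just placeholders):

// Sketch only: assumes the hadoop-lzo jar is on the classpath and that
// /logs/file.lzo has already been indexed with LzoIndexer as shown above.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoLogDriver {

    // Trivial pass-through mapper; replace with your real log-processing logic.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process lzo logs");
        job.setJarByClass(LzoLogDriver.class);

        // With the .lzo.index file present, LzoTextInputFormat generates one
        // split per index block instead of one split per file, so several
        // mappers can work on the same compressed file in parallel.
        job.setInputFormatClass(LzoTextInputFormat.class);

        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only for this sketch
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/logs/file.lzo"));
        FileOutputFormat.setOutputPath(job, new Path("/logs/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The key point is setInputFormatClass(LzoTextInputFormat.class): with the index alongside the file, splits fall on index block boundaries rather than the whole file being handed to one mapper.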
There is more detailed info here:
https://github.com/twitter/hadoop-lzo
and a fork of that repo here that addresses some issues:
https://github.com/kevinweil/hadoop-lzo
Upvotes: 2
Reputation: 112239
I'm still not clear on your question, so I will answer this question and you can let me know if I'm close:
How can I use the map/reduce paradigm to decompress a large gzip file?
Unless the gzip file has been specially prepared for this, it is not possible to map out the decompression job; decompression must be done serially. Even though bzip2-compressed data is already in separately decompressible blocks, you can't find those blocks without having already decompressed the whole thing, serially, to locate them, which probably defeats the purpose.
You mention a "container" format for LZO, which, if I understand you correctly, would work as well for gzip and bzip2.
For any of these formats, you can prepare a stream for parallel decompression by compressing it in pieces, e.g. a megabyte or a few megabytes per piece so as not to significantly degrade compression, and by maintaining an index to those pieces, constructed at the time of compression and transmitted or stored along with the compressed data file.
A concatenation of gzip streams is itself a valid gzip stream that decompresses to the concatenation of the decompressions of the individual streams. The same is true for the bzip2 format. For bzip2, the pieces should be a multiple of 900K of uncompressed data so as not to have partial blocks, which are less efficient in compression ratio.
You can then construct such a gzip or bzip2 file and keep a list of the file offsets of the start of each gzip or bzip2 stream within it. Then you can map out those pieces, where the reduce step would simply concatenate the uncompressed results in the correct order.
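A rough, self-contained sketch of that idea outside Hadoop itself, using only java.util.zip (the file names and the ~1 MB piece size are arbitrary placeholders): compress the input as independent gzip members, record where each member starts, and decompress any single piece from its offset:

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ChunkedGzip {
    static final int PIECE_SIZE = 1 << 20; // ~1 MB of uncompressed data per piece

    // Read up to buf.length bytes, looping because read() may return short counts.
    static int readChunk(InputStream in, byte[] buf) throws IOException {
        int off = 0, n;
        while (off < buf.length && (n = in.read(buf, off, buf.length - off)) > 0) {
            off += n;
        }
        return off;
    }

    // Write `input` as back-to-back independent gzip members into `output`,
    // returning the byte offset at which each member starts (the index).
    static List<Long> compressInPieces(Path input, Path output) throws IOException {
        List<Long> offsets = new ArrayList<>();
        byte[] buf = new byte[PIECE_SIZE];
        try (InputStream in = Files.newInputStream(input);
             FileOutputStream fos = new FileOutputStream(output.toFile())) {
            int n;
            while ((n = readChunk(in, buf)) > 0) {
                offsets.add(fos.getChannel().position()); // start of this member
                ByteArrayOutputStream piece = new ByteArrayOutputStream();
                try (GZIPOutputStream gz = new GZIPOutputStream(piece)) {
                    gz.write(buf, 0, n);
                } // close() finishes this gzip member (header + trailer)
                fos.write(piece.toByteArray());
            }
        }
        return offsets;
    }

    // Decompress the single piece stored in [start, end); because each piece is
    // a complete gzip stream, a mapper can do this without touching the rest.
    static byte[] decompressPiece(Path file, long start, long end) throws IOException {
        byte[] compressed = new byte[(int) (end - start)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            raf.readFully(compressed);
        }
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws Exception {
        Path in = Paths.get("weblog.txt");  // hypothetical input
        Path out = Paths.get("weblog.gz");  // hypothetical output
        List<Long> index = compressInPieces(in, out);
        long fileEnd = Files.size(out);
        // Decompress piece 0 by itself; a map task would do the same for its piece.
        long end = index.size() > 1 ? index.get(1) : fileEnd;
        byte[] piece0 = decompressPiece(out, index.get(0), end);
        System.out.println("piece 0 uncompressed bytes: " + piece0.length);
    }
}

In a real job the offset list would be stored alongside the file (much as hadoop-lzo stores its index), and each map task would be handed one (start, end) pair.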
Upvotes: 1