modular

Reputation: 1099

Easiest efficient way to zip output of hadoop mapreduce

I can gzip MapReduce output by setting

"mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"

Would it be straightforward to implement a zip codec for Hadoop? Zip is a container format, but I need only one file per archive, so would it be easy to create a ZipCodec that implements the CompressionCodec interface?

Or is there an efficient way to convert gz files to zip files, since both can use the same deflate algorithm?

Upvotes: 3

Views: 1595

Answers (1)

Thomas Jungblut

Reputation: 20969

No big deal: you can wrap a java.util.zip.ZipOutputStream.

To do this, implement your own codec by extending org.apache.hadoop.io.compress.DefaultCodec.

In this codec, wrap the Java zip streams by extending org.apache.hadoop.io.compress.CompressorStream for output and org.apache.hadoop.io.compress.DecompressorStream for input.

Finally, override the createInputStream and createOutputStream methods and return new instances of the wrapped streams from them.
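
Here is a rough sketch of just the stream-wrapping part, using only the JDK so it runs without Hadoop on the classpath. The names SingleEntryZip, ZipOut, ZipIn and the entry name "part-00000" are my own assumptions for illustration, not Hadoop classes. In a real codec the two wrappers would extend the Hadoop stream classes instead of FilterOutputStream/FilterInputStream and be returned from createOutputStream and createInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class SingleEntryZip {

    /** Hypothetical wrapper: everything written goes into one zip entry. */
    static final class ZipOut extends FilterOutputStream {
        ZipOut(OutputStream raw, String entryName) throws IOException {
            super(new ZipOutputStream(raw));
            // Open the single entry up front so callers can just write bytes.
            ((ZipOutputStream) out).putNextEntry(new ZipEntry(entryName));
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            // Delegate in bulk; FilterOutputStream would write byte by byte.
            out.write(b, off, len);
        }
        // close() is inherited: closing the ZipOutputStream finishes the
        // open entry and writes the central directory.
    }

    /** Hypothetical wrapper: reads the first entry of a zip archive. */
    static final class ZipIn extends FilterInputStream {
        ZipIn(InputStream raw) throws IOException {
            super(new ZipInputStream(raw));
            // Position the stream on the first (and only expected) entry.
            if (((ZipInputStream) in).getNextEntry() == null) {
                throw new IOException("zip archive contains no entries");
            }
        }
    }

    static byte[] zip(byte[] data, String entryName) throws IOException {
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        try (OutputStream o = new ZipOut(archive, entryName)) {
            o.write(data);
        }
        return archive.toByteArray();
    }

    static byte[] unzip(byte[] archive) throws IOException {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        try (InputStream i = new ZipIn(new ByteArrayInputStream(archive))) {
            byte[] buf = new byte[4096];
            for (int n; (n = i.read(buf)) != -1; ) {
                data.write(buf, 0, n);
            }
        }
        return data.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "key1\tvalue1\nkey2\tvalue2\n".getBytes(StandardCharsets.UTF_8);
        byte[] archive = zip(payload, "part-00000");
        // Round trip: the bytes read back should equal the bytes written.
        System.out.println(Arrays.equals(payload, unzip(archive)));
    }
}
```

The entry is opened in the constructor because a zip archive needs a named entry before any data can be written; which name to use for each reducer output is left open here.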

That is still a bit of coding. I'm pretty sure an implementation already exists somewhere (I seem to recall one was even in a Hadoop release years ago).

Upvotes: 3
