Robert Wilkinson

Reputation: 41

How to get compressed (text) output from a streaming Hadoop job

I have read and tried every example I could find for what seems like a straightforward problem: given a set of uncompressed text files, run a processing step on them and output a set of compressed files with the results. To keep things simple, this example uses cat as the processing step.

Everything I found suggests this should work:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmap.output.compress=true \
    -Dmap.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -mapper /bin/cat \
    -reducer NONE \
    -input /path_to_uncompressed \
    -output /path_to_compressed

The job runs normally, but outputs plain text files. I have tried varying the input file sizes, changing the codec (Snappy, BZip2, etc.), adding a reducer, setting mapred.output.compression.type (BLOCK, RECORD), etc., and the result is always the same. For reference, I am using a fresh install of CDH 4.1.2.

Upvotes: 4

Views: 8113

Answers (3)

dslee

Reputation: 1

In Cloudera Manager, go to Services > Service mapreduce > Configuration > TaskTracker > Compression and set:

  • Compress MapReduce Job Output (Client Override) : Don't override client configuration

Upvotes: 0

user2359936

Reputation: 1

I work for Cloudera and came across this post. I just wanted to let you know that Cloudera Manager 4.5 (the version I confirmed), at least, has the option to NOT override the client configuration, in addition to overriding it to true or to false. This is ideal because you can change that setting to let the developer choose whether or not to compress output. Hope that helps--I know this was a while ago now. :)

Upvotes: -2

Lorand Bendig

Reputation: 10650

The following works on Hadoop v1.0.0:

This will produce a gzipped output:

hadoop jar /home/user/hadoop/path_to_jar/hadoop-streaming-1.0.0.jar \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -D mapreduce.job.reduces=0 \
    -mapper /bin/cat \
    -input /user/hadoop/test/input/test.txt \
    -output /user/hadoop/test/output

A block-compressed SequenceFile as output:

hadoop jar /home/user/hadoop/path_to_jar/hadoop-streaming-1.0.0.jar \
    -D mapred.output.compress=true \
    -D mapred.output.compression.type=BLOCK \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -D mapreduce.job.reduces=0 \
    -mapper /bin/cat \
    -input /user/hadoop/test/input/test.txt \
    -output /user/hadoop/test/output \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat

Note the order of the parameters as well as the space between -D and the property name.
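One quick way to confirm the job really produced gzip output (and not plain text with a .gz extension) is to fetch a part file out of HDFS and check the gzip magic bytes. A minimal sketch in Python; the part file name below is only an example:

```python
def is_gzipped(path):
    """Return True if the file at `path` begins with the gzip magic bytes (0x1f 0x8b)."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Usage (after fetching a part file from HDFS), e.g.:
#   hdfs dfs -get /user/hadoop/test/output/part-00000.gz .
#   is_gzipped("part-00000.gz")   # True for real gzip output
```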

In the case of YARN, many of these properties have been deprecated (see the complete list here). You therefore have to make the following changes:

  • mapred.output.compress -> mapreduce.output.fileoutputformat.compress
  • mapred.output.compression.codec -> mapreduce.output.fileoutputformat.compress.codec
  • mapred.output.compression.type -> mapreduce.output.fileoutputformat.compress.type
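With the newer property names, the first command above would look roughly like this (the jar path and input/output paths are illustrative; this is a sketch, not something I have run against every YARN release):

```shell
hadoop jar /path/to/hadoop-streaming.jar \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    -D mapreduce.job.reduces=0 \
    -mapper /bin/cat \
    -input /user/hadoop/test/input/test.txt \
    -output /user/hadoop/test/output
```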

Upvotes: 7
