Reputation: 41
I have read and tried every example I could find for what seems like a straightforward problem. Assume there is a set of uncompressed text files and that I want to run a processing step on them, then output a set of compressed files with the results. To keep things simple, this example uses cat as the processing step.
Everything I found suggests this should work:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-Dmap.output.compress=true \
-Dmap.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-mapper /bin/cat \
-reducer NONE \
-input /path_to_uncompressed \
-output /path_to_compressed
The job runs normally, but outputs plain text files. I have tried varying the input file sizes, varying the codec (Snappy, BZip2, etc.), adding a reducer, setting mapred.output.compression.type (BLOCK, RECORD), etc., and the result is always the same. For reference, I am using a fresh install of CDH 4.1.2.
Upvotes: 4
Views: 8113
Reputation: 1
In Cloudera Manager, go to Services > Service mapreduce > Configuration > TaskTracker > Compression
Upvotes: 0
Reputation: 1
I work for Cloudera and came across this post. I just wanted to let you know that Cloudera Manager 4.5 (the version I confirmed), at least, has the option to NOT override the client config, in addition to overriding the client config to true or to false. This makes it ideal, as you can change that setting to let the developer choose whether or not to compress output. Hope that helps; I know this was a while ago now. :)
Upvotes: -2
Reputation: 10650
The following works on Hadoop v1.0.0:
This will produce a gzipped output:
hadoop jar /home/user/hadoop/path_to_jar/hadoop-streaming-1.0.0.jar \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.job.reduces=0 \
-mapper /bin/cat \
-input /user/hadoop/test/input/test.txt \
-output /user/hadoop/test/output
This will produce a block-compressed SequenceFile as output:
hadoop jar /home/user/hadoop/path_to_jar/hadoop-streaming-1.0.0.jar \
-D mapred.output.compress=true \
-D mapred.output.compression.type=BLOCK \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.job.reduces=0 \
-mapper /bin/cat \
-input /user/hadoop/test/input/test.txt \
-output /user/hadoop/test/output \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
Note the order of the parameters as well as the space between -D and the property name.
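To confirm the output really is gzip-compressed, check the magic bytes of a part file. The snippet below is a sketch that simulates a part file locally; on a real cluster you would first fetch it with hdfs dfs -get (the file name part-00000.gz is an assumption — streaming names its outputs part-NNNNN):

```shell
# Simulate a part file pulled from HDFS; on a real cluster,
# replace this line with: hdfs dfs -get /user/hadoop/test/output/part-00000.gz
printf 'some mapper output\n' | gzip -c > part-00000.gz

# Every gzip file begins with the magic bytes 1f 8b.
head -c 2 part-00000.gz | od -An -tx1
```

If the job silently ignored the compression settings, the part files will instead start with your plain-text data.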
Under YARN, many of these properties have been deprecated (see the complete list here). You therefore have to make the following changes:

mapred.output.compress -> mapreduce.output.fileoutputformat.compress
mapred.output.compression.codec -> mapreduce.output.fileoutputformat.compress.codec
mapred.output.compression.type -> mapreduce.output.fileoutputformat.compress.type
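Putting the renamed properties together, the first command above would look like this on a YARN cluster. This is a sketch: the streaming jar path and the HDFS input/output paths are assumptions that will differ per installation.

```shell
# Gzip-compressed output on YARN, using the non-deprecated property names.
# Adjust the jar path and HDFS paths for your installation.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    -D mapreduce.job.reduces=0 \
    -mapper /bin/cat \
    -input /user/hadoop/test/input/test.txt \
    -output /user/hadoop/test/output
```

The same rules apply as before: the -D generic options must come before the streaming-specific options, and there must be a space between -D and the property name.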
Upvotes: 7