Canburak Tümer

Reputation: 1063

How to put file to HDFS with Snappy compression

I am working for a client where I have to put some files into HDFS with Snappy compression. My problem is that the Snappy codec is not defined in mapred-site.xml or hdfs-site.xml.

Somehow I have to put the files, preferably using the hdfs put command, and they should be compressed. There is no chance to change the configuration files, since it is a production machine and other people are using it actively.

Another suggested solution was to import the files into HDFS without compression, then create Hive external tables with compression and use their source files while deleting the uncompressed ones. But this is a long way around and it is not guaranteed to work.
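
For reference, this is roughly what that workaround would look like; the table names, paths and the SEQUENCEFILE format below are only illustrative assumptions:

# Illustrative sketch only; paths and table names are made up
hdfs dfs -mkdir -p /data/raw
hdfs dfs -put localfile.txt /data/raw/

hive -e "
CREATE EXTERNAL TABLE raw_lines (txt STRING)
LOCATION '/data/raw';

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE compressed_lines
STORED AS SEQUENCEFILE
AS SELECT * FROM raw_lines;
"

# Remove the uncompressed originals afterwards
hdfs dfs -rm -r /data/raw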

Any suggestions about using hdfs put with some kind of parameter to compress the files will be appreciated.

Upvotes: 3

Views: 6407

Answers (3)

devashish kapadia

Reputation: 1

We solved this with the following scenario:

  1. If it is an RDD, convert it to a DataFrame, e.g. rdd.toDF, which does not require parameters; in case you want to specify the column names you can do it with rdd.toDF("c1", "c2", "c3").
  2. After converting to a DataFrame, suppose you want to write it out in the Parquet file format with Snappy compression; you need to set the codec through sqlContext (the same setting can also be passed when launching the shell, see the sketch after this list):

    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

     or, for gzip compression,

    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

  3. After this, use XXDF.write.parquet("your_path"); the file will be saved with Snappy compression.
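
If the shell is launched just for this job, the same codec setting can also be supplied on the command line instead of calling setConf; a minimal sketch, assuming spark-shell is on the PATH:

# Same effect as the setConf call in step 2, applied at launch time
spark-shell --conf spark.sql.parquet.compression.codec=snappy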

Upvotes: -1

dlamblin

Reputation: 45381

Say you have a Spark event log file in HDFS that isn't compressed, but you want to turn on spark.eventLog.compress true in spark-defaults.conf and also go ahead and compress the old logs. The map-reduce approach would make the most sense, but as a one-off you can also use:

snzip -t hadoop-snappy local_file_will_end_in_dot_snappy

And then upload (put) it directly.

Installing snzip may look similar to this:

sudo yum install snappy snappy-devel
curl -O https://dl.bintray.com/kubo/generic/snzip-1.0.4.tar.gz
tar -zxvf snzip-1.0.4.tar.gz
cd snzip-1.0.4
./configure
make
sudo make install

Your round trip for a single file could be:

hdfs dfs -copyToLocal /var/log/spark/apps/application_1512353561403_50748_1 .
snzip -t hadoop-snappy application_1512353561403_50748_1
hdfs dfs -copyFromLocal application_1512353561403_50748_1.snappy /var/log/spark/apps/application_1512353561403_50748_1.snappy

Or with gohdfs:

hdfs cat /var/log/spark/apps/application_1512353561403_50748_1 \
| snzip -t hadoop-snappy > zzz
hdfs put zzz /var/log/spark/apps/application_1512353561403_50748_1.snappy
rm zzz
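
To have new event logs written compressed going forward, the spark.eventLog.compress setting mentioned at the top would have to land in spark-defaults.conf; a minimal sketch, assuming the usual $SPARK_HOME/conf location:

# The conf path is an assumption; adjust for your installation
echo "spark.eventLog.compress true" >> "$SPARK_HOME/conf/spark-defaults.conf"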

Upvotes: 1

Alex

Reputation: 8937

I suggest you write a map-reduce job to compress your data in HDFS. I don't know of a way to compress automatically on the hadoop put operation, so suppose it does not exist. One option is to put an already compressed file:

snzip file.tar
hdfs dfs -put file.tar.sz /user/hduser/test/

Another way is to compress it inside a mapreduce job. As an option, you can use the hadoop streaming jar to compress your files within HDFS:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapred.reduce.tasks=0 \
-input <input-path> \
-output $OUTPUT

Upvotes: 2
