Reputation: 1063
I am working for a client where I need to put some files into HDFS with Snappy compression. My problem is that the Snappy codec is not defined in mapred-site.xml or hdfs-site.xml.
Somehow I have to upload the files, preferably using the hdfs put command, and they should end up compressed. There is no chance to change the configuration files, since it is a production machine and other people are actively using it.
Another suggested solution was to import the files into HDFS without compression, then create Hive external tables with compression and use their source files while deleting the uncompressed originals. But this is a long way around and it is not guaranteed to work.
Any suggestions about using hdfs put with some kind of parameter to compress the files would be appreciated.
Upvotes: 3
Views: 6407
Reputation: 1
We solved this with the following approach.
RDD.toDF does not require parameters, but if you want to specify the column names you can do it with:
rdd.toDF("c1","c2","c3")
After converting to a DataFrame, suppose you want to save it as Parquet with Snappy compression; you need to set the codec via sqlContext:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
or, for gzip compression:
sqlContext.setConf("spark.sql.parquet.compression.codec","gzip")
After this, use the following command:
XXDF.write.parquet("your_path")
and it will be saved with Snappy compression.
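Putting the pieces together, here is a minimal end-to-end sketch. It assumes a Spark 1.x spark-shell session (so sc and sqlContext already exist); the sample RDD, column names, and output path are placeholders rather than anything from the question.
import sqlContext.implicits._                                        // enables .toDF on RDDs of tuples
val rdd = sc.parallelize(Seq((1, "alice", 3.14), (2, "bob", 2.72)))  // stand-in for your real data
val df = rdd.toDF("c1", "c2", "c3")                                  // name the columns explicitly
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")  // or "gzip"
df.write.parquet("/user/hduser/compressed_parquet")                  // Parquet files land in HDFS, Snappy-compressed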
Upvotes: -1
Reputation: 45381
Say you have a Spark log file in HDFS that isn't compressed, but you want to turn on spark.eventLog.compress true in spark-defaults.conf (snippet below) and also compress the old logs. The map-reduce approach would make the most sense, but as a one-off you can also use:
snzip -t hadoop-snappy local_file_will_end_in_dot_snappy
And then upload it directly with put.
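For the forward-looking half, the setting is a single line in spark-defaults.conf (assuming the standard conf location); it only affects event logs written from then on, so existing logs still need the round trip shown further down:
# $SPARK_HOME/conf/spark-defaults.conf
spark.eventLog.compress true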
Installing snzip may look similar to this:
sudo yum install snappy snappy-devel
curl -O https://dl.bintray.com/kubo/generic/snzip-1.0.4.tar.gz
tar -zxvf snzip-1.0.4.tar.gz
cd snzip-1.0.4
./configure
make
sudo make install
Your round trip for a single file could be:
hdfs dfs -copyToLocal /var/log/spark/apps/application_1512353561403_50748_1 .
snzip -t hadoop-snappy application_1512353561403_50748_1
hdfs dfs -copyFromLocal application_1512353561403_50748_1.snappy /var/log/spark/apps/application_1512353561403_50748_1.snappy
Or with gohdfs:
hdfs cat /var/log/spark/apps/application_1512353561403_50748_1 \
| snzip -t hadoop-snappy > zzz
hdfs put zzz /var/log/spark/apps/application_1512353561403_50748_1.snappy
rm zzz
Upvotes: 1
Reputation: 8937
I suggest you write a map-reduce job to compress your data in HDFS. I don't know of a way to compress automatically during a hadoop put operation, and I assume it does not exist. One option is to put an already compressed file:
snzip file.tar
hdfs dfs -put file.tar.sz /user/hduser/test/
Another way is to compress it inside a MapReduce job. As an option, you can use the Hadoop streaming jar with an identity mapper (/bin/cat), so the data passes through unchanged and only the output compression settings take effect:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapred.reduce.tasks=0 \
-input <input-path> \
-output $OUTPUT \
-mapper /bin/cat
Upvotes: 2