Reputation: 346
What would be the most optimized compression logic for Parquet files when using Spark? Also, what would be the approximate size of a 1 GB Parquet file after compression with each compression type?
Upvotes: 3
Views: 17661
Reputation: 11
In my case, compression seemed to have increased the file size, so it essentially made the file larger and, for me, unreadable. Parquet, if not fully understood and used on small files, can perform really poorly. So, I would advise you to switch to the Avro file format if you can.
Upvotes: 0
Reputation: 9067
It depends on what kind of data you have; text usually compresses very well, while random timestamps or float values do not compress as well.
Have a look at this presentation from the latest Apache Big Data conference, especially slides 15-16, which show the compression results per column on a test dataset.
[The rest of the presentation is about the theory and practice of compression applied to the Parquet internal structure]
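If you want to see how your own data behaves, one quick way is to write the same DataFrame once per codec and compare the resulting directory sizes. The sketch below is only a rough illustration: it assumes a recent Spark with a SparkSession named spark, a DataFrame df you already have, and a throwaway output path, none of which come from this thread.
// Rough sketch: write the same DataFrame with several Parquet codecs and
// compare on-disk sizes. `spark`, `df`, and the output path are assumptions.
import org.apache.hadoop.fs.{FileSystem, Path}

val codecs = Seq("uncompressed", "snappy", "gzip")
val basePath = "/tmp/parquet_codec_test"  // hypothetical scratch location

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

codecs.foreach { codec =>
  val out = s"$basePath/$codec"
  // Per-write codec option; overrides spark.sql.parquet.compression.codec
  df.write.option("compression", codec).mode("overwrite").parquet(out)

  val bytes = fs.getContentSummary(new Path(out)).getLength
  println(f"$codec%-12s -> ${bytes / 1024.0 / 1024.0}%.2f MB")
}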
Upvotes: 0
Reputation: 135
Refer here for the size difference between all the compressed and uncompressed formats.
Upvotes: 0
Reputation: 241
You can try the steps below to compress a Parquet file in Spark:
Step 1: Set the compression type by configuring the spark.sql.parquet.compression.codec property:
sqlContext.setConf("spark.sql.parquet.compression.codec","codec")
Step 2: Specify the codec value. The supported codec values are: uncompressed, gzip, lzo, and snappy. The default is gzip.
Then create a DataFrame, say Df, from your data and save it using the command below:
Df.write.parquet("path_destination")
If you check the destination folder now, you will be able to see that the files have been stored with the compression type you specified in Step 2 above.
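Putting those steps together, here is a minimal end-to-end sketch (using the older SQLContext API from the snippet above; the sample rows are made up for illustration and "path_destination" is the same placeholder as in the answer):
// Minimal sketch: set the Parquet codec, build a small DataFrame, and write it.
// The sample data and "path_destination" are placeholders, not real values.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-compression"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Steps 1 and 2: choose one of uncompressed, gzip, lzo, or snappy
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// Create a DataFrame and save it; the files in the destination folder
// will use the codec configured above
val Df = Seq((1, "alpha"), (2, "beta"), (3, "gamma")).toDF("id", "value")
Df.write.parquet("path_destination")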
Please refer to the link below for more details: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/spark_parquet.html
Upvotes: -1