spark_dream

Reputation: 346

Parquet file compression

What would be the most optimized compression logic for Parquet files when using them in Spark? Also, what would be the approximate size of a 1 GB Parquet file after compression with each compression type?

Upvotes: 3

Views: 17661

Answers (4)

Kartikey Garg

Reputation: 11

In my case compression seemed to have increased the file size. So it has essentially made the file larger and unreadable. Parquet, if not fully understood and used on small files, can really suck. So I would advise you to switch to the Avro file format if you can.

Upvotes: 0

Samson Scharfrichter

Reputation: 9067

It depends on what kind of data you have; text usually compresses very well, random timestamp or float values not so well.

Have a look at this presentation from the latest Apache Big Data conference, especially slides 15-16, which show the compression results per column on a test dataset.
[The rest of the presentation is about the theory & practice of compression applied to the Parquet internal structure]
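
If you want actual numbers for your own data, one rough sketch (it assumes a Spark 1.x sqlContext, an existing DataFrame df, and local-filesystem output paths; on HDFS you would check sizes with hdfs dfs -du instead, and the paths here are only placeholders) is to write the same DataFrame once per codec and compare the output sizes:

import java.io.File

// total size of the part files written for one codec (local filesystem only)
def dirSize(path: String): Long =
  new File(path).listFiles.filter(_.isFile).map(_.length).sum

for (codec <- Seq("uncompressed", "snappy", "gzip")) {
  sqlContext.setConf("spark.sql.parquet.compression.codec", codec)
  val out = s"/tmp/parquet_$codec"
  df.write.parquet(out)
  println(s"$codec: ${dirSize(out)} bytes")
}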

Upvotes: 0

saranvisa

Reputation: 135

Refer here for the size difference between all the compressed & uncompressed formats:

  1. ORC: If you create an ORC table in Hive, you can't insert into it from Impala, so you have to INSERT in Hive followed by REFRESH table_name in Impala
  2. Avro: To my knowledge, it is the same as ORC
  3. Parquet: You can create the table in Hive and insert into it from Impala

Upvotes: 0

Hiranya Deka

Reputation: 241

You can try the below steps to compress a Parquet file in Spark:

Step 1: Set the compression type by configuring the spark.sql.parquet.compression.codec property:

sqlContext.setConf("spark.sql.parquet.compression.codec","codec")

Step 2: Specify the codec value. The supported codec values are: uncompressed, gzip, lzo, and snappy. The default is gzip.

Then create a DataFrame, say Df, from your data and save it using the below command:

Df.write.parquet("path_destination")

If you check the destination folder now, you will be able to see that the files have been stored with the compression type you specified above.
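Putting the two steps together, a minimal sketch in Scala (assuming a Spark 1.x sqlContext and an illustrative JSON input path; the names and paths are only placeholders for your own data) could look like this:

// set the codec (one of: uncompressed, gzip, lzo, snappy)
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// build a DataFrame from your data and write it out as Parquet
val Df = sqlContext.read.json("/path/to/input.json")  // placeholder input path
Df.write.parquet("path_destination")                  // output files are snappy-compressed

In newer Spark versions (2.x and later) you can also set the codec for a single write with Df.write.option("compression", "snappy").parquet("path_destination").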

Please refer to the below link for more details: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/spark_parquet.html

Upvotes: -1
