Reputation: 308
Before posting this question, I went through similar questions (e.g. this one) but didn't find a convincing answer to my issue.
In my existing ETL code, I have upgraded from Spark 2.3.1 to Spark 3.4, along with the Hadoop, Hive, and YARN transitive dependencies.
The code is unchanged and the source XML feeds are the same.
The ETL processes the source XML and writes Parquet files with the LZO compression codec to an S3 bucket.
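For context, here is a minimal sketch of the write path (not my actual ETL code; the bucket names, paths, and rowTag below are placeholders):

```scala
// Minimal sketch of the write path; paths and rowTag are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("xml-to-parquet-etl")
  .getOrCreate()

// Read the source XML feed via the spark-xml data source.
val df = spark.read
  .format("xml")
  .option("rowTag", "record")
  .load("s3a://source-bucket/feeds/")

// Write Parquet with the LZO codec to S3.
df.write
  .option("compression", "lzo")
  .parquet("s3a://target-bucket/parquet/")
```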
My issue is that when I compare the Parquet files generated by Spark 2.3.1 with those generated by Spark 3.4, I observe a size difference of 2 to 3 KB.
Please note again that the code is unchanged and the source XML feeds are the same.
In some cases the file generated by Spark 2.3.1 is 2 to 3 KB smaller than the one generated by Spark 3.4, and in other cases it is the opposite.
So my question is: is this because LZO compression behaves differently between Spark 2.3.1 and Spark 3.4?
Or is it due to a difference in the Parquet encoder between Spark 2.3.1 and Spark 3.4?
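To make the comparison concrete, this is roughly how I could dump the per-column encodings, codec, and compressed sizes from the footers of the two files and diff them. It is a sketch using parquet-hadoop's `ParquetFileReader` (bundled with Spark); the file path is a placeholder:

```scala
// Compare Parquet footers: per-column encodings, codec, and sizes.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.jdk.CollectionConverters._ // Scala 2.13-style converters

def dumpColumnStats(file: String): Unit = {
  val input  = HadoopInputFile.fromPath(new Path(file), new Configuration())
  val reader = ParquetFileReader.open(input)
  try {
    // Each row group ("block") lists its column chunks and their metadata.
    reader.getFooter.getBlocks.asScala.foreach { block =>
      block.getColumns.asScala.foreach { col =>
        println(s"${col.getPath} " +
          s"encodings=${col.getEncodings} " +
          s"codec=${col.getCodec} " +
          s"compressed=${col.getTotalSize} " +
          s"uncompressed=${col.getTotalUncompressedSize}")
      }
    }
  } finally reader.close()
}

dumpColumnStats("s3a://target-bucket/parquet/part-00000.parquet")
```

If the uncompressed sizes differ, the encoder is producing different pages; if only the compressed sizes differ, the compression side is the more likely cause.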
Please suggest.
Upvotes: 0
Views: 55