John Humanyun

Reputation: 945

How does Parquet file size change with the row count of a Spark Dataset?

I came across a scenario where I had a Spark Dataset with 24 columns, grouping by the first 22 columns and summing the last two. A rough sketch of the query is shown below.
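For context, the query looked roughly like this sketch (the input path is hypothetical, and the key/measure columns are picked by position since only the column counts are given in the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("GroupBySumSketch").getOrCreate()

// Hypothetical input path; the dataset is assumed to have 24 columns,
// the first 22 being grouping keys and the last 2 numeric measures.
val df = spark.read.parquet("/path/to/input")

val keyCols = df.columns.take(22).map(df(_))
val sumCols = df.columns.takeRight(2).map(c => sum(c).as(c))

// Original query: group by the first 22 columns, sum the last two.
val grouped = df.groupBy(keyCols: _*).agg(sumCols.head, sumCols.tail: _*)
grouped.write.parquet("/path/to/grouped")

// Modified query: no group-by, all 24 columns selected as-is.
df.write.parquet("/path/to/ungrouped")
```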

I then removed the group-by from the query, so all 24 columns are now selected as-is. The initial count of the dataset was 79,304.

After removing the group-by, the count increased to 138,204, which is expected since the rows are no longer aggregated.

But the behavior I don't understand is this: the initial Parquet file size was 2.3 MB, yet after removing the group-by it dropped to 1.5 MB. Can anyone please help me understand this?

Also, the size does not shrink every time. In a similar scenario with 22 columns, the count was 35,298,226 before and 59,874,208 after removing the group-by, and there the size increased from 466.5 MB to 509.8 MB.

Upvotes: 2

Views: 1900

Answers (1)

RefiPeretz

Reputation: 563

With Parquet, file size is not about the number of rows but about the data itself. Parquet is a columnar format: it stores and compresses data column by column. So what matters is not the row count but the diversity of values within each column.

Parquet's compression is limited by the most diverse column in the table. For example, a one-column dataframe will compress well when the values in that column are close to each other or highly repetitive, and poorly when they are highly varied.
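A quick way to see this is to write two single-column datasets with the same row count but very different value diversity and compare the resulting file sizes on disk (a minimal sketch; the output paths are arbitrary):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, rand}

val spark = SparkSession.builder().appName("CompressionSketch").getOrCreate()

// Same row count, very different value diversity.
val base = spark.range(1000000L)

// Low diversity: a single repeated value encodes and compresses extremely well.
base.select(lit(42L).as("v")).write.mode("overwrite").parquet("/tmp/low_diversity")

// High diversity: near-random doubles leave little for the encoder to exploit.
base.select(rand().as("v")).write.mode("overwrite").parquet("/tmp/high_diversity")

// Comparing the two output directories on disk should show the low-diversity
// file is far smaller, even though both contain exactly the same number of rows.
```

This is why removing a group-by can shrink a file: the duplicated rows it reintroduces may make columns more repetitive and therefore more compressible, while in other datasets the extra rows add enough distinct values that the file grows instead.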

Upvotes: 5
