Abhishek Malik

Reputation: 577

Why does a sorted parquet file have a larger size than a non-sorted one?

I have a dataframe created as follows:

import random
import pandas as pd

# `points` and `prices` are predefined lists of candidate values
expanded_1 = pd.DataFrame({"Point": [random.choice(points) for x in range(30000000)],
                           "Price": [random.choice(prices) for x in range(30000000)]
                          })

that I stored as a Parquet file; its size on disk is 90.2 MB.

After researching how Parquet compression works, I sorted the values by Point so that similar data is kept together, with the understanding that this would let the default Parquet compression be more efficient on runs of identical values. However, the result I saw was quite the opposite. After running the following:

expanded_1.sort_values(by=['Point']).to_parquet('/expanded_1_sorted.parquet')

the resulting file was 211 MB in size.

What is causing the size increase?

Upvotes: 3

Views: 1052

Answers (1)


Reputation: 10203

I think it's the scrambled index, and reset_index(drop=True) seems to fix it: instead of being much bigger, the sorted file became much smaller (about half the unsorted original) when I tested with points = prices = range(1000).

Or, as @0x26res points out, .sort_values(by=['Point'], ignore_index=True) is more efficient: the index is never scrambled in the first place, so there is nothing to fix afterwards. The resulting file is the same size.

Upvotes: 3
