Yash

Reputation: 1008

Partitioned parquet file takes more space and more time to query

Theoretically, a Parquet file should take less space than a CSV and return query results faster. My experiment shows the opposite.

I am converting a CSV file to a Parquet file partitioned on the "city" field.

The conversion takes 7 minutes.

The Parquet folder is 48 MB, while the CSV is 2.5 MB.

Querying the Parquet data with a filter on "city" takes 350 ms, while the same query on the CSV takes 111 ms.

The code is here https://github.com/yashgt/Samples/blob/master/Parquet.ipynb

The executed notebook in PDF form is here https://github.com/yashgt/Samples/raw/master/parquet.pdf

What am I doing wrong?

Upvotes: 0

Views: 581

Answers (1)

Iftach Schonbaum

Reputation: 1

You should run this test on a much larger dataset to see the expected results. Parquet is a columnar storage format built for big-data analytics: every file carries a significant amount of metadata, and on a dataset this small that metadata can outweigh the content itself. At this size you also get no benefit from selecting only a few columns (or even all of them) compared with the CSV.
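A back-of-envelope model of the effect described above; the overhead and partition counts are assumed, illustrative numbers, not measurements from the question's dataset:

```python
# Illustrative model: each Parquet file carries a roughly fixed metadata
# footprint, so splitting a tiny dataset into many partition files
# multiplies the overhead until it dwarfs the data itself.
DATA_BYTES = 2_500_000        # ~2.5 MB of raw data, as in the question
PER_FILE_OVERHEAD = 50_000    # assumed fixed metadata cost per file

def parquet_size(n_partitions: int) -> int:
    """Rough total size when the data is split across n partition files."""
    return DATA_BYTES + n_partitions * PER_FILE_OVERHEAD

single_file = parquet_size(1)    # overhead paid once
many_files = parquet_size(900)   # e.g. one file per distinct city
```

Under these assumed numbers, 900 partition files cost ~47.5 MB against ~2.55 MB for a single file, which is the same order of blow-up the question observes (48 MB vs 2.5 MB).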

Upvotes: 0

Related Questions