Reputation: 1008
In theory, a Parquet file should take less space than a CSV and return query results faster. My experiment shows the opposite.
I am converting a CSV file to a Parquet file partitioned on the "city" field.
The conversion takes 7 minutes.
The size of the Parquet folder is 48 MB, while the CSV is 2.5 MB.
Querying the Parquet data with a filter on "city" takes 350 ms, while the same query on the CSV takes 111 ms.
The code is here: https://github.com/yashgt/Samples/blob/master/Parquet.ipynb
The executed notebook in PDF form is here: https://github.com/yashgt/Samples/raw/master/parquet.pdf
What am I doing wrong?
Upvotes: 0
Views: 581
Reputation: 1
You should run this test on a much larger dataset to see the expected results. Parquet is a columnar storage format designed for big-data analytics. Each file carries a significant amount of metadata, and with a dataset this small that overhead can be large compared with the content itself. You also get no benefit from selecting only a few columns (or even all of them), given how small the dataset is compared with the CSV.
Upvotes: 0