Yash

Reputation: 1008

Partitioned parquet file takes more space and more time to query

Theoretically, a Parquet file should take less space than a CSV and return query results faster. My experiment shows the opposite.

I am converting a CSV file to a Parquet file partitioned on the "city" field.

The conversion takes 7 minutes.

The Parquet folder is 48 MB, while the CSV is 2.5 MB.

Querying the Parquet data with a filter on "city" takes 350 ms, while the same query on the CSV takes 111 ms.

The code is here https://github.com/yashgt/Samples/blob/master/Parquet.ipynb

The executed notebook in PDF form is here https://github.com/yashgt/Samples/raw/master/parquet.pdf

What am I doing wrong?

Upvotes: 0

Views: 581

Answers (1)

Iftach Schonbaum

Reputation: 1

You should run this test on a much larger dataset to see the expected results. Parquet is a columnar storage format built for big-data analytics: every file carries a significant amount of metadata, and on a dataset this small that metadata can outweigh the content itself. At this size you also get no benefit from selecting only a few columns (or even all of them) compared with the CSV.
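A back-of-envelope model of the effect described above; the overhead and partition counts are assumed, illustrative numbers, not measurements from the question's dataset:

```python
# Illustrative model: each Parquet file carries a roughly fixed metadata
# footprint, so splitting a tiny dataset into many partition files
# multiplies the overhead until it dwarfs the data itself.
DATA_BYTES = 2_500_000        # ~2.5 MB of raw data, as in the question
PER_FILE_OVERHEAD = 50_000    # assumed fixed metadata cost per file

def parquet_size(n_partitions: int) -> int:
    """Rough total size when the data is split across n partition files."""
    return DATA_BYTES + n_partitions * PER_FILE_OVERHEAD

single_file = parquet_size(1)    # overhead paid once
many_files = parquet_size(900)   # e.g. one file per distinct city
```

Under these assumed numbers, 900 partition files cost ~47.5 MB against ~2.55 MB for a single file, which is the same order of blow-up the question observes (48 MB vs 2.5 MB).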

Upvotes: 0

Related Questions