VectorXY

Reputation: 409

Reading from parquet is slower than CSV - PySpark

I'm new to PySpark. I'm trying to optimize my program's execution time in local mode. I've read somewhere that saving a dataframe to parquet and then loading it again, before doing any transformations on it, reduces execution time, although I don't understand why. Also, in the process of doing that I noticed that it takes longer to load from parquet than from CSV. I repartitioned and coalesced my data to minimize the loading times in the example below (8 partitions, data size: roughly 400 kB).
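For reference, here is a minimal sketch of the write-to-parquet-and-reload step described above. The file paths and the column name used in the transformation are hypothetical placeholders, not from the original code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[8]").getOrCreate()

    # Hypothetical input path; schema inference mimics a typical CSV load.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # Materialize the dataframe to parquet, then read it back before any
    # transformations. This truncates the lineage and lets Spark read a
    # columnar, compressed format with an embedded schema.
    df.coalesce(8).write.mode("overwrite").parquet("data_parquet")
    df = spark.read.parquet("data_parquet")

    # Hypothetical transformation on the reloaded dataframe.
    result = df.groupBy("some_column").count()
    result.show()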

My main question is: are there any guidelines on how to increase my program's performance, and why does parquet take longer to load than CSV?

Here is an example: [screenshot of load timings for parquet vs CSV]

This is my config:

spark.driver.bindAddress: localhost
spark.ui.port: 4040
spark.driver.memory: "12g"
spark.driver.memoryOverhead: 4096
spark.sql.shuffle.partitions: 8
spark.default.parallelism: 8
spark.master: "local[8]"
# spark.sql.analyzer.failAmbiguousSelfJoin: false
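Applied programmatically, that config would look roughly like the sketch below (a rough equivalent, not the original setup; note that driver memory settings in local mode generally only take effect if set before the JVM starts, e.g. via spark-defaults or spark-submit):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[8]")
        .config("spark.driver.bindAddress", "localhost")
        .config("spark.ui.port", "4040")
        .config("spark.driver.memory", "12g")          # must be set before the JVM launches
        .config("spark.driver.memoryOverhead", "4096")
        .config("spark.sql.shuffle.partitions", "8")
        .config("spark.default.parallelism", "8")
        .getOrCreate()
    )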

Upvotes: 2

Views: 1306

Answers (1)

pltc

Reputation: 6082

You can't conclude that reading from parquet is slower than reading from CSV based on a single run with a very small dataset. The result needs to be consistent and fair: when running on your local computer, you need to make sure no other tasks are running (and taking memory) while you benchmark (for example, an OS update could be running in the background while you're reading the parquet file). My suggestion is to benchmark it repeatedly, using a bigger dataset if possible, on a dedicated machine that isn't being disrupted by any other tasks.
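A minimal sketch of such a repeated benchmark is below; the paths are hypothetical placeholders, and `count()` is used simply to force the read to actually happen:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[8]").getOrCreate()

    def time_read(fmt, path, runs=5):
        """Average wall-clock time of a full read + count over several runs."""
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            if fmt == "csv":
                df = spark.read.csv(path, header=True, inferSchema=True)
            else:
                df = spark.read.parquet(path)
            df.count()  # trigger the actual scan
            timings.append(time.perf_counter() - start)
        return sum(timings) / len(timings)

    print("csv    :", time_read("csv", "data.csv"))
    print("parquet:", time_read("parquet", "data_parquet"))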

Upvotes: 1
