Reputation: 69
I am currently using PySpark on a local Windows 10 system. The PySpark code itself runs quite fast, but saving the PySpark dataframe to a CSV file takes a lot of time.
I am converting the PySpark dataframe to pandas and then saving it to a CSV file. I have also tried using the write method to save the CSV file.
Full_data.toPandas().to_csv("Level 1 - {} Hourly Avg Data.csv".format(yr), index=False)
Full_data.repartition(1).write.format('com.databricks.spark.csv').option("header", "true").save("Level 1 - {} Hourly Avg Data.csv".format(yr))
Both approaches took about an hour to save the CSV file. Is there a faster way to save a CSV file from a PySpark dataframe?
Upvotes: 6
Views: 4635
Reputation: 5700
In both of the reported examples you are reducing the level of parallelism.
In the first example, toPandas() is, computationally speaking, like calling collect(): you gather the entire dataframe into a collection on the driver, making the job single threaded.
In the second example you are calling repartition(1), which reduces the level of parallelism to 1, again making it single threaded.
Try repartition(2) instead (or 4 or 8... according to the number of execution threads available on your machine). That should produce results faster by leveraging Spark's parallelism, even though it will split the output into multiple files, one per partition.
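For instance, here is a minimal sketch of that approach; the repartition factor of 4, the overwrite mode, and the output path are assumptions to adapt to your setup, and Spark 2.x's built-in CSV writer is used in place of the external com.databricks.spark.csv package:

# Repartition to roughly the number of available execution threads,
# then write in parallel. Note the output path becomes a directory
# containing one part-*.csv file per partition.
Full_data.repartition(4) \
    .write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv("Level 1 - {} Hourly Avg Data".format(yr))

If a single file is still required, the part files can be concatenated outside Spark afterwards, which keeps the write itself parallel.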
Upvotes: 8