Mohit Batra

Reputation: 31

Is there any way to accelerate the caching process in PySpark?

I am trying to cache a PySpark DataFrame with 3 columns and 27 rows, and this process takes around 7-10 seconds.

Is there any way to speed this job up?

Thanks in advance!

Upvotes: 0

Views: 139

Answers (1)

Yash Gupta

Reputation: 349

You could try either of the approaches below:

  • Coalesce your DataFrame into a single partition, e.g. df.coalesce(1), and then cache it, so Spark only has to schedule one task instead of one per partition.
  • Since your DataFrame is pretty tiny, you could load it as a pandas DataFrame, which lives entirely in driver memory; toPandas() helps with that. Don't forget to enable the Arrow setting to make the conversion faster: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") on Spark 3.x, or spark.conf.set("spark.sql.execution.arrow.enabled", "true") on older versions. A sketch of both approaches follows below.
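Here is a minimal sketch of both approaches, assuming a hypothetical df standing in for the 3-column, 27-row DataFrame in the question; the sample data and names are illustrative, not from the original post:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("small-df-cache").getOrCreate()

    # Hypothetical tiny DataFrame (3 columns, 27 rows) for illustration.
    df = spark.createDataFrame(
        [(i, f"name_{i}", i * 1.5) for i in range(27)],
        ["id", "name", "value"],
    )

    # Approach 1: collapse to a single partition before caching, so the
    # cache is materialised by one task instead of one per partition.
    df_single = df.coalesce(1).cache()
    df_single.count()  # an action is needed to actually populate the cache

    # Approach 2: for data this small, pull it to the driver as a pandas
    # DataFrame. Enabling Arrow speeds up the Spark-to-pandas conversion.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Spark 3.x
    # spark.conf.set("spark.sql.execution.arrow.enabled", "true")        # older 2.x name
    pdf = df.toPandas()
    print(pdf.head())

Note that cache() is lazy: the 7-10 seconds you see is mostly the cost of the first action that fills the cache, so subsequent reads of the cached (or pandas) data should be close to instant.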

Upvotes: 2
