Adil Blanco

Reputation: 666

How to cache an augmented dataframe using PySpark

I have a large dataframe that grows with each transformation, and I need to optimize the execution time. My question: should I call cache() after each transformation?

partitions=100
df = df.repartition(partitions, "uuid").cache()

df_aug = tran_1(df).cache()
df_aug = tran_2(df_aug).cache()
.
.
df_aug = tran_n(df_aug)

Upvotes: 0

Views: 296

Answers (2)

anvy elizabeth

Reputation: 130

The data is only cached once an action is triggered, and you are calling cache() after every transformation, which is not required. Cache after the first transformation, trigger it with a small action, and then reuse the cached data in the subsequent transformations.

df.cache()               # marks df for caching (lazy, nothing is stored yet)
df.count()               # small action that materializes the cache
df_aug = tran_1(df)      # subsequent transformations read the cached data
df_aug = tran_2(df_aug)

This approach will perform better.

Upvotes: 1

Daniel

Reputation: 1242

Caching is not a magic bullet for performance - in your scenario it will likely slow everything down. Caching is a good idea when you access the same dataset multiple times (e.g. during data exploration). If you only chain transformations on a single dataset, caching adds serialization and storage overhead even though the data is read just once.
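A minimal sketch of that distinction (the data, column names, and output path below are illustrative assumptions, not from the question):

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("value", F.rand())  # hypothetical data

# Caching pays off when the SAME dataset feeds several actions:
explored = df.filter(F.col("value") > 0.5).cache()
explored.count()                        # first action materializes the cache
explored.agg(F.avg("value")).show()     # reuses the cached partitions
explored.groupBy((F.col("value") * 10).cast("int")).count().show()

# Caching every step of a single linear pipeline mostly adds
# serialization/storage overhead, because each intermediate is read once:
result = (df
          .withColumn("a", F.col("value") * 2)   # like tran_1
          .withColumn("b", F.col("a") + 1))      # like tran_2
result.write.mode("overwrite").parquet("/tmp/result")  # one action, no cache needed

(Imports assumed: from pyspark.sql import SparkSession and from pyspark.sql import functions as F.)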

Upvotes: 1
