Adil Blanco

Reputation: 666

How to cache an augmented dataframe using PySpark

I have a large dataframe that grows with each transformation, and I need to optimize the execution time. My question: should I call cache() after each transformation?

partitions=100
df = df.repartition(partitions, "uuid").cache()

df_aug = tran_1(df).cache()
df_aug = tran_2(df_aug).cache()
.
.
df_aug = tran_n(df_aug)

Upvotes: 0

Views: 296

Answers (2)

anvy elizabeth

Reputation: 130

The data is only cached once an action is triggered, and you are calling cache() after every transformation, which is not required. Cache after the first transformation, trigger it with a small action, and then reuse the cached data in the subsequent transformations.

df.cache()               # marks df for caching (lazy, nothing is stored yet)
df.count()               # small action that materializes the cache
df_aug = tran_1(df)      # subsequent transformations read the cached data
df_aug = tran_2(df_aug)

This approach will perform better.

Upvotes: 1

Daniel

Reputation: 1242

Caching is not a magic bullet for performance - in your scenario it will likely slow everything down. Caching is a good idea when you access the same dataset multiple times (e.g. during data exploration). If you only chain transformations on a single dataset, caching adds serialization and storage overhead even though the data is read just once.
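A minimal sketch of that distinction (the data, column names, and output path below are illustrative assumptions, not from the question):

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("value", F.rand())  # hypothetical data

# Caching pays off when the SAME dataset feeds several actions:
explored = df.filter(F.col("value") > 0.5).cache()
explored.count()                        # first action materializes the cache
explored.agg(F.avg("value")).show()     # reuses the cached partitions
explored.groupBy((F.col("value") * 10).cast("int")).count().show()

# Caching every step of a single linear pipeline mostly adds
# serialization/storage overhead, because each intermediate is read once:
result = (df
          .withColumn("a", F.col("value") * 2)   # like tran_1
          .withColumn("b", F.col("a") + 1))      # like tran_2
result.write.mode("overwrite").parquet("/tmp/result")  # one action, no cache needed

(Imports assumed: from pyspark.sql import SparkSession and from pyspark.sql import functions as F.)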

Upvotes: 1
