Reputation: 666
I have a large DataFrame that grows with each transformation, and I need to optimize the execution time. Should I call cache() after each transformation?
partitions = 100
df = df.repartition(partitions, "uuid").cache()
df_aug = tran_1(df).cache()
df_aug = tran_2(df_aug).cache()
.
.
df_aug = tran_n(df_aug)
Upvotes: 0
Views: 296
Reputation: 130
cache() is lazy: the data is only stored in the cache once an action runs.
You are calling cache() after every transformation, which is not required.
Instead, cache once early on, then run a small action such as count() to populate the cache.
The subsequent transformations will then read from the cached data.
df.cache()
df.count()
df_aug = tran_1(df)
df_aug = tran_2(df_aug)
This approach should be more efficient than caching after every transformation.
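The same idea can be wrapped in a small helper. This is only a sketch: cache_then_transform is a hypothetical name, not a Spark API, and it assumes each transformation is a function that takes a DataFrame and returns a DataFrame, as tran_1 and tran_2 appear to do in the question.

```python
from functools import reduce


def cache_then_transform(df, transformations):
    """Cache df once, materialize it, then apply each transformation in order.

    Hypothetical helper: it only relies on df having cache() and count(),
    which PySpark DataFrames provide.
    """
    df = df.cache()  # lazy: only marks the DataFrame for caching
    df.count()       # small action that actually populates the cache
    # Chain the transformations without caching the intermediate results.
    return reduce(lambda acc, transform: transform(acc), transformations, df)
```

With the names from the question, a call might look like df_aug = cache_then_transform(df.repartition(partitions, "uuid"), [tran_1, tran_2]). Whether this beats not caching at all depends on how often the cached DataFrame is actually reused, so it is worth timing both variants.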
Upvotes: 1
Reputation: 1242
Caching is not a magic bullet for performance; in your scenario it will likely slow everything down. Caching is a good idea when you access the same dataset multiple times (e.g. in data exploration). If you apply a chain of transformations to a dataset and each intermediate result is read only once, caching just adds serialization and storage overhead with no reuse to pay for it.
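To make the trade-off concrete, here is a toy model in plain Python. It is not Spark code: LazyDataset and expensive are made-up names that only mimic lazy evaluation, counting how often the underlying computation runs with and without a cache.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the rows
        self._use_cache = False
        self._cached = None       # filled in by the first action after cache()
        self.computations = 0     # how many times the computation was run

    def cache(self):
        # Like Spark's cache(), this only marks the dataset; nothing runs yet.
        self._use_cache = True
        return self

    def collect(self):
        # An "action": reuse the cached rows if present, otherwise recompute.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows   # the extra storage that caching costs
        return rows


def expensive():
    return [x * x for x in range(5)]


# Read twice without a cache: the computation runs twice.
uncached = LazyDataset(expensive)
uncached.collect()
uncached.collect()

# Read twice with a cache: the computation runs once.
cached = LazyDataset(expensive).cache()
cached.collect()
cached.collect()

# Read once with a cache: still one computation, plus the storage cost.
single_read = LazyDataset(expensive).cache()
single_read.collect()

print(uncached.computations, cached.computations, single_read.computations)  # 2 1 1
```

In the single-read case the cache saves no computation at all, which is the situation described above for a linear chain of transformations.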
Upvotes: 1