Reputation: 153
I want to understand the behavior of caching in PySpark.

Is df.cache() in any way different from df = df.cache()?

Is it absolutely necessary to unpersist the cached DataFrame at the end of program execution? I understand it is cleared by Spark based on a least-recently-used (LRU) mechanism. What could be the negative impacts if I don't unpersist a DataFrame? I can think of out-of-memory issues, but I would like more input.

Is it possible that, when I use df = df.cache(), a re-execution of the program uses the old cached data rather than recalculating and overwriting the cached DataFrame?
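For reference, a minimal sketch of what I mean (the DataFrame and paths here are only illustrative, not my actual code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Illustrative DataFrame only; my real job builds df from input files
df = spark.range(1_000_000)

# Variant 1: call cache() without reassigning
df.cache()

# Variant 2: reassign the result of cache()
df = df.cache()

df.count()      # action that materializes the cache
df.count()      # later action that should reuse the cached data

df.unpersist()  # is this explicit cleanup strictly necessary?
spark.stop()
```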
Upvotes: 1
Views: 483
Reputation: 15258
There is no need to unpersist at the end; stopping Spark will clear the cached DataFrames. You cannot carry a cache over from one Spark execution to another. If you want to "persist" data from one Spark execution to the next, the only solution is to physically save your data (write it) and read it again at the next execution.
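For example, something along these lines (the paths and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-across-runs").getOrCreate()

# Within a single run, a cache only lives as long as the SparkSession.
df = spark.read.parquet("/data/input")        # placeholder path
result = df.groupBy("key").count()

# To reuse the result in a later execution, write it out...
result.write.mode("overwrite").parquet("/data/saved_result")

# ...and in the next execution, read it back instead of recomputing:
result = spark.read.parquet("/data/saved_result")
```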
Upvotes: 1