Somu Sinhhaa

Reputation: 153

Pyspark: Need to understand the behaviour of cache in pyspark

I want to understand the behavior of cache in pyspark

  1. Is df.cache() in any way different from df = df.cache()? (See the sketch after this list for the pattern I mean.)

  2. Is it absolutely necessary to unpersist the cached dataframe at the end of program execution? I understand it is cleared by Spark based on a least-recently-used (LRU) mechanism. What are the possible negative impacts if I don't unpersist a dataframe? I can think of out-of-memory issues, but I'd like other inputs.

  3. Is it possible that when I use df = df.cache(), a re-execution of the program uses the old cached data rather than recalculating and overwriting the cached dataframe?
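For reference, the pattern I am asking about looks roughly like this (the input path is just a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-question").getOrCreate()
df = spark.read.csv("/tmp/input.csv", header=True, inferSchema=True)  # placeholder input

# Variant 1: call cache() without reassigning
df.cache()

# Variant 2: reassign the return value of cache()
df = df.cache()

df.count()       # trigger an action so the data actually gets cached
df.unpersist()   # explicit cleanup at the end of the job
```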

Upvotes: 1

Views: 483

Answers (1)

Steven

Reputation: 15258

There is no need to unpersist at the end: stopping Spark clears the cached dataframes. You cannot keep a cache from one Spark execution to the next. If you want to "persist" data from one Spark run to another, the only solution is to physically save it (write) and read it again at the next execution.
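A minimal sketch of that write-then-read approach might look like this (the path and app name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# First run: compute the dataframe and write it out so it survives this application.
df = spark.range(1_000_000).withColumnRenamed("id", "value")
df.write.mode("overwrite").parquet("/tmp/saved_df")   # placeholder path

# Next run: read it back instead of recomputing.
df = spark.read.parquet("/tmp/saved_df")
df = df.cache()    # cache() only helps within the current Spark application
df.count()         # the first action materializes the cache
```

Within a single application, cache() still saves recomputation, but the cached blocks never outlive spark.stop().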

Upvotes: 1
