Tommy_SK

Reputation: 87

cache() in a PySpark DataFrame

I have a DataFrame and I need to apply several transformations to it. I was thinking of performing all of them on the same DataFrame. So, if I need to use cache, should I cache the DataFrame after every action performed on it?

from pyspark.sql import functions as f
from pyspark.sql.functions import regexp_replace, to_date

df=df.selectExpr("*","explode(area)").select("*","col.*").drop(*['col','area'])
df.cache()
df=df.withColumn('full_name',f.concat(f.col('first_name'),f.lit(' '),f.col('last_name'))).drop('first_name','last_name')
df.cache()
df=df.withColumn("cleaned_map", regexp_replace("date", "[^0-9T]", "")).withColumn("date_type", to_date("cleaned_map", "ddMMyyyy")).drop('date','cleaned_map')
df.cache()
df=df.filter(df.date_type.isNotNull())
df.show()

Should I add it like this, or is caching once enough?

I would also like to know: if I use multiple DataFrames instead of one for the above code, should I cache at every transformation? Thanks a lot!

Upvotes: 2

Views: 9973

Answers (1)

dsk

Reputation: 2011

The answer is simple: whether you write df = df.cache() or df.cache(), the call applies to the underlying RDD of that particular DataFrame. Once you perform any further transformation, it creates a new DataFrame (and a new RDD), which is evidently not cached, so it is up to you to decide which DataFrame/RDD you want to cache(). Also, try to avoid unnecessary caching, since the cached data will be persisted in memory.
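As a minimal sketch of that point (the DataFrame and column names below are made up purely for illustration): caching marks only the DataFrame you call it on, and a transformation applied afterwards returns a new, uncached DataFrame.

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

cached = df.withColumn("doubled", f.col("value") * 2).cache()  # this DataFrame is marked for caching
cached.count()   # the first action materialises the cache
cached.show()    # later actions on the same DataFrame reuse the cached data

filtered = cached.filter(f.col("doubled") > 2)  # a new DataFrame -- not cached unless you cache it too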

Below is the source code for cache() from the Spark documentation:

def cache(self): 
    """ 
    Persist this RDD with the default storage level (C{MEMORY_ONLY_SER}). 
    """ 
    self.is_cached = True 
    self.persist(StorageLevel.MEMORY_ONLY_SER) 
    return self 
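Applied to the code in the question, each intermediate cache() asks Spark to keep in memory data that the rest of the pipeline only reads once, so it mostly wastes memory. If the final DataFrame is going to be used by more than one action, a single cache() on it, just before those actions, is usually enough. A sketch reusing the question's own column names:

df = (df.selectExpr("*", "explode(area)")
        .select("*", "col.*")
        .drop("col", "area")
        .withColumn("full_name", f.concat(f.col("first_name"), f.lit(" "), f.col("last_name")))
        .drop("first_name", "last_name")
        .withColumn("cleaned_map", regexp_replace("date", "[^0-9T]", ""))
        .withColumn("date_type", to_date("cleaned_map", "ddMMyyyy"))
        .drop("date", "cleaned_map")
        .filter(f.col("date_type").isNotNull()))
df.cache()   # cache once, on the DataFrame that will actually be reused
df.show()    # the first action materialises the cache; later actions on df reuse it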

Upvotes: 5
