Reputation: 668
Say I have a dataframe:
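# wrap each line in a tuple so createDataFrame can infer a one-column schema
rdd = sc.textFile(file).map(lambda line: (line,))
df = sqlContext.createDataFrame(rdd)
df.cache()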
and I add a column
from pyspark.sql.functions import lit
df = df.withColumn('c1', lit(0))
I want to use df repeatedly. So do I need to re-cache() the dataframe, or does Spark automatically do it for me?
Upvotes: 9
Views: 6681
Reputation: 11577
You will have to re-cache the dataframe every time you transform it, because each transformation returns a new dataframe that is not itself cached. However, the entire dataframe doesn't have to be recomputed.
df = df.withColumn('c1', lit(0))
In the above statement, a new dataframe is created and reassigned to the variable df. But this time only the new column is computed, and the rest is retrieved from the cache.
Upvotes: 9