Jay

Reputation: 668

PySpark: do I need to re-cache a DataFrame?

Say I have a dataframe:

rdd = sc.textFile(file)
df = sqlContext.createDataFrame(rdd)
df.cache()

and I add a column

from pyspark.sql.functions import lit

df = df.withColumn('c1', lit(0))

I want to use df repeatedly. So do I need to re-cache() the dataframe, or does Spark automatically do it for me?

Upvotes: 9

Views: 6681

Answers (1)

rogue-one

Reputation: 11577

You will have to re-cache the DataFrame every time you manipulate/change it. However, the entire DataFrame does not have to be recomputed.

df = df.withColumn('c1', lit(0))

In the statement above, a new DataFrame is created and reassigned to the variable df. This new DataFrame is not cached, but when it is evaluated only the new column is computed; the rest of the data is read back from the cache of the original DataFrame.
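
A minimal sketch of that behaviour (assuming a Spark 2.x+ SparkSession and toy in-memory data in place of the question's sc.textFile(file)):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# toy data standing in for the file-based DataFrame in the question
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
df.cache()
df.count()                        # materialize the cache

df = df.withColumn('c1', lit(0))  # new, uncached DataFrame
print(df.is_cached)               # False

df.cache()                        # re-cache if you will reuse it repeatedly
df.count()                        # only 'c1' is computed; parent rows come from the cache

Note that is_cached reports False for the derived DataFrame until you cache it yourself; Spark does not propagate the cache to new DataFrames automatically.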

Upvotes: 9
