Reputation: 495
I am new to Spark and could use some guidance here. We have some basic code to read in a CSV, cache it, and write it out to Parquet:
1. val df = sparkSession.read.options(options).schema(schema).csv(path)
2. val dfCached = df.withColumn(...).orderBy(someCol).cache()
3. dfCached.write.partitionBy(partitioning).parquet(outputPath)
AFAIK, once we invoke the parquet call (an action), the cache should be populated, saving the state of the DF before the action is applied.
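For reference, here is a minimal, runnable version of the snippet; the schema, column names, options, and paths are hypothetical placeholders, not the actual values from our job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()
    import spark.implicits._

    // Placeholder schema for illustration only.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("category", StringType),
      StructField("value", StringType)
    ))

    // 1. Lazy: nothing is read from disk yet.
    val df = spark.read
      .options(Map("header" -> "true"))
      .schema(schema)
      .csv("/path/to/input.csv")

    // 2. Also lazy: cache() only marks the plan for caching, it does not
    //    materialize anything by itself.
    val dfCached = df
      .withColumn("value_upper", upper($"value"))
      .orderBy($"category")
      .cache()

    // 3. The write is the action that triggers the read, the sort, the cache
    //    population, and the Parquet output.
    dfCached.write
      .partitionBy("category")
      .parquet("/path/to/output")

    spark.stop()
  }
}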
In the Spark UI I see a cache job (from the cache() call in #2 above) and a parquet job (from the parquet() call). The parquet job has 2 stages: one which seems to repeat the caching step and a second which performs the conversion to parquet (see images below). Why do I have both a caching Job and a caching Stage? I would expect to have only one or the other, but it seems like we are caching twice here.
Upvotes: 2
Views: 472
Reputation: 495
I'm not 100% sure, but it seems that the following is happening:
When the CSV data is loaded, it is split among the worker nodes. We call cache(), and each node stores the data it received in memory. This is the first caching job.
When we call partitionBy(...), the data needs to be regrouped among the executors based on the arguments passed to the function. Since we are caching, and data has moved from one executor to another, the shuffled data needs to be re-cached. This is confirmed by the second caching stage showing some shuffle write data. Furthermore, the caching stage shows fewer tasks than the initial caching job, possibly because only the shuffled data needs to be re-cached rather than the entire data frame.
Finally, the parquet stage is invoked. We can see some shuffle read data, which shows the executors reading the newly shuffled data; a rough way to inspect this is sketched below.
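Not part of the question itself, but a minimal sketch of how to poke at this, assuming the dfCached name from the snippet above (the exact plan text varies by Spark version):

// Hypothetical inspection step; dfCached is the cached DataFrame from the question.
// storageLevel reports whether cache() has been requested for this Dataset, and
// explain() prints the plan, where InMemoryRelation / InMemoryTableScan and
// Exchange nodes line up with the caching and shuffle work that shows up as
// separate jobs/stages in the Spark UI.
println(dfCached.storageLevel) // e.g. StorageLevel(disk, memory, deserialized, 1 replicas)
dfCached.explain(true)         // "true" also prints the logical plans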
Upvotes: 1