Is it useful to cache a DataFrame that never runs an action itself but is used by other DataFrames that do, in Apache Spark?

Question

What happens when I cache a DataFrame in memory that never has an action called on it directly, but that is used by other DataFrames on which actions are called?

import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkS: SparkSession = SparkSession.builder().getOrCreate()

val dataFrameA: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathA)
  .filter(condition)
  .cache() // lazy: marks dataFrameA for caching; the first action that uses it fills the cache

val dataFrameB: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathB)

val dataFrameC: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathC)

// dataFrameA is never the target of an action, but every action below reads it.
val resultB = dataFrameB.crossJoin(dataFrameA)
resultB.count()
resultB.show()

val resultC = dataFrameC.crossJoin(dataFrameA)
resultC.count()
resultC.show()

Will dataFrameA actually be cached?

rbtadinada · Accepted Answer

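Yes, it can be helpful to cache dataFrameA since its output is used in multiple places. cache() is lazy, so dataFrameA is stored the first time an action reads it, and later actions reuse the stored rows instead of re-reading and re-filtering the CSV file.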

But to take a step back: even if you never call an action on dataFrameA itself, it is still computed whenever an action runs on a DataFrame derived from it. When you write Spark code, you give Spark a chain of "transformations" that eventually ends in an "action". Spark translates those transformations and the action into an execution plan. It does not matter which object you call the action on: as long as the data is used in the computation, it is part of the execution plan.
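As a rough illustration of the point above, here is a minimal sketch that can be pasted into a Scala REPL with Spark on the classpath. It assumes a local session and generated rows in place of the CSV files from the question; the names session, small, other and joined are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Assumption: a local session and generated rows stand in for the CSV inputs.
val session: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("cache-sketch")
  .getOrCreate()

// Rows 1, 2, 3 filtered down to 2 and 3, then marked for caching.
val small: DataFrame = session.range(1, 4).toDF("a").filter(col("a") > 1).cache()
// Rows 0 and 1.
val other: DataFrame = session.range(2).toDF("b")

// cache() has only registered a storage level so far; no job has run for `small`.
val markedForCaching: Boolean = small.storageLevel != StorageLevel.NONE

// No action is ever called on `small` itself. The first action on `joined`
// computes it and fills the cache; the second action can reuse the cached rows.
val joined: DataFrame = other.crossJoin(small)
val rows: Long = joined.count()
joined.show()
```

Here rows is 4 (2 rows crossed with 2 rows), even though small was only ever used as an input to another DataFrame.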

If you want to see how Spark builds the execution plan, call the explain() method on your result DataFrame, for example resultB.explain().
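A small sketch of what that can look like, again assuming a local session and generated rows rather than the original CSV files (session, cachedSide and result are hypothetical names). When a cached DataFrame feeds the result, it typically shows up in the printed plan as an InMemoryRelation / InMemoryTableScan node, though the exact text varies between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session and generated rows replace the CSV inputs.
val session: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("explain-sketch")
  .getOrCreate()

val cachedSide: DataFrame = session.range(3).toDF("a").cache()
val result: DataFrame = session.range(2).toDF("b").crossJoin(cachedSide)

// Prints the physical plan only.
result.explain()

// Prints the parsed, analyzed and optimized logical plans plus the physical plan.
result.explain(true)
```

Comparing the output with and without the .cache() call is a quick way to check whether Spark plans to read the cached data.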
