Reputation: 369
What happens when I cache a DataFrame that I never call an action on directly, but that is used by other DataFrames on which I do call actions?
import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkS: SparkSession = SparkSession.builder().getOrCreate()

// dataFrameA is filtered and marked for caching; no action is called on it directly.
val dataFrameA: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathA)
  .filter(condition)
  .cache()

val dataFrameB: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathB)

val dataFrameC: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathC)

// Actions are only called on the join results, both of which use dataFrameA.
val resultB = dataFrameB.crossJoin(dataFrameA)
resultB.count()
resultB.show()

val resultC = dataFrameC.crossJoin(dataFrameA)
resultC.count()
resultC.show()
Will dataFrameA be cached?
Upvotes: 0
Views: 37
Reputation: 26
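Yes, it can be helpful to cache dataFrameA since its output is used in multiple places. Keep in mind that cache() is lazy: the data is stored the first time an action (here resultB.count()) computes it, and later actions reuse the stored copy.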
But to take a step back: even if you never call an action directly on dataFrameA, it is still computed whenever an action on a downstream DataFrame needs it. When you write Spark code, you're providing Spark with a set of "transformations" that eventually end in an "action". Spark takes the transformations and actions you provide and translates them into an execution plan. It does not matter which object you call the action on: as long as the data is used in the computation, it will be part of the execution plan.
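To make this concrete, here is a minimal sketch of the same pattern on tiny in-memory data. The object name LazyCacheSketch, the local[*] master, and the sample rows are assumptions for illustration; they stand in for the CSV files in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object LazyCacheSketch {
  // Builds small stand-in DataFrames and returns
  // (whether dataFrameA is marked for caching, row count of the cross join).
  def run(spark: SparkSession): (Boolean, Long) = {
    import spark.implicits._

    // Hypothetical in-memory data instead of the CSV files from the question.
    val dataFrameA: DataFrame = Seq(1, 2, 3).toDF("a").filter($"a" > 1).cache()
    val dataFrameB: DataFrame = Seq("x", "y").toDF("b")

    // cache() is lazy: it only marks dataFrameA, nothing has been computed yet.
    val markedForCaching = dataFrameA.storageLevel != StorageLevel.NONE

    // crossJoin is a transformation, so still nothing runs here.
    val resultB = dataFrameB.crossJoin(dataFrameA)

    // count() is the action: dataFrameA is computed as part of this job,
    // even though no action was ever called on dataFrameA itself.
    val rows = resultB.count()

    (markedForCaching, rows)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimenting; a real job would use its cluster config.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-cache-sketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

The only action in the sketch is called on resultB, yet dataFrameA has to be evaluated to produce that count, which is the same situation as in your code.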
If you want to see how Spark is creating the execution plan, you can use the explain() method on your result DataFrame.
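As a rough sketch of what that looks like, again on hypothetical in-memory data (ExplainSketch, planOf and the local[*] master are made-up names and settings, not part of the question): when dataFrameA is cached, the printed plan would be expected to read it through an InMemoryTableScan node rather than recomputing it from the source.

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  // Prints the physical plan of the join and also returns it as text.
  def planOf(spark: SparkSession): String = {
    import spark.implicits._

    // Hypothetical stand-ins for the DataFrames in the question.
    val dataFrameA = Seq(1, 2, 3).toDF("a").filter($"a" > 1).cache()
    val dataFrameB = Seq("x", "y").toDF("b")
    val resultB = dataFrameB.crossJoin(dataFrameA)

    // explain() prints the physical plan; explain(true) would also print
    // the parsed, analyzed and optimized logical plans.
    resultB.explain()

    // queryExecution is a developer-facing API, used here only to get the
    // same plan as a string.
    resultB.queryExecution.executedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-sketch")
      .getOrCreate()
    try planOf(spark) finally spark.stop()
  }
}
```

Comparing the output with and without the .cache() call is a quick way to check whether Spark plans to use the cached data.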
Upvotes: 1