Felipe Tapia

Reputation: 369

Is it useful to cache a DataFrame in Apache Spark if no action is called on it directly, but other DataFrames that use it do run actions?

What happens when I cache a DataFrame in memory if no action is ever called on it directly, but it is used by other DataFrames on which actions are called?

import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkS: SparkSession = SparkSession.builder().getOrCreate()

// pathA, pathB, pathC and condition are placeholders.

// dataFrameA is marked for caching; no action is called on it directly.
val dataFrameA: DataFrame = sparkS.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(pathA)
      .filter(condition)
      .cache()

val dataFrameB: DataFrame = sparkS.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(pathB)

val dataFrameC: DataFrame = sparkS.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(pathC)

// Both results use dataFrameA, and actions are called only on the results.
val resultB = dataFrameB.crossJoin(dataFrameA)
resultB.count()
resultB.show()

val resultC = dataFrameC.crossJoin(dataFrameA)
resultC.count()
resultC.show()

Will dataFrameA be cached?

Upvotes: 0

Views: 37

Answers (2)

Gaël J

Reputation: 15275

Yes.

If you call cache(), Spark does cache the DataFrame.
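One detail worth keeping in mind is that cache() is lazy: it only marks the DataFrame for caching, and the data is stored the first time an action computes it, whichever DataFrame that action is called on. Below is a minimal sketch of that behavior, not the asker's actual job. The local master, the app name and the small spark.range dataset are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CacheLazinessSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-laziness-sketch")
      .getOrCreate()

    // Hypothetical stand-in for dataFrameA: a filtered DataFrame marked for caching.
    val cached = spark.range(0, 1000).toDF("id").filter("id % 2 = 0").cache()

    // storageLevel reports the level requested by cache();
    // nothing has been computed or stored at this point.
    println(cached.storageLevel)

    // Hypothetical stand-in for dataFrameB.
    val other = spark.range(0, 10).toDF("other_id")

    // The first action on a DataFrame that uses `cached` computes it and stores it.
    other.crossJoin(cached).count()

    // A later action that uses `cached` can read the stored data instead of recomputing it.
    other.crossJoin(cached).show(5)

    spark.stop()
  }
}
```

Whether the data has actually been stored can be checked in the Storage tab of the Spark UI while the application is running.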

Upvotes: 0

rbtadinada

Reputation: 26

Yes, it can be helpful to cache dataFrameA since its output is used in multiple places.

But to take a step back, even if you never call an action on dataFrameA itself, it can still be computed as part of an action. When you write Spark code, you're providing Spark with a set of "transformations" that eventually end in an "action". Spark then takes the transformations and actions you provide and translates them into an execution plan. It does not matter which object you call the action on: as long as the data is used in a computation, it will be present in the execution plan.

If you want to see how Spark is creating the execution plan, you can use the explain() method on your result DataFrame.
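As a rough sketch of what that looks like, the example below builds a cached DataFrame, joins another one against it, and prints the plan. The local master, the app name and the tiny spark.range datasets are assumptions standing in for the CSV inputs in the question.

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-sketch")
      .getOrCreate()

    // Hypothetical stand-ins for dataFrameA (cached) and dataFrameB.
    val a = spark.range(0, 3).toDF("a").filter("a > 0").cache()
    val b = spark.range(0, 2).toDF("b")

    val resultB = b.crossJoin(a)

    // Prints the physical plan; explain(true) also prints the logical plans.
    resultB.explain()

    spark.stop()
  }
}
```

In the printed physical plan, the cached side of the join should show up as an InMemoryTableScan over an InMemoryRelation, which indicates that Spark intends to read it from the cache rather than recompute it from the source.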

Upvotes: 1
