JimLohse

Reputation: 1304

Apache Spark: what am I persisting here?

In this line, which RDD is being persisted? dropResultsN or dataSetN?

dropResultsN = dataSetN.map(s -> standin.call(s)).persist(StorageLevel.MEMORY_ONLY());

This question arose as a side issue from my other question, "Apache Spark timing forEach operation on JavaRDD", where I am still looking for a good answer to the core question of how best to time RDD creation.

Upvotes: 0

Views: 70

Answers (2)

JimLohse

Reputation: 1304

I found a good example of this in Learning Spark by O'Reilly:

It's Example 3-40, "persist() in Scala" (I am assuming the Java version behaves the same way):

import org.apache.spark.storage.StorageLevel

val result = input.map( x => x*x )
result.persist(StorageLevel.MEMORY_ONLY) // <your choice>: any StorageLevel can go here

NOTE in Learning Spark: Notice that we called persist() on the RDD before the first action. The persist() call on its own doesn't force evaluation.

My note: in this example the persist() call is on its own line, which I think is much clearer than the code in my question.

Upvotes: 0

jaco0646

Reputation: 17104

dropResultsN is the persisted RDD: persist() is called on the RDD returned by map(), that is, the result of applying standin.call() to each element of dataSetN. dataSetN itself is not persisted.
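To make the chaining order concrete, here is a minimal sketch that runs without Spark. ToyRdd is a hypothetical stand-in written only for this illustration; it is not Spark's API and, unlike a real RDD, its map() is eager. It only mimics the shape of the call in the question: map() returns a new object, and persist() marks and returns the object it was called on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an RDD (NOT Spark's API); it only copies the chaining shape.
final class ToyRdd<T> {
    private final List<T> data;
    private boolean persisted = false;

    ToyRdd(List<T> data) {
        this.data = data;
    }

    // Like a transformation: returns a NEW ToyRdd and leaves this one untouched.
    // (Eager here for simplicity; Spark's map() is lazy.)
    <R> ToyRdd<R> map(Function<? super T, ? extends R> f) {
        List<R> out = new ArrayList<>();
        for (T item : data) {
            out.add(f.apply(item));
        }
        return new ToyRdd<>(out);
    }

    // Marks the object it is called on as persisted and returns that same object.
    ToyRdd<T> persist() {
        persisted = true;
        return this;
    }

    boolean isPersisted() {
        return persisted;
    }
}

class PersistChainDemo {
    public static void main(String[] args) {
        ToyRdd<Integer> dataSetN = new ToyRdd<>(Arrays.asList(1, 2, 3));

        // Same shape as the question: persist() is applied to the result of map().
        ToyRdd<Integer> dropResultsN = dataSetN.map(x -> x * x).persist();

        System.out.println("dataSetN persisted: " + dataSetN.isPersisted());         // false
        System.out.println("dropResultsN persisted: " + dropResultsN.isPersisted()); // true
    }
}
```

In real Spark code, a similar check should be possible by comparing dataSetN.getStorageLevel() with dropResultsN.getStorageLevel() after the line from the question has run.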

Upvotes: 1
