Reputation: 810
Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on it, and finally call an action (count) on each of the resulting RDDs.
When the print is called, the transformations are executed; each transformation of course runs in parallel across the cluster, depending on the cluster manager.
My main question is: are the two actions and their transformations executed in parallel or in sequence? Or does errorsRDD.count() execute first and then warningsRDD.count(), in sequence?
I'm also wondering if there is any point in using persist in this example.
Upvotes: 2
Views: 1101
Reputation: 330433
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures) with a correct configuration of the in-application scheduler or explicitly limited resources for each action.
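For illustration, a minimal sketch of submitting both counts concurrently with Scala Futures. This assumes the in-application fair scheduler is enabled (e.g. spark.scheduler.mode set to FAIR) and is not from the original answer:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Submit each action from its own thread so their jobs can be
// scheduled concurrently instead of one after the other.
val errorsFuture = Future { errorsRDD.count() }
val warningsFuture = Future { warningsRDD.count() }

// Block until both jobs finish and collect the results.
val errors = Await.result(errorsFuture, 10.minutes)
val warnings = Await.result(warningsFuture, 10.minutes)
println("Errors: " + errors + ", Warnings: " + warnings)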
Regarding cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
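If the cache is kept, one option (a sketch, assuming memory may be tight) is to persist with a storage level that spills to disk rather than recomputing evicted partitions:

import org.apache.spark.storage.StorageLevel

// persist() with no arguments defaults to MEMORY_ONLY for RDDs;
// MEMORY_AND_DISK writes evicted partitions to local disk instead
// of dropping them, so they are re-read rather than recomputed.
inputRDD.persist(StorageLevel.MEMORY_AND_DISK)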
Upvotes: 2
Reputation: 1110
This will execute errorsRDD.count() first, then warningsRDD.count().
The point of using persist here is that when the first count is executed, inputRDD will be kept in memory. For the second count, Spark won't need to re-read the whole content of the file from storage again, so that count runs much faster than the first.
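To see the difference, a small sketch with a hypothetical timing helper (not part of the original answer):

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// The first action reads log.txt and fills the cache; the second
// is served from the cached inputRDD partitions.
val errors = time("first count")(errorsRDD.count())
val warnings = time("second count")(warningsRDD.count())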
Upvotes: 1