osk

Reputation: 810

Are two transformations on the same RDD executed in parallel in Apache Spark?

Let's say we have the following Scala program:

val inputRDD = sc.textFile("log.txt")
inputRDD.persist()

val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))

println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())

We create a simple RDD, persist it, perform two transformations on it, and finally run an action (count) on each of the resulting RDDs.

When the println is evaluated, the transformations are executed; each transformation is, of course, parallelized across the cluster, depending on the cluster manager.

My main question is: are the two actions (and the transformations behind them) executed in parallel or in sequence? That is, does errorsRDD.count() run first and warningsRDD.count() only after it finishes?

I'm also wondering if there is any point in using persist in this example.

Upvotes: 2

Views: 1101

Answers (2)

zero323

Reputation: 330433

All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently by submitting them in a non-blocking way (threads, Futures), combined with a correct configuration of the in-application scheduler or explicitly limited resources for each action.
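For example, here is a minimal sketch using countAsync from AsyncRDDActions (which returns a FutureAction, itself a scala.concurrent.Future), assuming the errorsRDD and warningsRDD from the question are in scope:

import scala.concurrent.Await
import scala.concurrent.duration._

// Both jobs are submitted immediately; neither call blocks.
val errorsF = errorsRDD.countAsync()
val warningsF = warningsRDD.countAsync()

// Block only when collecting the results.
val errors = Await.result(errorsF, 10.minutes)
val warnings = Await.result(warningsF, 10.minutes)
println(s"Errors: $errors, Warnings: $warnings")

Note that with the default FIFO scheduler the second job can still end up waiting behind the first if the first saturates the cluster; setting spark.scheduler.mode to FAIR (or assigning the jobs to separate scheduler pools) lets them share resources.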

Regarding the cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.

Upvotes: 2

Thang Nguyen

Reputation: 1110

This will execute errorsRDD.count() first and then warningsRDD.count(). The point of using persist here is that the first count materializes inputRDD in memory. For the second count, Spark won't have to re-read the whole file from storage again, so that count should run much faster than the first.
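A rough way to observe this (just a sketch; the actual numbers depend on the cluster, the storage, and the file size) is to time each count and compare:

// Hypothetical timing helper for the spark-shell; assumes the RDDs
// from the question are in scope.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

time("first count (reads log.txt, populates the cache)") { errorsRDD.count() }
time("second count (reads the cached inputRDD)") { warningsRDD.count() }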

Upvotes: 1
