Reputation: 810
Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on it, and finally call an action (count) on each of the resulting RDDs.
When the print is called, the transformations are executed; each transformation of course runs in parallel across the cluster, depending on the cluster manager.
My main question is: are the two actions and their transformations executed in parallel or in sequence? Or does errorsRDD.count() execute first and then warningsRDD.count(), in sequence?
I'm also wondering if there is any point in using persist in this example.
Upvotes: 2
Views: 1101
Reputation: 330433
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures) with a correct configuration of the in-application scheduler or explicitly limited resources for each action.
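For illustration, a minimal sketch of submitting both counts concurrently with Scala Futures. This assumes the in-application fair scheduler is enabled (e.g. spark.scheduler.mode set to FAIR) and is not from the original answer:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Submit each action from its own thread so their jobs can be
// scheduled concurrently instead of one after the other.
val errorsFuture = Future { errorsRDD.count() }
val warningsFuture = Future { warningsRDD.count() }

// Block until both jobs finish and collect the results.
val errors = Await.result(errorsFuture, 10.minutes)
val warnings = Await.result(warningsFuture, 10.minutes)
println("Errors: " + errors + ", Warnings: " + warnings)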
Regarding cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
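If the cache is kept, one option (a sketch, assuming memory may be tight) is to persist with a storage level that spills to disk rather than recomputing evicted partitions:

import org.apache.spark.storage.StorageLevel

// persist() with no arguments defaults to MEMORY_ONLY for RDDs;
// MEMORY_AND_DISK writes evicted partitions to local disk instead
// of dropping them, so they are re-read rather than recomputed.
inputRDD.persist(StorageLevel.MEMORY_AND_DISK)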
Upvotes: 2
Reputation: 1110
This will execute errorsRDD.count() first, then warningsRDD.count().
The point of using persist here is that when the first count is executed, inputRDD will be kept in memory. For the second count, Spark won't need to re-read the whole content of the file from storage again, so that count runs much faster than the first.
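To see the difference, a small sketch with a hypothetical timing helper (not part of the original answer):

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// The first action reads log.txt and fills the cache; the second
// is served from the cached inputRDD partitions.
val errors = time("first count")(errorsRDD.count())
val warnings = time("second count")(warningsRDD.count())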
Upvotes: 1