Reputation: 385
While using spark RDD,i know that a new Stage is created everytime we have a ShuffleRDD,but is a new stage created when we have a multiple actions?
Example:
val rdd1 = sc.textFile("<some_path").keyBy(x=>x.split(",")(1))
val rdd2 = sc.textFile("<some_path").keyBy(x=>x.split(",")(1))
val rdd3 = rdd1.join(rdd2)
rdd3.filter(x=><somecondition1>).saveAsTextFile("location1")
rdd3.filter(x=><somecondition2>).saveAsTextFile("location2")
Now Stage1 will have tasks related to rdd1,rdd2 and rdd3,then Stage2 will have both the save actions?
Upvotes: 0
Views: 627
Reputation: 105
Stage2 only has one save operation.
In you code saveAsTextFile
is an action, which will invoke spark to calculate your rdd lineage. In another word, spark will only execute this code until it found saveAsTextFile
. Then stages and tasks will be created and submitted to executors.
Since your code has two saveAsTextFile
s and you never cached any intermediate rdds, rdd1, rdd2, rdd3 will be calculated twice in this case.
Stage is a concept within Job, one action invoke one job, so there is no way in which stage contains two actions.
Upvotes: 0
Reputation: 925
I actually asked a similar question a few months ago here.
In your case, rdd3 calls a transformation. So the actions in creating rdd1 and rdd2 will happen when you declare rdd3. Subsequent transformations happen at each save (filtering, specifically), but rdd1 and rdd2 are not run again as actions.
You would have a similar effect if you cached the data before running the saves.
I don't know which version of Spark you are using, but you can find relevant information from the documentation here. It is the same for 1.6+ at least.
Upvotes: 0