ftw

Reputation: 385

Spark DAG Number of Stages

While using Spark RDDs, I know that a new stage is created every time there is a ShuffledRDD, but is a new stage also created when there are multiple actions?

Example:

val rdd1 = sc.textFile("<some_path>").keyBy(x => x.split(",")(1))
val rdd2 = sc.textFile("<some_path>").keyBy(x => x.split(",")(1))
val rdd3 = rdd1.join(rdd2)

rdd3.filter(x => <somecondition1>).saveAsTextFile("location1")
rdd3.filter(x => <somecondition2>).saveAsTextFile("location2")

Now, will Stage 1 have the tasks related to rdd1, rdd2 and rdd3, and will Stage 2 then have both of the save actions?

Upvotes: 0

Views: 627

Answers (2)

mychaint

Reputation: 105

Stage 2 will only have one of the save operations.

In your code, saveAsTextFile is an action, which triggers Spark to evaluate your RDD lineage. In other words, Spark will not execute any of this code until it reaches saveAsTextFile; only then are the stages and tasks created and submitted to the executors.

Since your code calls saveAsTextFile twice and you never cache any intermediate RDDs, rdd1, rdd2 and rdd3 will be computed twice in this case.
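A minimal sketch of the fix, reusing the placeholder paths from the question and hypothetical filter conditions in place of <somecondition1>/<somecondition2>: caching rdd3 lets the second save read the join result from the cache instead of recomputing rdd1, rdd2 and the shuffle.

val rdd1 = sc.textFile("<some_path>").keyBy(x => x.split(",")(1))
val rdd2 = sc.textFile("<some_path>").keyBy(x => x.split(",")(1))

// Persist the join result so the second action reuses it instead of
// re-reading the input files and redoing the shuffle.
val rdd3 = rdd1.join(rdd2).cache()

// Each saveAsTextFile is an action and launches its own job; the second
// job reads rdd3 from the cache. The filters below are placeholders
// standing in for the question's <somecondition1>/<somecondition2>.
rdd3.filter(x => x._1.nonEmpty).saveAsTextFile("location1")
rdd3.filter(x => x._2._1.nonEmpty).saveAsTextFile("location2")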

A stage is a concept within a job, and one action triggers exactly one job, so there is no way for a single stage to contain two actions.
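To see the one-action-one-job relationship directly, here is a rough sketch (reusing rdd3 from the question, again with hypothetical filter conditions) that tags the two saves with a job group and then asks the status tracker how many jobs ran in that group:

// Tag everything submitted from this thread with a job group.
sc.setJobGroup("two-saves", "both saveAsTextFile actions")

rdd3.filter(x => x._1.nonEmpty).saveAsTextFile("location1")
rdd3.filter(x => x._2._1.nonEmpty).saveAsTextFile("location2")

sc.clearJobGroup()

// Two actions -> two job ids, each job with its own set of stages.
val jobIds = sc.statusTracker.getJobIdsForGroup("two-saves")
println(s"jobs submitted: ${jobIds.length}")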

Upvotes: 0

tadamhicks

Reputation: 925

I actually asked a similar question a few months ago here.

In your case, rdd3 is the result of a transformation (the join). So the work of creating rdd1 and rdd2 happens when rdd3 is evaluated. The subsequent transformations (the filters, specifically) happen at each save, but rdd1 and rdd2 are not run again as separate actions.

You would have a similar effect if you cached the data before running the saves.
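For instance (a rough sketch, using the joined rdd3 from the question), you can cache it and print rdd3.toDebugString, which shows the lineage with the shuffle boundaries that become stage splits; once the first save has materialized the cache, the output also reports the cached partitions that a second save would reuse:

rdd3.cache()

// Lineage before any action: the two text files feeding the join.
println(rdd3.toDebugString)

// First action materializes the cache (hypothetical filter condition).
rdd3.filter(x => x._1.nonEmpty).saveAsTextFile("location1")

// Lineage again: the persisted rdd3 now lists its cached partitions,
// so a second save would start from the cache rather than the files.
println(rdd3.toDebugString)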

I don't know which version of Spark you are using, but you can find the relevant information in the documentation here. It is the same for 1.6+ at least.

Upvotes: 0
