Reputation: 2132
When Spark encounters an operation such as reduceByKey, a new stage is created. How does Spark recognize which operations, like reduceByKey, need to be split into a separate stage? And if I add a new operation and want it to run in another stage, how do I implement that?
Upvotes: 3
Views: 951
Reputation: 5710
There is something called pipelining. Pipelining is the process of collapsing multiple RDDs into a single stage when an RDD can be computed from its parents without any data movement (shuffling).
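For example (a minimal sketch, assuming an existing SparkContext `sc`; the variable names are illustrative):

```scala
val nums = sc.parallelize(1 to 100)

// map and filter are narrow transformations: each output partition depends
// on exactly one parent partition, so Spark pipelines them into one stage.
val pipelined = nums.map(_ * 2).filter(_ % 3 == 0)

// reduceByKey needs all values for a key on the same node, so it introduces
// a shuffle (wide dependency) and therefore a new stage.
val counts = pipelined.map(n => (n % 10, 1)).reduceByKey(_ + _)

// toDebugString prints the lineage; each shuffle boundary marks a stage split.
println(counts.toDebugString)
```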
Upvotes: 2
Reputation: 9415
Anything that causes a repartitioning of your data (redistributing rows amongst nodes) will create a new stage. Repartitioning happens mainly because a new key is elected for your RDD's rows (as reduceByKey or groupByKey do), and of course because of an explicit repartition call.
You want to avoid introducing new stages when they are not needed, because a new stage implies a reshuffle, and shuffles are expensive.
The idea is to partition your data in such a way that you make maximum use of your available resources (nodes and their CPUs), while also making sure you do not introduce skew (where one node or CPU has more queued-up work than the others).
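As a rough sketch (assuming an existing SparkContext `sc` and a hypothetical input file `data.txt`), both an explicit repartition and electing a new key force a shuffle and hence a new stage:

```scala
val rows = sc.textFile("data.txt")

// An explicit repartition always shuffles, so it always opens a new stage.
// Pick the partition count to match the cores you actually have, e.g. 8.
val spread = rows.repartition(8)

// Electing a new key and grouping by it also redistributes rows across
// nodes: another shuffle, another stage.
val grouped = spread
  .keyBy(line => line.headOption.getOrElse(' '))
  .groupByKey()

println(grouped.toDebugString) // shuffle boundaries show the stage splits
```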
Upvotes: 1
Reputation: 5010
Let us work through an example. Suppose we have a data set of cities and the temperature for each day over the past 10 years, e.g.:
New York -> [22.3, 22.4, 22.8, 32.3, ...........]
London -> [.................
Toronto -> [.................
The task is to convert these readings to Fahrenheit and then find the average for each city. This can be done as below:
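A minimal sketch (assuming the readings are Celsius and an existing SparkContext `sc`; the sample values are illustrative):

```scala
import org.apache.spark.rdd.RDD

val temps: RDD[(String, Double)] = sc.parallelize(Seq(
  ("New York", 22.3), ("New York", 22.4), ("New York", 22.8),
  ("London", 18.1), ("Toronto", 5.4)
))

// mapValues is a narrow transformation, so the conversion runs in the
// same stage that reads the data.
val fahrenheit = temps.mapValues(c => c * 9.0 / 5.0 + 32.0)

// reduceByKey shuffles all readings for a city together, which is where
// Spark cuts a new stage.
val avgByCity = fahrenheit
  .mapValues(t => (t, 1))                                 // (sum, count) pairs
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }

avgByCity.collect().foreach(println)
```

Here everything up to the reduceByKey is pipelined into one stage; the shuffle that brings each city's readings together is what starts the second stage.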
Upvotes: 3