Barcelona

Reputation: 2132

How does Spark determine which RDD operations need to be split into separate stages?

When Spark meets an operation such as reduceByKey, a new stage is created. How does Spark recognize which operations, like reduceByKey, have to be split into a separate stage? And if I add a new operation and want it to run in another stage, how do I implement that?
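For concreteness, here is a minimal sketch of the kind of job I mean (the object name, sample data, and local master are just placeholders); printing the lineage with toDebugString shows where Spark puts the stage boundary:

    import org.apache.spark.{SparkConf, SparkContext}

    object StageBoundaryDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("stage-demo").setMaster("local[*]"))

        val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
        val pairs  = words.map(w => (w, 1))       // narrow: stays in the same stage
        val counts = pairs.reduceByKey(_ + _)     // wide: shuffle dependency, new stage

        // The indented ShuffledRDD in the printed lineage marks the stage boundary.
        println(counts.toDebugString)

        sc.stop()
      }
    }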

Upvotes: 3

Views: 951

Answers (3)

Balaji Reddy

Reputation: 5710

There is something called pipelining. Pipelining is the process of collapsing multiple RDDs into a single stage when each RDD can be computed from its parents without any data movement (shuffling). For more
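As a hedged illustration of pipelining (object name and data below are mine, not from this answer): map and filter are narrow transformations, so Spark collapses them into one stage, while reduceByKey needs a shuffle and therefore starts a second stage.

    import org.apache.spark.{SparkConf, SparkContext}

    object PipeliningDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipelining").setMaster("local[*]"))

        val nums = sc.parallelize(1 to 1000)

        // map and filter are narrow: each output partition depends on exactly one
        // parent partition, so Spark pipelines them into a single stage.
        val pipelined = nums.map(_ * 2).filter(_ % 3 == 0)

        // reduceByKey needs all rows with the same key on the same partition,
        // so it introduces a shuffle dependency and therefore a new stage.
        val shuffled = pipelined.map(n => (n % 10, n)).reduceByKey(_ + _)

        println(shuffled.toDebugString)
        sc.stop()
      }
    }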

Upvotes: 2

YoYo

Reputation: 9415

Anything that causes a repartitioning of your data (redistributing data among nodes) will always create a new stage. Repartitioning happens mainly when a new key is chosen for your RDD's rows; it also happens, of course, because of an explicit repartition.

You want to avoid introducing new stages when they are not needed, because each new stage implicitly means a reshuffle as well, and that is expensive.

The idea is that you partition your data in such a way that you use the maximum of your available resources (nodes and their CPUs), while also making sure that you do not introduce skew (where one node or CPU has more queued-up work than another).
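A small sketch of the explicit-repartition case (the object name, data, and partition counts are my own choices): the repartition call redistributes rows across the cluster, so it always introduces a shuffle and hence a new stage.

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition").setMaster("local[*]"))

        val data = sc.parallelize(1 to 100000, 4)

        // Explicit repartition: data is redistributed across partitions,
        // which means a shuffle and a new stage.
        val spread = data.repartition(16)

        println(s"before: ${data.getNumPartitions} partitions, after: ${spread.getNumPartitions}")
        println(spread.toDebugString)  // the ShuffledRDD marks the stage boundary

        sc.stop()
      }
    }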

Upvotes: 1

Apurva Singh

Reputation: 5010

Let us work through an example. Suppose we have a data set with the temperature in each city for every day of the past 10 years, e.g.:
New York -> [22.3, 22.4, 22.8, 32.3, ...........]
London -> [.................
Toronto -> [.................
The task is to convert these readings to Fahrenheit and then find the average for each city. This can be done as below:

  1. Data can be read on multiple nodes
  2. On each node, a map operation converts Celsius to Fahrenheit.
  3. Find the average temperature for each city.
     Step 2 can be done on a per-node basis. But to calculate the average we need a shuffle: the readings for New York could sit on multiple servers, and we have to bring all of New York's data onto one machine to compute its average. So there will be an aggregation operation such as groupByKey or aggregateByKey. Spark knows these are shuffling operations; that is how their code is written, to group data by some key, the city in this case. (A runnable sketch follows after this list.)
    See here for the complete list of shuffling operations.
    Also, an image attached to the original answer contrasted a narrow transformation (the map operation in step 2) with a wide transformation, i.e. a shuffle.
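Here is a runnable sketch of the example above (the sample rows, the object name, and the use of aggregateByKey to compute the per-city average are my assumptions, not part of the original answer): the mapValues step is narrow and runs per partition, while aggregateByKey shuffles by city and starts a new stage.

    import org.apache.spark.{SparkConf, SparkContext}

    object CityAverageDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("city-average").setMaster("local[*]"))

        // (city, dailyTempCelsius) rows; a tiny hypothetical sample standing in for 10 years of data
        val readings = sc.parallelize(Seq(
          ("New York", 22.3), ("New York", 22.4), ("New York", 32.3),
          ("London",   15.1), ("London",   16.8),
          ("Toronto",  10.2), ("Toronto",  12.9)
        ))

        // Step 2: narrow transformation, done independently on each partition (no shuffle)
        val fahrenheit = readings.mapValues(c => c * 9.0 / 5.0 + 32.0)

        // Step 3: wide transformation; all readings for a city must meet on one partition,
        // so Spark shuffles by key and starts a new stage here.
        val averages = fahrenheit
          .aggregateByKey((0.0, 0))(
            (acc, t) => (acc._1 + t, acc._2 + 1),        // fold a reading in within a partition
            (a, b)   => (a._1 + b._1, a._2 + b._2))      // merge partial sums across partitions
          .mapValues { case (sum, count) => sum / count }

        averages.collect().foreach { case (city, avg) => println(f"$city%-10s $avg%.1f") }
        sc.stop()
      }
    }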

Upvotes: 3
