Reputation: 27385
I'm learning Spark and have a question about job scheduling and shuffle dependencies. Here is the DAG I found:
As we can see, Stage 33 contains multiple operations: groupBy, join, groupBy, join. What I don't quite understand is why the two groupBy operations were put into the same stage. I thought groupBy requires shuffling, and that the DAGScheduler should split Stage 33 into two stages, each containing a single groupBy and join.
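This is not my exact code, but a pipeline of roughly this shape (with made-up names and keys) produces a similar DAG for me:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stage-demo").setMaster("local[*]"))

    // three parallelize inputs (one upstream stage each)
    val rdd1 = sc.parallelize(1 to 1000)
    val rdd2 = sc.parallelize(1 to 1000)
    val rdd3 = sc.parallelize(1 to 1000)

    // groupBy, join, groupBy, join chained together
    val result = rdd1.groupBy(_ % 10)
      .join(rdd2.keyBy(_ % 10))
      .join(rdd3.groupBy(_ % 10))

    result.count() // triggers the job; the DAG shows up in the Spark UI
    sc.stop()
  }
}
```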
Upvotes: 3
Views: 3626
Reputation:
Shuffling is the process of redistributing data across partitions (aka repartitioning), which may or may not move data between JVM processes or even over the wire (between executors on separate machines).
In your case, shuffling happens between the parallelize steps (Stages 30, 31 and 32) as inputs and the final Stage 33 as the destination.
Avoid shuffling at all cost. Think about ways to leverage existing partitions or to use Broadcast variables, and try to reduce data transfer.
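For example, here is a minimal sketch (made-up data, run e.g. in spark-shell where sc already exists) of replacing a shuffled join with a Broadcast variable: the small side is shipped to every executor once and the lookup happens map-side, so no shuffle is needed for that step.

```scala
// small lookup table, broadcast to every executor once
val small = Map(1 -> "a", 2 -> "b", 3 -> "c")
val smallBc = sc.broadcast(small)

// large keyed RDD that we don't want to repartition
val big = sc.parallelize(1 to 1000000).map(i => (i % 3 + 1, i))

// map-side "join" -- no shuffle stage appears in the DAG for this step
val joined = big.flatMap { case (k, v) =>
  smallBc.value.get(k).map(s => (k, (v, s)))
}
```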
You can read more about shuffling in Spark here.
Upvotes: 1
Reputation: 41
Shuffles here are represented as boundaries between stages:
Once the data has been shuffled and all shuffled RDDs use the same partitioner, the final join is a 1-1 dependency (if all parts are executed in the same action, it is also local due to co-location) and doesn't require an additional shuffle stage.
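A quick way to see this (rough sketch with made-up RDDs, e.g. in spark-shell where sc already exists): pre-shuffle both sides with the same HashPartitioner, and the join itself adds no new shuffle boundary.

```scala
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)

// both sides shuffled once, by the same partitioner
val left  = sc.parallelize(1 to 1000).map(i => (i % 10, i)).partitionBy(p)
val right = sc.parallelize(1 to 1000).map(i => (i % 10, -i)).partitionBy(p)

// reuses p, so the join is a narrow (1-1) dependency on both parents
val joined = left.join(right)

// the lineage shows the join in the same stage as the partitionBy outputs;
// there is no additional ShuffledRDD for the join itself
println(joined.toDebugString)
```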
Upvotes: 4